Model weights
Hello,
I submitted a request to download the model weights. Could you please provide an update on the request?
Thanks,
Rohan
Thanks for your interest, Rohan. Access should already be granted; could you please confirm?
Yes, I have access. Thank you!
Hi Siqi,
I am using this vision transformer code to create a vit_huge model:
https://github.com/facebookresearch/dino/blob/main/vision_transformer.py
def vit_huge(patch_size=16, **kwargs):
    model = VisionTransformer(
        patch_size=patch_size, embed_dim=1280, depth=32, num_heads=16, mlp_ratio=5.3375,
        qkv_bias=True, norm_layer=partial(nn.LayerNorm, eps=1e-6), **kwargs)
    return model
But I am getting a size mismatch: "size mismatch for blocks.0.mlp.fc2.weight: copying a param with shape torch.Size([1280, 3416]) from checkpoint, the shape in current model is torch.Size([1280, 6832])." The only change I make is passing patch_size=14.
Could you tell me what could be wrong in the architecture?
Thanks,
Rohan
Hi Rohan,
It looks like the DINOv1 code you're using does not use a SwiGLU MLP layer, which is why you see the shape mismatch error.
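For reference, the numbers in the error line up with that explanation (a quick sanity check in plain Python, nothing model-specific assumed):

embed_dim = 1280
mlp_ratio = 5.3375
hidden = int(embed_dim * mlp_ratio)  # 6832: fc1's output width in both designs
# A plain MLP feeds all 6832 hidden features into fc2, so the DINOv1 model
# builds fc2.weight with shape [1280, 6832].
# SwiGLU chunks fc1's output into a value half and a gate half before fc2,
# so fc2 only sees 6832 // 2 = 3416 inputs, matching the checkpoint's
# [1280, 3416].
print(hidden, hidden // 2)  # 6832 3416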
We have not validated this checkpoint on other ViT code bases, and it's highly recommended to use the timm library for loading this checkpoint, i.e.:
model = timm.create_model("hf-hub:paige-ai/Virchow", pretrained=True, mlp_layer=timm.layers.SwiGLUPacked, act_layer=torch.nn.SiLU)
(full example: https://huggingface.co/paige-ai/Virchow#image-embeddings)
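In case it helps, here is a minimal end-to-end sketch along the lines of that model card (the card itself is authoritative; this assumes you have access to the gated weights, and the random tensor stands in for a properly transformed 224x224 tile):

import timm
import torch

model = timm.create_model(
    "hf-hub:paige-ai/Virchow", pretrained=True,
    mlp_layer=timm.layers.SwiGLUPacked, act_layer=torch.nn.SiLU,
).eval()

# Stand-in input; for real tiles use the transforms from the model card
# (timm.data.resolve_data_config + create_transform).
x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    out = model(x)                   # [1, 257, 1280]: class token + 256 patch tokens
class_token = out[:, 0]              # [1, 1280]
patch_tokens = out[:, 1:]            # [1, 256, 1280]
# The card builds the tile embedding by concatenating the class token with
# the mean of the patch tokens.
embedding = torch.cat([class_token, patch_tokens.mean(1)], dim=-1)  # [1, 2560]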
If timm doesn't work for your use case, please let us know and we can try to help get the checkpoint working with other ViT code bases!
Hi Adam,
Yes, getting the checkpoint working with other ViT code bases would be helpful, as I am extracting features from a couple of pathology models and it would be good to keep things consistent.
In the case of the DINOv1 code base you'll have to swap out the Mlp layer class with an appropriate SwiGLU implementation. Something like this may work (although I haven't validated it):
import torch.nn as nn

class SwiGluMlp(nn.Module):
    """SwiGLU MLP with packed fc1, mirroring the layout of timm's SwiGLUPacked."""

    def __init__(
        self,
        in_features,
        hidden_features=None,
        out_features=None,
        act_layer=nn.SiLU,
        drop=0.,
        bias=True,
        gate_last=False,
    ):
        super().__init__()
        out_features = out_features or in_features
        hidden_features = hidden_features or in_features
        assert hidden_features % 2 == 0, "packed fc1 output must split into two halves"
        self.chunk_dim = -1
        self.gate_last = gate_last
        # fc1 packs the value and gate projections into one linear layer
        self.fc1 = nn.Linear(in_features, hidden_features, bias=bias)
        self.act = act_layer()
        self.drop1 = nn.Dropout(drop)  # applied after gating, as in timm's GluMlp
        # fc2 only sees half the hidden features, hence the checkpoint's
        # [out_features, hidden_features // 2] weight shape
        self.fc2 = nn.Linear(hidden_features // 2, out_features, bias=bias)
        self.drop2 = nn.Dropout(drop)

    def forward(self, x):
        x = self.fc1(x)
        # split the packed projection into the two SwiGLU branches
        x1, x2 = x.chunk(2, dim=self.chunk_dim)
        x = x1 * self.act(x2) if self.gate_last else self.act(x1) * x2
        x = self.drop1(x)
        x = self.fc2(x)
        x = self.drop2(x)
        return x
If this works, please do use the timm implementation as ground truth to verify the correctness of the model outputs; there may be other differences besides just this layer.
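One way to do that check at the layer level (a sketch assuming a recent timm; it only validates the MLP block in isolation, not the whole model):

import timm
import torch

torch.manual_seed(0)
ref = timm.layers.SwiGLUPacked(1280, 6832)    # timm's packed SwiGLU MLP
mine = SwiGluMlp(1280, hidden_features=6832)  # the sketch above
# Parameter names (fc1/fc2) should line up, so the weights transfer directly;
# strict=False tolerates any extra non-parameter entries across timm versions.
mine.load_state_dict(ref.state_dict(), strict=False)

x = torch.randn(2, 257, 1280)
with torch.no_grad():
    print(torch.allclose(mine(x), ref(x), atol=1e-6))  # expect True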