Model weights
Hello,
I submitted a request to download the model weights. Could you please provide an update on the request?
Thanks,
Rohan
Thanks for your interest, Rohan. Access should already be granted; could you please confirm?
Yes, I have access. Thank you!
Hi Siqi,
I am using this vision transformer code to create a vit_huge model:
https://github.com/facebookresearch/dino/blob/main/vision_transformer.py
def vit_huge(patch_size=16, **kwargs):
    model = VisionTransformer(
        patch_size=patch_size, embed_dim=1280, depth=32, num_heads=16, mlp_ratio=5.3375,
        qkv_bias=True, norm_layer=partial(nn.LayerNorm, eps=1e-6), **kwargs)
    return model
But I am getting a size mismatch: "size mismatch for blocks.0.mlp.fc2.weight: copying a param with shape torch.Size([1280, 3416]) from checkpoint, the shape in current model is torch.Size([1280, 6832])." The only change I make is passing patch_size=14.
Could you tell me what could be wrong in the architecture?
Thanks,
Rohan
Hi Rohan,
It looks like the DINOv1 code you're using does not use a SwiGLU MLP layer, which is why you see the shape mismatch error.
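For reference, the numbers in the error line up with that explanation (a quick sanity check in plain Python, nothing model-specific assumed):

embed_dim = 1280
mlp_ratio = 5.3375
hidden = int(embed_dim * mlp_ratio)  # 6832: fc1's output width in both designs
# A plain MLP feeds all 6832 hidden features into fc2, so the DINOv1 model
# builds fc2.weight with shape [1280, 6832].
# SwiGLU chunks fc1's output into a value half and a gate half before fc2,
# so fc2 only sees 6832 // 2 = 3416 inputs, matching the checkpoint's
# [1280, 3416].
print(hidden, hidden // 2)  # 6832 3416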
We have not validated this checkpoint on other ViT code bases, and it's highly recommended to use the timm library for loading this checkpoint, i.e.:
model = timm.create_model("hf-hub:paige-ai/Virchow", pretrained=True, mlp_layer=timm.layers.SwiGLUPacked, act_layer=torch.nn.SiLU)
(full example: https://huggingface.co/paige-ai/Virchow#image-embeddings)
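In case it helps, here is a minimal end-to-end sketch along the lines of that model card (the card itself is authoritative; this assumes you have access to the gated weights, and the random tensor stands in for a properly transformed 224x224 tile):

import timm
import torch

model = timm.create_model(
    "hf-hub:paige-ai/Virchow", pretrained=True,
    mlp_layer=timm.layers.SwiGLUPacked, act_layer=torch.nn.SiLU,
).eval()

# Stand-in input; for real tiles use the transforms from the model card
# (timm.data.resolve_data_config + create_transform).
x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    out = model(x)                   # [1, 257, 1280]: class token + 256 patch tokens
class_token = out[:, 0]              # [1, 1280]
patch_tokens = out[:, 1:]            # [1, 256, 1280]
# The card builds the tile embedding by concatenating the class token with
# the mean of the patch tokens.
embedding = torch.cat([class_token, patch_tokens.mean(1)], dim=-1)  # [1, 2560]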
If timm doesn't work for your use case, please let us know and we can try to help get the checkpoint working with other ViT code bases!
Hi Adam,
Yes, getting the checkpoint working with other ViT code bases would be helpful, as I am extracting features from a couple of pathology models and it would be good to keep things consistent.
In the case of the DINOv1 code base you'll have to swap out the Mlp layer class with an appropriate SwiGLU implementation. Something like this may work (although I haven't validated it):
import torch.nn as nn

class SwiGluMlp(nn.Module):
    """SwiGLU MLP with packed fc1, mirroring the layout of timm's SwiGLUPacked."""

    def __init__(
        self,
        in_features,
        hidden_features=None,
        out_features=None,
        act_layer=nn.SiLU,
        drop=0.,
        bias=True,
        gate_last=False,
    ):
        super().__init__()
        out_features = out_features or in_features
        hidden_features = hidden_features or in_features
        assert hidden_features % 2 == 0, "packed fc1 output must split into two halves"
        self.chunk_dim = -1
        self.gate_last = gate_last
        # fc1 packs the value and gate projections into one linear layer
        self.fc1 = nn.Linear(in_features, hidden_features, bias=bias)
        self.act = act_layer()
        self.drop1 = nn.Dropout(drop)  # applied after gating, as in timm's GluMlp
        # fc2 only sees half the hidden features, hence the checkpoint's
        # [out_features, hidden_features // 2] weight shape
        self.fc2 = nn.Linear(hidden_features // 2, out_features, bias=bias)
        self.drop2 = nn.Dropout(drop)

    def forward(self, x):
        x = self.fc1(x)
        # split the packed projection into the two SwiGLU branches
        x1, x2 = x.chunk(2, dim=self.chunk_dim)
        x = x1 * self.act(x2) if self.gate_last else self.act(x1) * x2
        x = self.drop1(x)
        x = self.fc2(x)
        x = self.drop2(x)
        return x
If this works, please do use the timm implementation as ground truth to verify the correctness of the model outputs; there may be other differences besides just this layer.
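One way to do that check at the layer level (a sketch assuming a recent timm; it only validates the MLP block in isolation, not the whole model):

import timm
import torch

torch.manual_seed(0)
ref = timm.layers.SwiGLUPacked(1280, 6832)    # timm's packed SwiGLU MLP
mine = SwiGluMlp(1280, hidden_features=6832)  # the sketch above
# Parameter names (fc1/fc2) should line up, so the weights transfer directly;
# strict=False tolerates any extra non-parameter entries across timm versions.
mine.load_state_dict(ref.state_dict(), strict=False)

x = torch.randn(2, 257, 1280)
with torch.no_grad():
    print(torch.allclose(mine(x), ref(x), atol=1e-6))  # expect True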