Incompatible Keys error when loading checkpoints

#4
by Dheeraj700 - opened

I initialized my OpenFlamingo model with "ViT-L-14" and OpenLLaMA-7B. When I try to load the checkpoint I get the warning below, and afterward, during inference, OpenFlamingo produces garbage results.

from huggingface_hub import hf_hub_download
import torch

checkpoint_path = hf_hub_download("openflamingo/OpenFlamingo-9B", "checkpoint.pt")
model.load_state_dict(torch.load(checkpoint_path), strict=False)

Output:
_IncompatibleKeys(
    missing_keys=['vision_encoder.positional_embedding', 'vision_encoder.text_projection', 'vision_encoder.logit_scale',
        'vision_encoder.visual.class_embedding', 'vision_encoder.visual.positional_embedding', 'vision_encoder.visual.proj',
        'vision_encoder.visual.conv1.weight', 'vision_encoder.visual.ln_pre.weight', 'vision_encoder.visual.ln_pre.bias',
        [... ln_1/ln_2, attention, and MLP weights and biases for vision_encoder.visual.transformer.resblocks.0 through .23 ...]
        'vision_encoder.visual.ln_post.weight', 'vision_encoder.visual.ln_post.bias',
        [... the same parameter pattern for vision_encoder.transformer.resblocks.0 through .11 ...]
        'vision_encoder.token_embedding.weight', 'vision_encoder.ln_final.weight', 'vision_encoder.ln_final.bias',
        [... q/k/v/o_proj, gate/down/up_proj, and both layernorm weights for lang_encoder.model.layers.0 through .31, each under a decoder_layer submodule ...]
        'lang_encoder.model.norm.weight', 'lang_encoder.lm_head.weight'],
    unexpected_keys=['vision_encoder.text_model.embeddings.position_ids', 'vision_encoder.vision_model.embeddings.position_ids'])

openflamingo org

Can you share your code for initializing the model?

I am now using LLaMA, which I initialize from the Hugging Face model huggyllama/llama-7b:

from open_flamingo import create_model_and_transforms

model, image_processor, tokenizer = create_model_and_transforms(
    clip_vision_encoder_path="ViT-L-14",
    clip_vision_encoder_pretrained="openai",
    lang_encoder_path="huggyllama/llama-7b",
    tokenizer_path="huggyllama/llama-7b",
    cross_attn_every_n_layers=4,
)

This is expected, because these two lines

checkpoint_path = hf_hub_download("openflamingo/OpenFlamingo-9B", "checkpoint.pt")
model.load_state_dict(torch.load(checkpoint_path), strict=False)

are only used to load the trainable parameters, i.e. the Flamingo perceiver resampler and the gated cross-attention layers.
The warning therefore lists the lang_encoder and vision_encoder parameters as missing, not the perceiver's.
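You can confirm what the checkpoint actually stores by listing its keys. Here is a minimal sketch; the module names mentioned in the comment are an assumption and may differ between OpenFlamingo releases:

import torch
from huggingface_hub import hf_hub_download

checkpoint_path = hf_hub_download("openflamingo/OpenFlamingo-9B", "checkpoint.pt")
state_dict = torch.load(checkpoint_path, map_location="cpu")

# Print the first two components of every key. Expect trainable modules
# such as 'perceiver.layers' or 'lang_encoder.gated_cross_attn_layers',
# and none of the frozen CLIP / LLaMA backbone weights.
print(sorted({".".join(key.split(".")[:2]) for key in state_dict}))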
The lang_encoder and vision_encoder weights themselves are loaded by:

model, image_processor, tokenizer = create_model_and_transforms(
    clip_vision_encoder_path="ViT-L-14",
    clip_vision_encoder_pretrained="openai",
    lang_encoder_path="huggyllama/llama-7b",
    tokenizer_path="huggyllama/llama-7b",
    cross_attn_every_n_layers=4,
)

So this warning is an indication that loading finished as expected.
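If you want an explicit check that the strict=False load behaved as expected, inspect the _IncompatibleKeys value it returns: every "missing" key should belong to one of the two frozen backbones, which create_model_and_transforms has already initialized. A minimal sketch, assuming the model and checkpoint_path from above:

result = model.load_state_dict(
    torch.load(checkpoint_path, map_location="cpu"), strict=False
)

# The frozen backbones were already initialized from their pretrained
# weights, so they are allowed to be absent from checkpoint.pt.
assert all(
    key.startswith(("vision_encoder.", "lang_encoder."))
    for key in result.missing_keys
)
print(f"OK: {len(result.missing_keys)} missing keys, all in the frozen backbones")

If keys for the perceiver or the gated cross-attention layers ever show up as missing instead, the checkpoint and the model definition genuinely do not match.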
