Weights not used when initializing the model

#70
by nihalnayak - opened

I started getting this error today after some changes were made to the phi model: the model does not use all of the weights from the checkpoint.

In [2]: model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1_5", torch_dtype="auto", device_map="cuda", trust_remote_code=True)
pytorch_model.bin: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2.84G/2.84G [00:24<00:00, 116MB/s]
Some weights of the model checkpoint at microsoft/phi-1_5 were not used when initializing PhiForCausalLM: ['model.layers.11.self_attn.q_proj.bias', 'model.layers.22.self_attn.k_proj.bias', 'model.layers.12.self_attn.q_proj.bias', 'model.layers.9.self_attn.k_proj.bias', 'model.layers.10.self_attn.q_proj.weight', 'model.layers.22.self_attn.q_proj.bias', 'model.layers.16.self_attn.v_proj.weight', 'model.layers.2.self_attn.k_proj.bias', 'model.layers.22.self_attn.v_proj.bias', 'model.layers.15.self_attn.v_proj.bias', 'model.layers.9.self_attn.k_proj.weight', 'model.layers.5.self_attn.q_proj.weight', 'model.layers.11.self_attn.v_proj.weight', 'model.layers.9.self_attn.q_proj.bias', 'model.layers.20.self_attn.k_proj.weight', 'model.layers.0.self_attn.q_proj.bias', 'model.layers.3.self_attn.k_proj.bias', 'model.layers.8.self_attn.q_proj.bias', 'model.layers.8.self_attn.k_proj.weight', 'model.layers.9.self_attn.v_proj.bias', 'model.layers.21.self_attn.q_proj.weight', 'model.layers.1.self_attn.k_proj.weight', 'model.layers.3.self_attn.k_proj.weight', 'model.layers.12.self_attn.k_proj.bias', 'model.layers.8.self_attn.q_proj.weight', 'model.layers.17.self_attn.k_proj.bias', 'model.layers.7.self_attn.v_proj.bias', 'model.layers.13.self_attn.v_proj.bias', 'model.layers.20.self_attn.q_proj.bias', 'model.layers.9.self_attn.v_proj.weight', 'model.layers.4.self_attn.v_proj.bias', 'model.layers.7.self_attn.q_proj.weight', 'model.layers.17.self_attn.q_proj.bias', 'model.layers.16.self_attn.q_proj.bias', 'model.layers.19.self_attn.v_proj.bias', 'model.layers.21.self_attn.v_proj.weight', 'model.layers.5.self_attn.q_proj.bias', 'model.layers.15.self_attn.k_proj.weight', 'model.layers.6.self_attn.k_proj.bias', 'model.layers.15.self_attn.q_proj.weight', 'model.layers.7.self_attn.q_proj.bias', 'model.layers.6.self_attn.q_proj.weight', 'model.layers.17.self_attn.v_proj.weight', 'model.layers.21.self_attn.k_proj.bias', 'model.layers.3.self_attn.q_proj.weight', 
'model.layers.10.self_attn.v_proj.weight', 'model.layers.10.self_attn.v_proj.bias', 'model.layers.16.self_attn.k_proj.bias', 'model.layers.2.self_attn.k_proj.weight', 'model.layers.12.self_attn.q_proj.weight', 'model.layers.14.self_attn.k_proj.weight', 'model.layers.14.self_attn.q_proj.weight', 'model.layers.19.self_attn.q_proj.bias', 'model.layers.4.self_attn.v_proj.weight', 'model.layers.18.self_attn.q_proj.bias', 'model.layers.7.self_attn.v_proj.weight', 'model.layers.19.self_attn.q_proj.weight', 'model.layers.3.self_attn.v_proj.weight', 'model.layers.11.self_attn.v_proj.bias', 'model.layers.6.self_attn.q_proj.bias', 'model.layers.18.self_attn.v_proj.bias', 'model.layers.16.self_attn.q_proj.weight', 'model.layers.2.self_attn.v_proj.weight', 'model.layers.1.self_attn.v_proj.bias', 'model.layers.4.self_attn.k_proj.weight', 'model.layers.17.self_attn.k_proj.weight', 'model.layers.5.self_attn.v_proj.bias', 'model.layers.14.self_attn.k_proj.bias', 'model.layers.1.self_attn.q_proj.bias', 'model.layers.20.self_attn.v_proj.bias', 'model.layers.23.self_attn.v_proj.weight', 'model.layers.20.self_attn.v_proj.weight', 'model.layers.0.self_attn.k_proj.weight', 'model.layers.8.self_attn.v_proj.bias', 'model.layers.14.self_attn.v_proj.weight', 'model.layers.23.self_attn.q_proj.weight', 'model.layers.17.self_attn.q_proj.weight', 'model.layers.7.self_attn.k_proj.weight', 'model.layers.13.self_attn.q_proj.bias', 'model.layers.15.self_attn.k_proj.bias', 'model.layers.20.self_attn.q_proj.weight', 'model.layers.6.self_attn.v_proj.weight', 'model.layers.19.self_attn.k_proj.bias', 'model.layers.12.self_attn.v_proj.weight', 'model.layers.0.self_attn.k_proj.bias', 'model.layers.18.self_attn.q_proj.weight', 'model.layers.12.self_attn.v_proj.bias', 'model.layers.16.self_attn.v_proj.bias', 'model.layers.1.self_attn.q_proj.weight', 'model.layers.17.self_attn.v_proj.bias', 'model.layers.21.self_attn.q_proj.bias', 'model.layers.22.self_attn.k_proj.weight', 
'model.layers.3.self_attn.q_proj.bias', 'model.layers.11.self_attn.k_proj.weight', 'model.layers.0.self_attn.q_proj.weight', 'model.layers.23.self_attn.k_proj.weight', 'model.layers.10.self_attn.q_proj.bias', 'model.layers.18.self_attn.v_proj.weight', 'model.layers.22.self_attn.v_proj.weight', 'model.layers.23.self_attn.q_proj.bias', 'model.layers.22.self_attn.q_proj.weight', 'model.layers.8.self_attn.k_proj.bias', 'model.layers.11.self_attn.q_proj.weight', 'model.layers.8.self_attn.v_proj.weight', 'model.layers.18.self_attn.k_proj.weight', 'model.layers.2.self_attn.q_proj.bias', 'model.layers.1.self_attn.v_proj.weight', 'model.layers.11.self_attn.k_proj.bias', 'model.layers.13.self_attn.v_proj.weight', 'model.layers.0.self_attn.v_proj.weight', 'model.layers.5.self_attn.k_proj.weight', 'model.layers.9.self_attn.q_proj.weight', 'model.layers.18.self_attn.k_proj.bias', 'model.layers.14.self_attn.v_proj.bias', 'model.layers.15.self_attn.q_proj.bias', 'model.layers.21.self_attn.k_proj.weight', 'model.layers.12.self_attn.k_proj.weight', 'model.layers.15.self_attn.v_proj.weight', 'model.layers.13.self_attn.k_proj.bias', 'model.layers.7.self_attn.k_proj.bias', 'model.layers.13.self_attn.q_proj.weight', 'model.layers.21.self_attn.v_proj.bias', 'model.layers.4.self_attn.q_proj.weight', 'model.layers.16.self_attn.k_proj.weight', 'model.layers.23.self_attn.k_proj.bias', 'model.layers.6.self_attn.v_proj.bias', 'model.layers.19.self_attn.v_proj.weight', 'model.layers.20.self_attn.k_proj.bias', 'model.layers.1.self_attn.k_proj.bias', 'model.layers.10.self_attn.k_proj.bias', 'model.layers.5.self_attn.v_proj.weight', 'model.layers.23.self_attn.v_proj.bias', 'model.layers.2.self_attn.q_proj.weight', 'model.layers.13.self_attn.k_proj.weight', 'model.layers.2.self_attn.v_proj.bias', 'model.layers.14.self_attn.q_proj.bias', 'model.layers.10.self_attn.k_proj.weight', 'model.layers.3.self_attn.v_proj.bias', 'model.layers.4.self_attn.q_proj.bias', 
'model.layers.19.self_attn.k_proj.weight', 'model.layers.4.self_attn.k_proj.bias', 'model.layers.0.self_attn.v_proj.bias', 'model.layers.6.self_attn.k_proj.weight', 'model.layers.5.self_attn.k_proj.bias']
- This IS expected if you are initializing PhiForCausalLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing PhiForCausalLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of PhiForCausalLM were not initialized from the model checkpoint at microsoft/phi-1_5 and are newly initialized: ['model.layers.12.self_attn.query_key_value.weight', 'model.layers.7.self_attn.query_key_value.weight', 'model.layers.15.self_attn.query_key_value.bias', 'model.layers.18.self_attn.query_key_value.bias', 'model.layers.21.self_attn.query_key_value.weight', 'model.layers.6.self_attn.query_key_value.bias', 'model.layers.18.self_attn.query_key_value.weight', 'model.layers.17.self_attn.query_key_value.weight', 'model.layers.4.self_attn.query_key_value.bias', 'model.layers.4.self_attn.query_key_value.weight', 'model.layers.8.self_attn.query_key_value.weight', 'model.layers.16.self_attn.query_key_value.bias', 'model.layers.19.self_attn.query_key_value.weight', 'model.layers.21.self_attn.query_key_value.bias', 'model.layers.7.self_attn.query_key_value.bias', 'model.layers.3.self_attn.query_key_value.weight', 'model.layers.2.self_attn.query_key_value.bias', 'model.layers.17.self_attn.query_key_value.bias', 'model.layers.9.self_attn.query_key_value.weight', 'model.layers.13.self_attn.query_key_value.weight', 'model.layers.6.self_attn.query_key_value.weight', 'model.layers.1.self_attn.query_key_value.weight', 'model.layers.22.self_attn.query_key_value.weight', 'model.layers.2.self_attn.query_key_value.weight', 'model.layers.23.self_attn.query_key_value.bias', 'model.layers.0.self_attn.query_key_value.bias', 'model.layers.15.self_attn.query_key_value.weight', 'model.layers.10.self_attn.query_key_value.weight', 'model.layers.23.self_attn.query_key_value.weight', 'model.layers.0.self_attn.query_key_value.weight', 'model.layers.5.self_attn.query_key_value.weight', 'model.layers.5.self_attn.query_key_value.bias', 'model.layers.22.self_attn.query_key_value.bias', 'model.layers.11.self_attn.query_key_value.bias', 'model.layers.10.self_attn.query_key_value.bias', 'model.layers.19.self_attn.query_key_value.bias', 
'model.layers.14.self_attn.query_key_value.weight', 'model.layers.8.self_attn.query_key_value.bias', 'model.layers.20.self_attn.query_key_value.bias', 'model.layers.9.self_attn.query_key_value.bias', 'model.layers.16.self_attn.query_key_value.weight', 'model.layers.14.self_attn.query_key_value.bias', 'model.layers.12.self_attn.query_key_value.bias', 'model.layers.20.self_attn.query_key_value.weight', 'model.layers.13.self_attn.query_key_value.bias', 'model.layers.3.self_attn.query_key_value.bias', 'model.layers.11.self_attn.query_key_value.weight', 'model.layers.1.self_attn.query_key_value.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
generation_config.json: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 74.0/74.0 [00:00<00:00, 70.9kB/s]

Phi in the transformers library has a different "architecture": for one, instead of separate q_proj/k_proj/v_proj projections it has a single query_key_value projection.
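This is not the repository's actual modeling code — just a minimal numpy sketch of why the two layouts clash in the state dict: a fused query_key_value weight is the three separate projection weights stacked along the output dimension, so the parameters are the same but the key names and shapes differ.

```python
import numpy as np

# Sketch: a fused query_key_value projection vs. separate q/k/v projections.
rng = np.random.default_rng(0)
hidden = 8
Wq, Wk, Wv = (rng.standard_normal((hidden, hidden)) for _ in range(3))
x = rng.standard_normal(hidden)

# The fused weight is just the separate projections concatenated
# along the output dimension (3*hidden, hidden).
W_qkv = np.concatenate([Wq, Wk, Wv], axis=0)

# One fused matmul, then a split, matches three separate matmuls.
q_out, k_out, v_out = np.split(W_qkv @ x, 3)
assert np.allclose(q_out, Wq @ x)
assert np.allclose(k_out, Wk @ x)
assert np.allclose(v_out, Wv @ x)
```

Because the parameters are equivalent, a checkpoint in one layout can in principle be converted to the other — but a loader that expects `query_key_value` keys will simply ignore `q_proj`/`k_proj`/`v_proj` keys, which is exactly the warning above.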

The simplest solution is to change the config so that it uses the provided files.

This worked for me; some steps might be redundant:

  • I copied modeling_phi to modeling_phi_1_5.py and configuration_phi to configuration_phi_1_5.py to prevent a filename collision with transformers, in case it checks for one
  • Added this into config.json:
    "auto_map": {
      "AutoConfig": "configuration_phi_1_5.PhiConfig",
      "AutoModelForCausalLM": "modeling_phi_1_5.PhiForCausalLM"
    },
  • Changed model_type to "model_type": "phi_1_5" (I think transformers didn't try to load the custom code without this change)
  • Changed architectures to "PhiForCausalLM_1_5" (I didn't change the .py file beyond renaming it)
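Putting the edits above together, the touched part of config.json would look roughly like this (file and class names as in the steps above; the rest of the config stays unchanged):

```json
{
  "architectures": ["PhiForCausalLM_1_5"],
  "model_type": "phi_1_5",
  "auto_map": {
    "AutoConfig": "configuration_phi_1_5.PhiConfig",
    "AutoModelForCausalLM": "modeling_phi_1_5.PhiForCausalLM"
  }
}
```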

After those changes the model loaded successfully.

Write a detailed analogy between mathematics and a lighthouse.

Answer: Mathematics is like a lighthouse, guiding us through the complex world of numbers and calculations. It illumin<MAX_NEW_TOKENS_REACHED>

(Interestingly, even with do_sample=False I get a different result from the one in the model card: "Mathematics is like a lighthouse, guiding us through the vast ocean of numbers and calculations. Just as a lighthouse illuminates...".)

Microsoft org

Hello @nihalnayak !

We just pushed a fix to config.json and it should work now. The auto_map key was missing, so the files in this repository were not being used properly when trust_remote_code=True.

Best regards,
Gustavo.

gugarosa changed discussion status to closed
