Weights not used when initializing the model
I started getting these warnings today, after some changes were made to the phi model: the model does not use all of the weights from the checkpoint.
In [2]: model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1_5", torch_dtype="auto", device_map="cuda", trust_remote_code=True)
pytorch_model.bin: 100%|██████████| 2.84G/2.84G [00:24<00:00, 116MB/s]
Some weights of the model checkpoint at microsoft/phi-1_5 were not used when initializing PhiForCausalLM: ['model.layers.11.self_attn.q_proj.bias', 'model.layers.22.self_attn.k_proj.bias', 'model.layers.12.self_attn.q_proj.bias', 'model.layers.9.self_attn.k_proj.bias', 'model.layers.10.self_attn.q_proj.weight', 'model.layers.22.self_attn.q_proj.bias', 'model.layers.16.self_attn.v_proj.weight', 'model.layers.2.self_attn.k_proj.bias', 'model.layers.22.self_attn.v_proj.bias', 'model.layers.15.self_attn.v_proj.bias', 'model.layers.9.self_attn.k_proj.weight', 'model.layers.5.self_attn.q_proj.weight', 'model.layers.11.self_attn.v_proj.weight', 'model.layers.9.self_attn.q_proj.bias', 'model.layers.20.self_attn.k_proj.weight', 'model.layers.0.self_attn.q_proj.bias', 'model.layers.3.self_attn.k_proj.bias', 'model.layers.8.self_attn.q_proj.bias', 'model.layers.8.self_attn.k_proj.weight', 'model.layers.9.self_attn.v_proj.bias', 'model.layers.21.self_attn.q_proj.weight', 'model.layers.1.self_attn.k_proj.weight', 'model.layers.3.self_attn.k_proj.weight', 'model.layers.12.self_attn.k_proj.bias', 'model.layers.8.self_attn.q_proj.weight', 'model.layers.17.self_attn.k_proj.bias', 'model.layers.7.self_attn.v_proj.bias', 'model.layers.13.self_attn.v_proj.bias', 'model.layers.20.self_attn.q_proj.bias', 'model.layers.9.self_attn.v_proj.weight', 'model.layers.4.self_attn.v_proj.bias', 'model.layers.7.self_attn.q_proj.weight', 'model.layers.17.self_attn.q_proj.bias', 'model.layers.16.self_attn.q_proj.bias', 'model.layers.19.self_attn.v_proj.bias', 'model.layers.21.self_attn.v_proj.weight', 'model.layers.5.self_attn.q_proj.bias', 'model.layers.15.self_attn.k_proj.weight', 'model.layers.6.self_attn.k_proj.bias', 'model.layers.15.self_attn.q_proj.weight', 'model.layers.7.self_attn.q_proj.bias', 'model.layers.6.self_attn.q_proj.weight', 'model.layers.17.self_attn.v_proj.weight', 'model.layers.21.self_attn.k_proj.bias', 'model.layers.3.self_attn.q_proj.weight', 'model.layers.10.self_attn.v_proj.weight', 'model.layers.10.self_attn.v_proj.bias', 'model.layers.16.self_attn.k_proj.bias', 'model.layers.2.self_attn.k_proj.weight', 'model.layers.12.self_attn.q_proj.weight', 'model.layers.14.self_attn.k_proj.weight', 'model.layers.14.self_attn.q_proj.weight', 'model.layers.19.self_attn.q_proj.bias', 'model.layers.4.self_attn.v_proj.weight', 'model.layers.18.self_attn.q_proj.bias', 'model.layers.7.self_attn.v_proj.weight', 'model.layers.19.self_attn.q_proj.weight', 'model.layers.3.self_attn.v_proj.weight', 'model.layers.11.self_attn.v_proj.bias', 'model.layers.6.self_attn.q_proj.bias', 'model.layers.18.self_attn.v_proj.bias', 'model.layers.16.self_attn.q_proj.weight', 'model.layers.2.self_attn.v_proj.weight', 'model.layers.1.self_attn.v_proj.bias', 'model.layers.4.self_attn.k_proj.weight', 'model.layers.17.self_attn.k_proj.weight', 'model.layers.5.self_attn.v_proj.bias', 'model.layers.14.self_attn.k_proj.bias', 'model.layers.1.self_attn.q_proj.bias', 'model.layers.20.self_attn.v_proj.bias', 'model.layers.23.self_attn.v_proj.weight', 'model.layers.20.self_attn.v_proj.weight', 'model.layers.0.self_attn.k_proj.weight', 'model.layers.8.self_attn.v_proj.bias', 'model.layers.14.self_attn.v_proj.weight', 'model.layers.23.self_attn.q_proj.weight', 'model.layers.17.self_attn.q_proj.weight', 'model.layers.7.self_attn.k_proj.weight', 'model.layers.13.self_attn.q_proj.bias', 'model.layers.15.self_attn.k_proj.bias', 'model.layers.20.self_attn.q_proj.weight', 'model.layers.6.self_attn.v_proj.weight', 
'model.layers.19.self_attn.k_proj.bias', 'model.layers.12.self_attn.v_proj.weight', 'model.layers.0.self_attn.k_proj.bias', 'model.layers.18.self_attn.q_proj.weight', 'model.layers.12.self_attn.v_proj.bias', 'model.layers.16.self_attn.v_proj.bias', 'model.layers.1.self_attn.q_proj.weight', 'model.layers.17.self_attn.v_proj.bias', 'model.layers.21.self_attn.q_proj.bias', 'model.layers.22.self_attn.k_proj.weight', 'model.layers.3.self_attn.q_proj.bias', 'model.layers.11.self_attn.k_proj.weight', 'model.layers.0.self_attn.q_proj.weight', 'model.layers.23.self_attn.k_proj.weight', 'model.layers.10.self_attn.q_proj.bias', 'model.layers.18.self_attn.v_proj.weight', 'model.layers.22.self_attn.v_proj.weight', 'model.layers.23.self_attn.q_proj.bias', 'model.layers.22.self_attn.q_proj.weight', 'model.layers.8.self_attn.k_proj.bias', 'model.layers.11.self_attn.q_proj.weight', 'model.layers.8.self_attn.v_proj.weight', 'model.layers.18.self_attn.k_proj.weight', 'model.layers.2.self_attn.q_proj.bias', 'model.layers.1.self_attn.v_proj.weight', 'model.layers.11.self_attn.k_proj.bias', 'model.layers.13.self_attn.v_proj.weight', 'model.layers.0.self_attn.v_proj.weight', 'model.layers.5.self_attn.k_proj.weight', 'model.layers.9.self_attn.q_proj.weight', 'model.layers.18.self_attn.k_proj.bias', 'model.layers.14.self_attn.v_proj.bias', 'model.layers.15.self_attn.q_proj.bias', 'model.layers.21.self_attn.k_proj.weight', 'model.layers.12.self_attn.k_proj.weight', 'model.layers.15.self_attn.v_proj.weight', 'model.layers.13.self_attn.k_proj.bias', 'model.layers.7.self_attn.k_proj.bias', 'model.layers.13.self_attn.q_proj.weight', 'model.layers.21.self_attn.v_proj.bias', 'model.layers.4.self_attn.q_proj.weight', 'model.layers.16.self_attn.k_proj.weight', 'model.layers.23.self_attn.k_proj.bias', 'model.layers.6.self_attn.v_proj.bias', 'model.layers.19.self_attn.v_proj.weight', 'model.layers.20.self_attn.k_proj.bias', 'model.layers.1.self_attn.k_proj.bias', 'model.layers.10.self_attn.k_proj.bias', 'model.layers.5.self_attn.v_proj.weight', 'model.layers.23.self_attn.v_proj.bias', 'model.layers.2.self_attn.q_proj.weight', 'model.layers.13.self_attn.k_proj.weight', 'model.layers.2.self_attn.v_proj.bias', 'model.layers.14.self_attn.q_proj.bias', 'model.layers.10.self_attn.k_proj.weight', 'model.layers.3.self_attn.v_proj.bias', 'model.layers.4.self_attn.q_proj.bias', 'model.layers.19.self_attn.k_proj.weight', 'model.layers.4.self_attn.k_proj.bias', 'model.layers.0.self_attn.v_proj.bias', 'model.layers.6.self_attn.k_proj.weight', 'model.layers.5.self_attn.k_proj.bias']
- This IS expected if you are initializing PhiForCausalLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing PhiForCausalLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of PhiForCausalLM were not initialized from the model checkpoint at microsoft/phi-1_5 and are newly initialized: ['model.layers.12.self_attn.query_key_value.weight', 'model.layers.7.self_attn.query_key_value.weight', 'model.layers.15.self_attn.query_key_value.bias', 'model.layers.18.self_attn.query_key_value.bias', 'model.layers.21.self_attn.query_key_value.weight', 'model.layers.6.self_attn.query_key_value.bias', 'model.layers.18.self_attn.query_key_value.weight', 'model.layers.17.self_attn.query_key_value.weight', 'model.layers.4.self_attn.query_key_value.bias', 'model.layers.4.self_attn.query_key_value.weight', 'model.layers.8.self_attn.query_key_value.weight', 'model.layers.16.self_attn.query_key_value.bias', 'model.layers.19.self_attn.query_key_value.weight', 'model.layers.21.self_attn.query_key_value.bias', 'model.layers.7.self_attn.query_key_value.bias', 'model.layers.3.self_attn.query_key_value.weight', 'model.layers.2.self_attn.query_key_value.bias', 'model.layers.17.self_attn.query_key_value.bias', 'model.layers.9.self_attn.query_key_value.weight', 'model.layers.13.self_attn.query_key_value.weight', 'model.layers.6.self_attn.query_key_value.weight', 'model.layers.1.self_attn.query_key_value.weight', 'model.layers.22.self_attn.query_key_value.weight', 'model.layers.2.self_attn.query_key_value.weight', 'model.layers.23.self_attn.query_key_value.bias', 'model.layers.0.self_attn.query_key_value.bias', 'model.layers.15.self_attn.query_key_value.weight', 'model.layers.10.self_attn.query_key_value.weight', 'model.layers.23.self_attn.query_key_value.weight', 'model.layers.0.self_attn.query_key_value.weight', 'model.layers.5.self_attn.query_key_value.weight', 'model.layers.5.self_attn.query_key_value.bias', 'model.layers.22.self_attn.query_key_value.bias', 'model.layers.11.self_attn.query_key_value.bias', 'model.layers.10.self_attn.query_key_value.bias', 'model.layers.19.self_attn.query_key_value.bias', 'model.layers.14.self_attn.query_key_value.weight', 'model.layers.8.self_attn.query_key_value.bias', 'model.layers.20.self_attn.query_key_value.bias', 'model.layers.9.self_attn.query_key_value.bias', 'model.layers.16.self_attn.query_key_value.weight', 'model.layers.14.self_attn.query_key_value.bias', 'model.layers.12.self_attn.query_key_value.bias', 'model.layers.20.self_attn.query_key_value.weight', 'model.layers.13.self_attn.query_key_value.bias', 'model.layers.3.self_attn.query_key_value.bias', 'model.layers.11.self_attn.query_key_value.weight', 'model.layers.1.self_attn.query_key_value.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
generation_config.json: 100%|██████████| 74.0/74.0 [00:00<00:00, 70.9kB/s]
Phi in the transformers library has a different "architecture": for one thing, instead of separate q_proj/k_proj/v_proj projections it has a single fused query_key_value.
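To make the mismatch concrete: the unused q_proj/k_proj/v_proj tensors in the checkpoint correspond to the newly initialized query_key_value tensors, roughly by concatenation. The sketch below is illustrative only; the exact head interleaving of the fused layer depends on the modeling code, so don't treat it as a verified conversion.

```python
import torch

# Illustrative sketch: fuse the checkpoint's separate q/k/v projection tensors
# into single query_key_value tensors, assuming plain concatenation along the
# output dimension. The real fused layout may interleave heads differently,
# so check the modeling file before relying on this.
sd = torch.load("pytorch_model.bin", map_location="cpu")
for i in range(24):  # the warning lists layers 0-23, i.e. 24 decoder layers
    prefix = f"model.layers.{i}.self_attn"
    for kind in ("weight", "bias"):
        q = sd.pop(f"{prefix}.q_proj.{kind}")
        k = sd.pop(f"{prefix}.k_proj.{kind}")
        v = sd.pop(f"{prefix}.v_proj.{kind}")
        sd[f"{prefix}.query_key_value.{kind}"] = torch.cat([q, k, v], dim=0)
```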
The simplest solution is to change the config so that it uses the modeling files provided in this repository.
This worked for me; some steps might be redundant:
- I copied modeling_phi.py to modeling_phi_1_5.py and configuration_phi.py to configuration_phi_1_5.py, to prevent a filename collision with transformers in case it checks for one.
- Added this to config.json:
  "auto_map": {
      "AutoConfig": "configuration_phi_1_5.PhiConfig",
      "AutoModelForCausalLM": "modeling_phi_1_5.PhiForCausalLM"
  },
- Changed model_type to "phi_1_5" (I think without this change transformers didn't try to load the custom code).
- Changed architectures to "PhiForCausalLM_1_5" (I didn't change the .py files beyond renaming them).
After those changes the model loaded successfully; a consolidated config.json excerpt is sketched below.
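For reference, after the edits above the relevant portion of my config.json looked roughly like this (exact names depend on how you rename the files and classes, so treat it as a sketch rather than the official fix):

```json
{
  "model_type": "phi_1_5",
  "architectures": ["PhiForCausalLM_1_5"],
  "auto_map": {
    "AutoConfig": "configuration_phi_1_5.PhiConfig",
    "AutoModelForCausalLM": "modeling_phi_1_5.PhiForCausalLM"
  }
}
```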
Prompt: Write a detailed analogy between mathematics and a lighthouse.
Answer: Mathematics is like a lighthouse, guiding us through the complex world of numbers and calculations. It illumin<MAX_NEW_TOKENS_REACHED>
(Interestingly, even with do_sample=False I get a different result from the one in the model card: "Mathematics is like a lighthouse, guiding us through the vast ocean of numbers and calculations. Just as a lighthouse illuminates...".)
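For completeness, this is roughly the model-card style snippet I used for the prompt above; the generation settings are a sketch, and max_length=200 in particular is an assumed value rather than the exact one I ran with:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-1_5", torch_dtype="auto", device_map="cuda", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1_5", trust_remote_code=True)

# Prompt taken from the model card; max_length is an assumed setting.
inputs = tokenizer(
    "Write a detailed analogy between mathematics and a lighthouse.",
    return_tensors="pt",
    return_attention_mask=False,
).to(model.device)

outputs = model.generate(**inputs, max_length=200, do_sample=False)
print(tokenizer.batch_decode(outputs)[0])
```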
Hello @nihalnayak!
We just pushed a fix to the config.json and it should work now. The auto_map key was missing, so the files in this repository were not being picked up properly when trust_remote_code=True.
Best regards,
Gustavo.