Weights not used when initializing the model

#70

by nihalnayak - opened Jan 10

Jan 10

Started getting this error today after some changes were made to the phi model. The model does not use all the weights from the checkpoint.

In [2]: model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1_5", torch_dtype="auto", device_map="cuda", trust_remote_code=True)
pytorch_model.bin: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.84G/2.84G [00:24<00:00, 116MB/s]
Some weights of the model checkpoint at microsoft/phi-1_5 were not used when initializing PhiForCausalLM: ['model.layers.11.self_attn.q_proj.bias', 'model.layers.22.self_attn.k_proj.bias', 'model.layers.12.self_attn.q_proj.bias', 'model.layers.9.self_attn.k_proj.bias', 'model.layers.10.self_attn.q_proj.weight', 'model.layers.22.self_attn.q_proj.bias', 'model.layers.16.self_attn.v_proj.weight', 'model.layers.2.self_attn.k_proj.bias', 'model.layers.22.self_attn.v_proj.bias', 'model.layers.15.self_attn.v_proj.bias', 'model.layers.9.self_attn.k_proj.weight', 'model.layers.5.self_attn.q_proj.weight', 'model.layers.11.self_attn.v_proj.weight', 'model.layers.9.self_attn.q_proj.bias', 'model.layers.20.self_attn.k_proj.weight', 'model.layers.0.self_attn.q_proj.bias', 'model.layers.3.self_attn.k_proj.bias', 'model.layers.8.self_attn.q_proj.bias', 'model.layers.8.self_attn.k_proj.weight', 'model.layers.9.self_attn.v_proj.bias', 'model.layers.21.self_attn.q_proj.weight', 'model.layers.1.self_attn.k_proj.weight', 'model.layers.3.self_attn.k_proj.weight', 'model.layers.12.self_attn.k_proj.bias', 'model.layers.8.self_attn.q_proj.weight', 'model.layers.17.self_attn.k_proj.bias', 'model.layers.7.self_attn.v_proj.bias', 'model.layers.13.self_attn.v_proj.bias', 'model.layers.20.self_attn.q_proj.bias', 'model.layers.9.self_attn.v_proj.weight', 'model.layers.4.self_attn.v_proj.bias', 'model.layers.7.self_attn.q_proj.weight', 'model.layers.17.self_attn.q_proj.bias', 'model.layers.16.self_attn.q_proj.bias', 'model.layers.19.self_attn.v_proj.bias', 'model.layers.21.self_attn.v_proj.weight', 'model.layers.5.self_attn.q_proj.bias', 'model.layers.15.self_attn.k_proj.weight', 'model.layers.6.self_attn.k_proj.bias', 'model.layers.15.self_attn.q_proj.weight', 'model.layers.7.self_attn.q_proj.bias', 'model.layers.6.self_attn.q_proj.weight', 'model.layers.17.self_attn.v_proj.weight', 'model.layers.21.self_attn.k_proj.bias', 'model.layers.3.self_attn.q_proj.weight', 
'model.layers.10.self_attn.v_proj.weight', 'model.layers.10.self_attn.v_proj.bias', 'model.layers.16.self_attn.k_proj.bias', 'model.layers.2.self_attn.k_proj.weight', 'model.layers.12.self_attn.q_proj.weight', 'model.layers.14.self_attn.k_proj.weight', 'model.layers.14.self_attn.q_proj.weight', 'model.layers.19.self_attn.q_proj.bias', 'model.layers.4.self_attn.v_proj.weight', 'model.layers.18.self_attn.q_proj.bias', 'model.layers.7.self_attn.v_proj.weight', 'model.layers.19.self_attn.q_proj.weight', 'model.layers.3.self_attn.v_proj.weight', 'model.layers.11.self_attn.v_proj.bias', 'model.layers.6.self_attn.q_proj.bias', 'model.layers.18.self_attn.v_proj.bias', 'model.layers.16.self_attn.q_proj.weight', 'model.layers.2.self_attn.v_proj.weight', 'model.layers.1.self_attn.v_proj.bias', 'model.layers.4.self_attn.k_proj.weight', 'model.layers.17.self_attn.k_proj.weight', 'model.layers.5.self_attn.v_proj.bias', 'model.layers.14.self_attn.k_proj.bias', 'model.layers.1.self_attn.q_proj.bias', 'model.layers.20.self_attn.v_proj.bias', 'model.layers.23.self_attn.v_proj.weight', 'model.layers.20.self_attn.v_proj.weight', 'model.layers.0.self_attn.k_proj.weight', 'model.layers.8.self_attn.v_proj.bias', 'model.layers.14.self_attn.v_proj.weight', 'model.layers.23.self_attn.q_proj.weight', 'model.layers.17.self_attn.q_proj.weight', 'model.layers.7.self_attn.k_proj.weight', 'model.layers.13.self_attn.q_proj.bias', 'model.layers.15.self_attn.k_proj.bias', 'model.layers.20.self_attn.q_proj.weight', 'model.layers.6.self_attn.v_proj.weight', 'model.layers.19.self_attn.k_proj.bias', 'model.layers.12.self_attn.v_proj.weight', 'model.layers.0.self_attn.k_proj.bias', 'model.layers.18.self_attn.q_proj.weight', 'model.layers.12.self_attn.v_proj.bias', 'model.layers.16.self_attn.v_proj.bias', 'model.layers.1.self_attn.q_proj.weight', 'model.layers.17.self_attn.v_proj.bias', 'model.layers.21.self_attn.q_proj.bias', 'model.layers.22.self_attn.k_proj.weight', 
'model.layers.3.self_attn.q_proj.bias', 'model.layers.11.self_attn.k_proj.weight', 'model.layers.0.self_attn.q_proj.weight', 'model.layers.23.self_attn.k_proj.weight', 'model.layers.10.self_attn.q_proj.bias', 'model.layers.18.self_attn.v_proj.weight', 'model.layers.22.self_attn.v_proj.weight', 'model.layers.23.self_attn.q_proj.bias', 'model.layers.22.self_attn.q_proj.weight', 'model.layers.8.self_attn.k_proj.bias', 'model.layers.11.self_attn.q_proj.weight', 'model.layers.8.self_attn.v_proj.weight', 'model.layers.18.self_attn.k_proj.weight', 'model.layers.2.self_attn.q_proj.bias', 'model.layers.1.self_attn.v_proj.weight', 'model.layers.11.self_attn.k_proj.bias', 'model.layers.13.self_attn.v_proj.weight', 'model.layers.0.self_attn.v_proj.weight', 'model.layers.5.self_attn.k_proj.weight', 'model.layers.9.self_attn.q_proj.weight', 'model.layers.18.self_attn.k_proj.bias', 'model.layers.14.self_attn.v_proj.bias', 'model.layers.15.self_attn.q_proj.bias', 'model.layers.21.self_attn.k_proj.weight', 'model.layers.12.self_attn.k_proj.weight', 'model.layers.15.self_attn.v_proj.weight', 'model.layers.13.self_attn.k_proj.bias', 'model.layers.7.self_attn.k_proj.bias', 'model.layers.13.self_attn.q_proj.weight', 'model.layers.21.self_attn.v_proj.bias', 'model.layers.4.self_attn.q_proj.weight', 'model.layers.16.self_attn.k_proj.weight', 'model.layers.23.self_attn.k_proj.bias', 'model.layers.6.self_attn.v_proj.bias', 'model.layers.19.self_attn.v_proj.weight', 'model.layers.20.self_attn.k_proj.bias', 'model.layers.1.self_attn.k_proj.bias', 'model.layers.10.self_attn.k_proj.bias', 'model.layers.5.self_attn.v_proj.weight', 'model.layers.23.self_attn.v_proj.bias', 'model.layers.2.self_attn.q_proj.weight', 'model.layers.13.self_attn.k_proj.weight', 'model.layers.2.self_attn.v_proj.bias', 'model.layers.14.self_attn.q_proj.bias', 'model.layers.10.self_attn.k_proj.weight', 'model.layers.3.self_attn.v_proj.bias', 'model.layers.4.self_attn.q_proj.bias', 
'model.layers.19.self_attn.k_proj.weight', 'model.layers.4.self_attn.k_proj.bias', 'model.layers.0.self_attn.v_proj.bias', 'model.layers.6.self_attn.k_proj.weight', 'model.layers.5.self_attn.k_proj.bias']
- This IS expected if you are initializing PhiForCausalLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing PhiForCausalLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of PhiForCausalLM were not initialized from the model checkpoint at microsoft/phi-1_5 and are newly initialized: ['model.layers.12.self_attn.query_key_value.weight', 'model.layers.7.self_attn.query_key_value.weight', 'model.layers.15.self_attn.query_key_value.bias', 'model.layers.18.self_attn.query_key_value.bias', 'model.layers.21.self_attn.query_key_value.weight', 'model.layers.6.self_attn.query_key_value.bias', 'model.layers.18.self_attn.query_key_value.weight', 'model.layers.17.self_attn.query_key_value.weight', 'model.layers.4.self_attn.query_key_value.bias', 'model.layers.4.self_attn.query_key_value.weight', 'model.layers.8.self_attn.query_key_value.weight', 'model.layers.16.self_attn.query_key_value.bias', 'model.layers.19.self_attn.query_key_value.weight', 'model.layers.21.self_attn.query_key_value.bias', 'model.layers.7.self_attn.query_key_value.bias', 'model.layers.3.self_attn.query_key_value.weight', 'model.layers.2.self_attn.query_key_value.bias', 'model.layers.17.self_attn.query_key_value.bias', 'model.layers.9.self_attn.query_key_value.weight', 'model.layers.13.self_attn.query_key_value.weight', 'model.layers.6.self_attn.query_key_value.weight', 'model.layers.1.self_attn.query_key_value.weight', 'model.layers.22.self_attn.query_key_value.weight', 'model.layers.2.self_attn.query_key_value.weight', 'model.layers.23.self_attn.query_key_value.bias', 'model.layers.0.self_attn.query_key_value.bias', 'model.layers.15.self_attn.query_key_value.weight', 'model.layers.10.self_attn.query_key_value.weight', 'model.layers.23.self_attn.query_key_value.weight', 'model.layers.0.self_attn.query_key_value.weight', 'model.layers.5.self_attn.query_key_value.weight', 'model.layers.5.self_attn.query_key_value.bias', 'model.layers.22.self_attn.query_key_value.bias', 'model.layers.11.self_attn.query_key_value.bias', 'model.layers.10.self_attn.query_key_value.bias', 'model.layers.19.self_attn.query_key_value.bias', 
'model.layers.14.self_attn.query_key_value.weight', 'model.layers.8.self_attn.query_key_value.bias', 'model.layers.20.self_attn.query_key_value.bias', 'model.layers.9.self_attn.query_key_value.bias', 'model.layers.16.self_attn.query_key_value.weight', 'model.layers.14.self_attn.query_key_value.bias', 'model.layers.12.self_attn.query_key_value.bias', 'model.layers.20.self_attn.query_key_value.weight', 'model.layers.13.self_attn.query_key_value.bias', 'model.layers.3.self_attn.query_key_value.bias', 'model.layers.11.self_attn.query_key_value.weight', 'model.layers.1.self_attn.query_key_value.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
generation_config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 74.0/74.0 [00:00<00:00, 70.9kB/s]
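
A mismatch like the one in the log above can also be caught programmatically rather than by reading warnings. As far as I know, `from_pretrained` accepts `output_loading_info=True` and then returns the model together with a dictionary that includes `missing_keys` and `unexpected_keys`. The helper below is a hypothetical sketch (`check_loading_info` is not part of transformers) that works on such a dictionary; the sample values are copied from the log.

```python
def check_loading_info(loading_info: dict) -> None:
    """Raise if the checkpoint and the model class disagree on parameter names.

    `loading_info` is assumed to look like the dictionary returned when calling
    from_pretrained(..., output_loading_info=True).
    """
    missing = loading_info.get("missing_keys", [])
    unexpected = loading_info.get("unexpected_keys", [])
    if missing or unexpected:
        raise RuntimeError(
            f"{len(unexpected)} checkpoint weights unused, "
            f"{len(missing)} model weights newly initialized"
        )


# Sample dictionary built from two of the key names in the log above.
sample_info = {
    "missing_keys": ["model.layers.0.self_attn.query_key_value.weight"],
    "unexpected_keys": ["model.layers.0.self_attn.q_proj.weight"],
}

try:
    check_loading_info(sample_info)
    mismatch_detected = False
except RuntimeError:
    mismatch_detected = True
```

Failing loudly here avoids running generation on a model whose attention weights were randomly initialized.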

Maykeye

Jan 11

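The Phi implementation in the transformers library uses a different "architecture": for one, instead of separate q_proj/k_proj/v_proj layers it has a single fused query_key_value layer.

The simplest solution is to change the config so that it uses the modeling files provided in the repository.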
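
To illustrate the naming mismatch, here is a small sketch that collapses the split names from the "not used" list into the fused names from the "newly initialized" list. `fused_name` is a hypothetical helper, and the key names are the layer 0 entries from the log in the first post. It only compares names; it makes no assumption about how the fused tensor is laid out.

```python
import re

# Matches the split attention projections that appear in the checkpoint.
_SPLIT_PROJ = re.compile(r"\.self_attn\.(q_proj|k_proj|v_proj)\.")


def fused_name(key: str) -> str:
    """Map a q_proj/k_proj/v_proj parameter name to its query_key_value name."""
    return _SPLIT_PROJ.sub(".self_attn.query_key_value.", key)


# Layer 0 entries from the "were not used" list.
unused = [
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.self_attn.k_proj.weight",
    "model.layers.0.self_attn.v_proj.weight",
    "model.layers.0.self_attn.q_proj.bias",
    "model.layers.0.self_attn.k_proj.bias",
    "model.layers.0.self_attn.v_proj.bias",
]

# Layer 0 entries from the "newly initialized" list.
newly_initialized = [
    "model.layers.0.self_attn.query_key_value.weight",
    "model.layers.0.self_attn.query_key_value.bias",
]

# Six split names collapse onto the two fused names.
collapsed = sorted({fused_name(k) for k in unused})
names_line_up = collapsed == sorted(newly_initialized)
```

So every unused checkpoint weight has a randomly initialized counterpart in the loaded model, which is why the output would be garbage without a fix.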

This worked for me; some steps might be redundant:

I copied modeling_phi.py to modeling_phi_1_5.py and configuration_phi.py to configuration_phi_1_5.py, to prevent a filename collision with transformers in case it checks for one
Added this to config.json:

    "auto_map": {
      "AutoConfig": "configuration_phi_1_5.PhiConfig",
      "AutoModelForCausalLM": "modeling_phi_1_5.PhiForCausalLM"
    },

Changed model_type to "model_type": "phi_1_5" (I think without this change transformers didn't try to load custom_code)
Changed architectures to "PhiForCausalLM_1_5" (I didn't change the .py file beyond renaming)
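
The config edits above could be scripted roughly like this. It is only a sketch: `patch_config` is a hypothetical helper, the starting values `"phi"` and `["PhiForCausalLM"]` are my assumption about what the file contained before, and I'm assuming `architectures` is stored as a list, as is usual in config.json.

```python
import copy
import json


def patch_config(config: dict) -> dict:
    """Return a copy of a config.json dict with the edits described above."""
    patched = copy.deepcopy(config)
    # Point the Auto classes at the renamed files in the local model folder.
    patched["auto_map"] = {
        "AutoConfig": "configuration_phi_1_5.PhiConfig",
        "AutoModelForCausalLM": "modeling_phi_1_5.PhiForCausalLM",
    }
    # Renamed so transformers does not match its built-in "phi" model type.
    patched["model_type"] = "phi_1_5"
    patched["architectures"] = ["PhiForCausalLM_1_5"]
    return patched


# Assumed starting point; the real config.json has many more keys.
original = {"model_type": "phi", "architectures": ["PhiForCausalLM"]}
patched = patch_config(original)
print(json.dumps(patched, indent=2))
```

In practice you would `json.load` the local config.json, pass it through the helper, and write the result back with `json.dump`.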

After these changes the model loaded successfully.

Write a detailed analogy between mathematics and a lighthouse.

Answer: Mathematics is like a lighthouse, guiding us through the complex world of numbers and calculations. It illumin<MAX_NEW_TOKENS_REACHED>

(Interestingly, even with do_sample=False I get a different result from the one on the model card: Mathematics is like a lighthouse, guiding us through the vast ocean of numbers and calculations. Just as a lighthouse illuminates....)

gugarosa

Microsoft org Jan 11

Hello @nihalnayak !

We just pushed a fix to the config.json and it should work now. The auto_map key was missing and hence it was not properly using the files on this repository when trust_remote_code=True.

Best regards,
Gustavo.
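
For anyone who wants to verify a config.json before loading, a quick check for the `auto_map` entry mentioned above might look like the sketch below. `remote_code_entry` is a hypothetical helper, and the `auto_map` value shown is only a guess at the format, not the exact content of the pushed fix.

```python
from typing import Optional


def remote_code_entry(config: dict, auto_class: str = "AutoModelForCausalLM") -> Optional[str]:
    """Return the auto_map target for `auto_class`, or None if it is missing."""
    return config.get("auto_map", {}).get(auto_class)


# Without auto_map there is nothing pointing at the repository's own files.
before_fix = {"model_type": "phi"}

# Illustrative value only; the actual entry in the repository may differ.
after_fix = {
    "model_type": "phi",
    "auto_map": {"AutoModelForCausalLM": "modeling_phi.PhiForCausalLM"},
}

missing = remote_code_entry(before_fix)
present = remote_code_entry(after_fix)
```

If the function returns None, `trust_remote_code=True` has no custom class to resolve for that Auto class, which matches the behavior reported in this thread.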

gugarosa changed discussion status to closed Jan 11
