size mismatch for lm_head.weight: copying a param with shape torch.Size([32000, 5120]) from checkpoint, the shape in current model is torch.Size([32001, 5120]).

#11
by muruan - opened

I used vicuna-1.5 to fine-tune a text2SQL model, and I downloaded the model files from https://huggingface.co/lmsys/vicuna-13b-v1.5.
But when I started training the model, I ran into the error below. Can anyone help? It has had me stuck for a week.

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

/home/yinma/.local/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/cuda/extras/CUPTI/lib64'), PosixPath('/usr/local/nvidia/lib')}
warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/yinma/.local/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...
[2023-08-18 06:17:49,170] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-18 06:17:49,170] [INFO] [comm.py:616:init_distributed] cdb=None
[2023-08-18 06:17:49,170] [INFO] [comm.py:643:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-08-18 06:17:49,175] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-18 06:17:49,175] [INFO] [comm.py:616:init_distributed] cdb=None
Loading checkpoint shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 3/3 [00:48<00:00, 16.04s/it]
Traceback (most recent call last):
File "/data/slc-a100/notebooks/yinma/DSS-GPT-FW/./train/fastchat_train_lora.py", line 440, in
train()
File "/data/slc-a100/notebooks/yinma/DSS-GPT-FW/./train/fastchat_train_lora.py", line 356, in train
model = transformers.AutoModelForCausalLM.from_pretrained(
File "/home/yinma/.local/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 493, in from_pretrained
return model_class.from_pretrained(
File "/home/yinma/.local/lib/python3.9/site-packages/transformers/modeling_utils.py", line 2903, in from_pretrained
) = cls._load_pretrained_model(
File "/home/yinma/.local/lib/python3.9/site-packages/transformers/modeling_utils.py", line 3310, in _load_pretrained_model
raise RuntimeError(f"Error(s) in loading state_dict for {model.__class__.__name__}:\n\t{error_msg}")
RuntimeError: Error(s) in loading state_dict for LlamaForCausalLM:
size mismatch for model.embed_tokens.weight: copying a param with shape torch.Size([32000, 5120]) from checkpoint, the shape in current model is torch.Size([32001, 5120]).
size mismatch for lm_head.weight: copying a param with shape torch.Size([32000, 5120]) from checkpoint, the shape in current model is torch.Size([32001, 5120]).
You may consider adding ignore_mismatched_sizes=True in the model from_pretrained method.
Loading checkpoint shards: 67%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 2/3 [00:47<00:23, 23.89s/it][2023-08-18 06:19:54,339] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 72313
[2023-08-18 06:19:54,339] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 72314
[2023-08-18 06:19:56,941] [ERROR] [launch.py:321:sigkill_handler] ['/opt/conda/bin/python', '-u', './train/fastchat_train_lora.py', '--local_rank=1', '--deepspeed', './dp_config_v3.json', '--model_name_or_path', '/data/slc-a100/data/yinma/model/vicuna-13b-v1.5', '--lora_r', '8', '--lora_alpha', '16', '--lora_dropout', '0.05', '--data_path', './data/sql_fintune_data.json', '--output_dir', './ddp_models_v1', '--num_train_epochs', '1', '--fp16', 'True', '--per_device_train_batch_size', '8', '--per_device_eval_batch_size', '8', '--gradient_accumulation_steps', '1', '--evaluation_strategy', 'steps', '--eval_steps', '100', '--save_strategy', 'steps', '--save_steps', '200', '--save_total_limit', '3', '--learning_rate', '2e-5', '--weight_decay', '0.', '--warmup_ratio', '0.03', '--lr_scheduler_type', 'cosine', '--logging_strategy', 'steps', '--logging_steps', '1', '--tf32', 'True', '--model_max_length', '512', '--q_lora', 'False', '--gradient_checkpointing', 'True', '--is_instruction', 'True'] exits with return code = 1

My environment is:
fschat 0.2.23
transformers 4.31.0
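
In case it helps with diagnosis: the mismatch means the model being built in memory (from the config at --model_name_or_path) expects 32001 token rows, while the weights on disk only have 32000. A quick check of the config and tokenizer at that path could look like this (just a sketch; the path is the one from my command line):

from transformers import AutoConfig, AutoTokenizer

path = "/data/slc-a100/data/yinma/model/vicuna-13b-v1.5"  # same path as --model_name_or_path

config = AutoConfig.from_pretrained(path)
tokenizer = AutoTokenizer.from_pretrained(path)

# The traceback says the model built from this config expects 32001 token rows,
# while the checkpoint shards on disk contain 32000, so config.vocab_size is the
# first thing to inspect.
print("config.vocab_size:", config.vocab_size)
print("len(tokenizer):", len(tokenizer))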

I found the cause myself, and I'm adding this comment to help anyone who runs into the same thing.
I had previously fine-tuned the vicuna-1.0 model for this task, then forgot about that, downloaded the new v1.5 weights, and fine-tuned again on top of the old checkpoint, which led to this error message: the old run had resized the vocabulary to 32001 (presumably to add a pad token), while the v1.5 weights ship with 32000 tokens. Fine-tuning again with a new output directory resolved the problem.
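
For anyone who genuinely needs to continue from a run that resized the vocabulary (e.g., a pad token was added during the earlier fine-tune), the cleaner route is to re-apply the same resize to a freshly loaded base model, rather than passing ignore_mismatched_sizes=True as the error suggests, since that would leave the mismatched rows randomly initialized. A minimal sketch, assuming the extra entry was a pad token:

from transformers import AutoModelForCausalLM, AutoTokenizer

base = "/data/slc-a100/data/yinma/model/vicuna-13b-v1.5"  # stock v1.5 weights, 32000 tokens

tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Re-apply the vocabulary change the earlier run made, so embed_tokens and
# lm_head grow from 32000 to 32001 rows and match the old checkpoints.
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({"pad_token": "[PAD]"})
    model.resize_token_embeddings(len(tokenizer))

print(model.get_input_embeddings().weight.shape)  # torch.Size([32001, 5120])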
