Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4 takes extremely long to load (nearly 2 hours)
#2
by TimVan1
Summary:
On Ubuntu 20.04 with 3× RTX 3090 (24 GB), loading the Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4 model takes nearly 2 hours (about 7096 seconds). Even with the flash-attn and auto_gptq libraries installed, loading remains extremely slow!
Full description:
Environment
- Hardware: 3× RTX 3090 (24 GB)
- OS: Ubuntu 20.04.6 LTS
- Python: 3.10
- CUDA: 12.2
- PyTorch: 2.3.1
- Library versions:
  - auto_gptq==0.7.1
  - flash-attn==2.6.0 (installed locally from the cu122torch2.3cxx11abiFALSE-cp310 wheel)
  - optimum==1.21.2
  - transformers==4.42.4 (installed locally from the GitHub source)
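For reference, a minimal check that the Python environment actually resolves these versions and sees all three GPUs (a sketch; the distribution names below are assumed to be the PyPI names of the packages listed above):

import torch
from importlib.metadata import version

# Confirm the installed distributions match the versions listed above.
for pkg in ["auto-gptq", "flash-attn", "optimum", "transformers"]:
    print(pkg, version(pkg))

# Confirm CUDA is usable and all three 3090s are visible to PyTorch.
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))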
Problem description
On Ubuntu 20.04 with 3× RTX 3090 (24 GB), loading the Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4 model takes nearly 2 hours (about 7096 seconds). Even with the flash-attn and auto_gptq libraries installed, loading remains extremely slow!
Code
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"  # the device to load the model onto
model_name_or_path = "/home/ubuntu/models/Qwen/Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4"

# Record the overall start time
start_time = time.time()

# Record the model-load start time
model_load_start_time = time.time()

# Load the model
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)

# Record the model-load end time
model_load_end_time = time.time()
print("Model load time:", model_load_end_time - model_load_start_time)
Output log
2024-07-17 15:43:40 /home/ubuntu/miniconda3/envs/timvan/lib/python3.10/site-packages/transformers/modeling_utils.py:4565: FutureWarning: `_is_quantized_training_enabled` is going to be deprecated in transformers 4.39.0. Please use `model.hf_quantizer.is_trainable` instead
2024-07-17 15:43:40 warnings.warn(
2024-07-17 16:25:29
Loading checkpoint shards: 0%| | 0/3 [00:00<?, ?it/s]
Loading checkpoint shards: 33%|███▎ | 1/3 [19:17<38:34, 1157.34s/it]
Loading checkpoint shards: 67%|██████▋ | 2/3 [41:19<20:54, 1254.48s/it]
Loading checkpoint shards: 100%|██████████| 3/3 [41:20<00:00, 681.82s/it]
Loading checkpoint shards: 100%|██████████| 3/3 [41:20<00:00, 826.72s/it]
2024-07-17 16:25:29 Some weights of the model checkpoint at /home/ubuntu/models/Qwen/Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4 were not used when initializing Qwen2MoeForCausalLM: ['model.layers.0.mlp.experts.0.down_proj.bias', 'model.layers.0.mlp.experts.0.gate_proj.bias', 'model.layers.0.mlp.experts.0.
.........
.........
2024-07-17 17:29:02 - This IS expected if you are initializing Qwen2MoeForCausalLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
2024-07-17 17:29:02 - This IS NOT expected if you are initializing Qwen2MoeForCausalLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
2024-07-17 17:41:54 Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-07-17 17:41:55 The attention mask is not set and cannot be inferred from input because pad token is same as eos token.As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
2024-07-17 17:42:44 Model load time: 7096.050642490387
2024-07-17 17:42:44 Inference time: 49.09219837188721
2024-07-17 17:42:44 Total time: 7145.170483589172
Where is the problem, and how should I improve this?
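For reference, one variant worth benchmarking — a sketch that keeps the whole Int4 model on a single 24 GB card and requests the ExLlama GPTQ kernels explicitly via transformers' GPTQConfig; whether those kernels support this checkpoint's MoE expert layers is an assumption to verify:

from transformers import AutoModelForCausalLM, GPTQConfig

model_name_or_path = "/home/ubuntu/models/Qwen/Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4"

# Variant load: single GPU instead of sharding across three cards, with the
# ExLlama kernels requested explicitly rather than auto-detected.
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    torch_dtype="auto",
    device_map={"": 0},  # place everything on GPU 0
    quantization_config=GPTQConfig(bits=4, use_exllama=True),
    trust_remote_code=True,
)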