TGI model serving errors
I tried to use TGI to serve the model, but I got the following errors.
Any comments would be appreciated.
2024-07-02T01:16:56.299481Z INFO text_generation_launcher: Args {
model_id: "yentinglin/Llama-3-Taiwan-8B-Instruct-128k",
revision: None,
validation_workers: 2,
sharded: None,
num_shard: None,
quantize: Some(
Bitsandbytes,
),
speculate: None,
dtype: None,
trust_remote_code: false,
max_concurrent_requests: 128,
max_best_of: 2,
max_stop_sequences: 4,
max_top_n_tokens: 5,
max_input_tokens: None,
max_input_length: Some(
96000,
),
max_total_tokens: Some(
128000,
),
waiting_served_ratio: 0.3,
max_batch_prefill_tokens: Some(
96000,
),
max_batch_total_tokens: None,
max_waiting_tokens: 20,
max_batch_size: None,
cuda_graphs: None,
hostname: "423c2839e036",
port: 80,
shard_uds_path: "/tmp/text-generation-server",
master_addr: "localhost",
master_port: 29500,
huggingface_hub_cache: Some(
"/data",
),
weights_cache_override: None,
disable_custom_kernels: false,
cuda_memory_fraction: 1.0,
rope_scaling: None,
rope_factor: None,
json_output: false,
otlp_endpoint: None,
otlp_service_name: "text-generation-inference.router",
cors_allow_origin: [],
watermark_gamma: None,
watermark_delta: None,
ngrok: false,
ngrok_authtoken: None,
ngrok_edge: None,
tokenizer_config_path: None,
disable_grammar_support: false,
env: false,
max_client_batch_size: 4,
}
2024-07-02T01:16:56.299797Z INFO hf_hub: Token file not found "/root/.cache/huggingface/token"
2024-07-02T01:16:56.710693Z INFO text_generation_launcher: Bitsandbytes doesn't work with cuda graphs, deactivating them
2024-07-02T01:16:56.711086Z INFO download: text_generation_launcher: Starting download process.
2024-07-02T01:17:03.588211Z INFO text_generation_launcher: Download file: model-00001-of-00004.safetensors
2024-07-02T01:19:13.452334Z INFO text_generation_launcher: Downloaded /data/models--yentinglin--Llama-3-Taiwan-8B-Instruct-128k/snapshots/9b899f970a1b613c7b0516d4674e5003f467faff/model-00001-of-00004.safetensors in 0:02:09.
2024-07-02T01:19:13.452806Z INFO text_generation_launcher: Download: [1/5] -- ETA: 0:08:36
2024-07-02T01:19:13.454338Z INFO text_generation_launcher: Download file: model-00002-of-00004.safetensors
2024-07-02T01:21:25.292526Z INFO text_generation_launcher: Downloaded /data/models--yentinglin--Llama-3-Taiwan-8B-Instruct-128k/snapshots/9b899f970a1b613c7b0516d4674e5003f467faff/model-00002-of-00004.safetensors in 0:02:11.
2024-07-02T01:21:25.292664Z INFO text_generation_launcher: Download: [2/5] -- ETA: 0:06:31.500000
2024-07-02T01:21:25.293379Z INFO text_generation_launcher: Download file: model-00003-of-00004.safetensors
2024-07-02T01:23:39.538011Z INFO text_generation_launcher: Downloaded /data/models--yentinglin--Llama-3-Taiwan-8B-Instruct-128k/snapshots/9b899f970a1b613c7b0516d4674e5003f467faff/model-00003-of-00004.safetensors in 0:02:14.
2024-07-02T01:23:39.538489Z INFO text_generation_launcher: Download: [3/5] -- ETA: 0:04:23.333334
2024-07-02T01:23:39.540114Z INFO text_generation_launcher: Download file: model-00004-of-00004.safetensors
2024-07-02T01:24:11.750936Z INFO text_generation_launcher: Downloaded /data/models--yentinglin--Llama-3-Taiwan-8B-Instruct-128k/snapshots/9b899f970a1b613c7b0516d4674e5003f467faff/model-00004-of-00004.safetensors in 0:00:32.
2024-07-02T01:24:11.751393Z INFO text_generation_launcher: Download: [4/5] -- ETA: 0:01:47
2024-07-02T01:24:11.752106Z INFO text_generation_launcher: Download file: model.safetensors
5003f467faff/model.safetensors in 0:00:00.
2024-07-02T01:24:12.290813Z INFO text_generation_launcher: Download: [5/5] -- ETA: 0
2024-07-02T01:24:13.297441Z INFO download: text_generation_launcher: Successfully downloaded weights.
2024-07-02T01:24:13.297892Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-07-02T01:24:18.628661Z INFO text_generation_launcher: Detected system cuda
2024-07-02T01:24:23.315694Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-07-02T01:24:24.011563Z ERROR text_generation_launcher: Error when initializing model
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 94, in serve
server.serve(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 267, in serve
asyncio.run(
File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
self._run_once()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
handle._run()
File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 225, in serve_inner
model = get_model(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 591, in get_model
return FlashLlama(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_llama.py", line 69, in __init__
weights = Weights(filenames, device, dtype, process_group=self.process_group)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 37, in __init__
raise RuntimeError(
RuntimeError: Key lm_head.weight was found in multiple files: /data/models--yentinglin--Llama-3-Taiwan-8B-Instruct-128k/snapshots/9b899f970a1b613c7b0516d4674e5003f467faff/model.safetensors and /data/models--yentinglin--Llama-3-Taiwan-8B-Instruct-128k/snapshots/9b899f970a1b613c7b0516d4674e5003f467faff/model-00004-of-00004.safetensors
2024-07-02T01:24:25.118582Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
/opt/conda/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:658: UserWarning: You are using a Backend <class 'text_generation_server.utils.dist.FakeGroup'> as a ProcessGroup. This usage is deprecated since PyTorch 2.0. Please use a public API of PyTorch Distributed instead.
warnings.warn(
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 94, in serve
server.serve(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 267, in serve
asyncio.run(
File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
return future.result()
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 225, in serve_inner
model = get_model(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 591, in get_model
return FlashLlama(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_llama.py", line 69, in __init__
weights = Weights(filenames, device, dtype, process_group=self.process_group)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 37, in __init__
raise RuntimeError(
RuntimeError: Key lm_head.weight was found in multiple files: /data/models--yentinglin--Llama-3-Taiwan-8B-Instruct-128k/snapshots/9b899f970a1b613c7b0516d4674e5003f467faff/model.safetensors and /data/models--yentinglin--Llama-3-Taiwan-8B-Instruct-128k/snapshots/9b899f970a1b613c7b0516d4674e5003f467faff/model-00004-of-00004.safetensors
rank=0
2024-07-02T01:24:25.214480Z ERROR text_generation_launcher: Shard 0 failed to start
2024-07-02T01:24:25.214519Z INFO text_generation_launcher: Shutting down shards
Error: ShardCannotStart
This is the command I used to start TGI:
model=yentinglin/Llama-3-Taiwan-8B-Instruct-128k
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
docker run -e HF_TOKEN='hf_xxxx' --gpus '"device=1"' --shm-size 1g -p 8081:80 -v $volume:/data --name Llama-3-Taiwan-8B-Instruct-128k ghcr.io/huggingface/text-generation-inference:latest --model-id $model --quantize bitsandbytes --max-input-length=96000 --max-total-tokens=128000 --max-batch-prefill-tokens 96000
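For what it's worth, the error further down ("Key lm_head.weight was found in multiple files") can also be reproduced outside the container. Below is a minimal sketch, assuming the safetensors Python package is installed and using the snapshot path from the log above, that scans the downloaded files for tensor names appearing in more than one file:
from collections import defaultdict
from pathlib import Path
from safetensors import safe_open

# Assumption: snapshot path copied from the launcher log above.
snapshot = Path("/data/models--yentinglin--Llama-3-Taiwan-8B-Instruct-128k/snapshots/9b899f970a1b613c7b0516d4674e5003f467faff")

seen = defaultdict(list)  # tensor name -> files that contain it
for shard in sorted(snapshot.glob("*.safetensors")):
    with safe_open(str(shard), framework="pt") as f:
        for key in f.keys():
            seen[key].append(shard.name)

for key, files in seen.items():
    if len(files) > 1:
        print(f"{key} found in multiple files: {files}")
Any duplicated key reported here is exactly the condition that Weights.__init__ rejects with the RuntimeError shown in the traceback above.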
I've encountered a similar bug, but it happened when downloading the model. I used AutoModelForCausalLM to download the model from Hugging Face, and the error messages indicate a parameter size mismatch. Below is the code to reproduce the bug; my transformers version is 4.40.0.
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained('yentinglin/Llama-3-Taiwan-8B-Instruct-128k')
Here is the error message I got:
RuntimeError: Error(s) in loading state_dict for LlamaForCausalLM:
size mismatch for model.embed_tokens.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([128258, 4096]).
size mismatch for model.layers.0.self_attn.q_proj.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
size mismatch for model.layers.0.self_attn.k_proj.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
size mismatch for model.layers.0.self_attn.v_proj.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
size mismatch for model.layers.0.self_attn.o_proj.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
size mismatch for model.layers.0.mlp.gate_proj.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([14336, 4096]).
size mismatch for model.layers.0.mlp.up_proj.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([14336, 4096]).
size mismatch for model.layers.0.mlp.down_proj.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([4096, 14336]).
.
.
.
size mismatch for model.layers.31.self_attn.q_proj.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
size mismatch for model.layers.31.self_attn.k_proj.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
size mismatch for model.layers.31.self_attn.v_proj.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
size mismatch for model.layers.31.self_attn.o_proj.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
size mismatch for model.layers.31.mlp.gate_proj.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([14336, 4096]).
size mismatch for model.layers.31.mlp.up_proj.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([14336, 4096]).
size mismatch for model.layers.31.mlp.down_proj.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([4096, 14336]).
size mismatch for lm_head.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([128258, 4096]).
You may consider adding ignore_mismatched_sizes=True in the model from_pretrained method.
Could the transformers version be causing this issue? I assumed that as long as the transformers version is new enough to run Llama 3, it would also be new enough for this model, since it can be viewed as another fine-tuned version of Llama 3. Everything works fine in the same environment, using the same procedure, with your non-128k version ('yentinglin/Llama-3-Taiwan-8B-Instruct'). I also tried adding ignore_mismatched_sizes=True when loading, but the generation produced by the snippet in the model card is bad, so I think there must be some bug in the loading process.
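One way to see why from_pretrained trips over this is to list the weight files in the repo; here is a sketch, assuming huggingface_hub is available (it ships with transformers). A standalone model.safetensors that is only a few hundred kB next to multi-GB model-0000x-of-00004.safetensors shards would suggest the single file is a stub rather than the full checkpoint, which would be consistent with the torch.Size([0]) shapes above.
from huggingface_hub import HfApi

api = HfApi()
info = api.model_info("yentinglin/Llama-3-Taiwan-8B-Instruct-128k", files_metadata=True)

# Print every safetensors file in the repo together with its size in bytes.
for sibling in info.siblings:
    if sibling.rfilename.endswith(".safetensors"):
        print(sibling.rfilename, sibling.size)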
It's weird. I will check and hopefully I didn't mess up anything...
@yentinglin, it seems there is a model.safetensors file in the repo, which signals that it contains the full model. Alongside it, there are sharded files.
The issue appears to be that this model.safetensors is not the full version of the model: it's a 562 kB file. The inference engines try to load that file, but it is incomplete compared to the sharded files.
Would it be possible to remove that model.safetensors file?
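For reference, removing a single file from a Hub repo can be scripted with huggingface_hub; a sketch, assuming write access and a valid token:
from huggingface_hub import delete_file

# Assumption: the caller has write access to the repo and a token is configured (e.g. via HF_TOKEN).
delete_file(
    path_in_repo="model.safetensors",
    repo_id="yentinglin/Llama-3-Taiwan-8B-Instruct-128k",
    commit_message="Remove stub model.safetensors; the sharded files contain the full model",
)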
Sure, it's removed now.
It works. Thank you!