Error when starting endpoint in both Huggingface and Sagemaker: RuntimeError: weight model.embed_tokens.weight does not exist
I'm consistently getting the following error message when setting up an endpoint according to instructions in both Huggingface as well as Sagemaker.
Error:
```
Server message: Endpoint failed to start.
  File "asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 196, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 233, in get_model
    return FlashLlama(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_llama.py", line 69, in __init__
    model = FlashLlamaForCausalLM(config, weights)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 412, in __init__
    self.model = FlashLlamaModel(config, weights)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 346, in __init__
    self.embed_tokens = TensorParallelEmbedding(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/layers.py", line 502, in __init__
    weight = weights.get_partial_sharded(f"{prefix}.weight", dim=0)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 88, in get_partial_sharded
    filename, tensor_name = self.get_filename(tensor_name)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 64, in get_filename
    raise RuntimeError(f"weight {tensor_name} does not exist")

RuntimeError: weight model.embed_tokens.weight does not exist

{"timestamp":"2024-01-30T21:18:25.376761Z","level":"INFO","fields":{"message":"Shard terminated"},"target":"text_generation_launcher","span":{"rank":1,"name":"shard-manager"},"spans":[{"rank":1,"name":"shard-manager"}]}
```
Error: ShardCannotStart
Python code used to initialize the endpoint in SageMaker:
```
import json

from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

# Hub model configuration
hub = {
    'HF_MODEL_ID': 'defog/sqlcoder-70b-alpha',
    'SM_NUM_GPUS': json.dumps(1)
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface", version="1.1.0"),
    env=hub,
    role='',  # IAM role redacted
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.8xlarge",
    container_startup_health_check_timeout=300,
)
```
Does anyone know how I can get this hosted on either SageMaker or Hugging Face?
Thanks for reporting, looking into this.
Hi there, we discovered a bizarre bug where the model's lm_head.weight was not uploaded to HF in the upload process. This is causing many integrations to break, and the model uploaded here is producing gibberish results.
Fix coming soon, hopefully in the next hour.
Fixed with a reupload of the model weights! Apologies for the issue. Please let me know if you still run into problems.
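For anyone hitting a similar `weight ... does not exist` error with other checkpoints: a sharded safetensors upload ships a `model.safetensors.index.json` whose `weight_map` maps every tensor name to a shard file, so you can spot a missing tensor before deploying. A minimal sketch, assuming you have downloaded the index file; the `sample_index` contents and the `missing_tensors` helper here are illustrative, not the real sqlcoder-70b-alpha weight map:

```python
import json

def missing_tensors(index_json: str, required: list[str]) -> list[str]:
    """Return the required tensor names absent from a safetensors index."""
    weight_map = json.loads(index_json)["weight_map"]
    return [name for name in required if name not in weight_map]

# Illustrative index contents -- a real model.safetensors.index.json lists
# every tensor in the checkpoint and the shard file that stores it.
sample_index = json.dumps({
    "metadata": {"total_size": 0},
    "weight_map": {
        "model.embed_tokens.weight": "model-00001-of-00029.safetensors",
        "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00029.safetensors",
        "model.norm.weight": "model-00029-of-00029.safetensors",
    },
})

# lm_head.weight is deliberately left out, mirroring the bug in this thread.
print(missing_tensors(sample_index, ["model.embed_tokens.weight", "lm_head.weight"]))
# -> ['lm_head.weight']
```

Running this against the repo's actual index file (fetched with `huggingface_hub.hf_hub_download`, for example) would have flagged the missing `lm_head.weight` before the endpoint ever tried to start.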