Error when starting endpoint in both Huggingface and Sagemaker: RuntimeError: weight model.embed_tokens.weight does not exist
I'm consistently getting the following error message when setting up an endpoint according to instructions in both Huggingface as well as Sagemaker.
Error:
```
Server message: Endpoint failed to start.
  File "asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 196, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 233, in get_model
    return FlashLlama(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_llama.py", line 69, in __init__
    model = FlashLlamaForCausalLM(config, weights)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 412, in __init__
    self.model = FlashLlamaModel(config, weights)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 346, in __init__
    self.embed_tokens = TensorParallelEmbedding(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/layers.py", line 502, in __init__
    weight = weights.get_partial_sharded(f"{prefix}.weight", dim=0)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 88, in get_partial_sharded
    filename, tensor_name = self.get_filename(tensor_name)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 64, in get_filename
    raise RuntimeError(f"weight {tensor_name} does not exist")

RuntimeError: weight model.embed_tokens.weight does not exist

{"timestamp":"2024-01-30T21:18:25.376761Z","level":"INFO","fields":{"message":"Shard terminated"},"target":"text_generation_launcher","span":{"rank":1,"name":"shard-manager"},"spans":[{"rank":1,"name":"shard-manager"}]}
```
Error: ShardCannotStart
Python code used to initialize the endpoint in SageMaker:
```
import json

from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

# Hub model configuration
hub = {
    'HF_MODEL_ID': 'defog/sqlcoder-70b-alpha',
    'SM_NUM_GPUS': json.dumps(1)
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface", version="1.1.0"),
    env=hub,
    role='',  # IAM role redacted
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.8xlarge",
    container_startup_health_check_timeout=300,
)
```
Does anyone know how I can get this hosted on either SageMaker or Hugging Face?
Thanks for reporting, looking into this.
Hi there, we discovered a bizarre bug where the model's lm_head.weight was not uploaded to HF in the upload process. This is causing many integrations to break, and the model uploaded here is producing gibberish results.
Fix coming soon, hopefully in the next hour.
Fixed with a reupload of the model weights! Apologies for the issue. Please let me know if you still run into problems.
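For anyone hitting a similar `weight ... does not exist` error with other checkpoints: a sharded safetensors upload ships a `model.safetensors.index.json` whose `weight_map` maps every tensor name to a shard file, so you can spot a missing tensor before deploying. A minimal sketch, assuming you have downloaded the index file; the `sample_index` contents and the `missing_tensors` helper here are illustrative, not the real sqlcoder-70b-alpha weight map:

```python
import json

def missing_tensors(index_json: str, required: list[str]) -> list[str]:
    """Return the required tensor names absent from a safetensors index."""
    weight_map = json.loads(index_json)["weight_map"]
    return [name for name in required if name not in weight_map]

# Illustrative index contents -- a real model.safetensors.index.json lists
# every tensor in the checkpoint and the shard file that stores it.
sample_index = json.dumps({
    "metadata": {"total_size": 0},
    "weight_map": {
        "model.embed_tokens.weight": "model-00001-of-00029.safetensors",
        "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00029.safetensors",
        "model.norm.weight": "model-00029-of-00029.safetensors",
    },
})

# lm_head.weight is deliberately left out, mirroring the bug in this thread.
print(missing_tensors(sample_index, ["model.embed_tokens.weight", "lm_head.weight"]))
# -> ['lm_head.weight']
```

Running this against the repo's actual index file (fetched with `huggingface_hub.hf_hub_download`, for example) would have flagged the missing `lm_head.weight` before the endpoint ever tried to start.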