Failed to create inference endpoint

#1
by brekk - opened

Issue:
I cannot start the inference endpoint; the log says:
2023/12/07 10:53:21 ~ Error: ShardCannotStart
2023/12/07 10:53:21 ~ {"timestamp":"2023-12-07T01:53:21.369939Z","level":"ERROR","fields":{"message":"Shard 0 failed to start"},"target":"text_generation_launcher"}
2023/12/07 10:53:21 ~ {"timestamp":"2023-12-07T01:53:21.369962Z","level":"INFO","fields":{"message":"Shutting down shards"},"target":"text_generation_launcher"}

Steps to reproduce:
Deploy > Inference Endpoint > Select A10G AWS instance

Is there a way to use an inference endpoint with this LoRA model?

Thanks in advance!

H4 Alignment Handbook org
edited Dec 7, 2023

Hi @brekk
I am not sure Inference Endpoints support LoRA adapters; you should consider using the merged model (which I believe is https://huggingface.co/alignment-handbook/zephyr-7b-sft-full, right @lewtun?). If not, you can merge the model yourself. Please have a look at https://huggingface.co/docs/peft/v0.7.0/en/package_reference/lora#peft.LoraModel.merge_and_unload, but to merge the LoRA model you can just do:

from peft import AutoPeftModelForCausalLM

peft_model_id = "alignment-handbook/zephyr-7b-sft-lora"

model = AutoPeftModelForCausalLM.from_pretrained(peft_model_id)
merged_model = model.merge_and_unload()
merged_model.push_to_hub(YOUR_NEW_MODEL_ID)
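
A likely follow-up step (not shown in the snippet above): the endpoint also needs tokenizer files in the new repository, so push a tokenizer alongside the merged weights, assuming the base model's tokenizer is the right one for this adapter:

from transformers import AutoTokenizer

# Assumption: the Mistral base tokenizer matches the adapter's vocabulary.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer.push_to_hub(YOUR_NEW_MODEL_ID)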

Thank you for the reply @ybelkada.
I will give it a try!

It runs on a Colab T4:

!pip install transformers
!pip install peft
!pip install accelerate
!pip install bitsandbytes

import os
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Create a folder for the offload cache

!mkdir -p /tmp/model_cache

# Load the base model with memory-saving settings

base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    device_map="auto",
    load_in_8bit=True,  # reduce memory usage
    torch_dtype=torch.float16,
    offload_folder="/tmp/model_cache"  # offload path
)

# Load the LoRA adapter on top of the base model

peft_model_id = "alignment-handbook/zephyr-7b-sft-lora"
model = PeftModel.from_pretrained(
    base_model,
    peft_model_id,
    offload_folder="/tmp/model_cache"
)

# Merge the adapter weights into the base model

model.merge_adapter()

# Load the tokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer.pad_token = tokenizer.eos_token

# Prepare the question

prompt = "من هو نابليون بونابرت؟"  # "Who is Napoleon Bonaparte?"

# Tokenize the input

inputs = tokenizer(prompt, return_tensors="pt").to("cuda" if torch.cuda.is_available() else "cpu")

# Generate the answer

with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],  # needed since pad_token == eos_token
        max_length=150,  # cap the answer length
        num_return_sequences=1,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.pad_token_id
    )

# Decode and print the answer

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

# Free memory

del model
del base_model
torch.cuda.empty_cache()
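
Coming back to the original question (running this on an Inference Endpoint): merge_adapter() above only fuses the adapter weights in memory, and the base model was loaded in 8-bit, so this session does not produce a deployable checkpoint. A minimal sketch, assuming you reload the adapter with an fp16 base (merging on top of a quantized model is best avoided) and pick your own repository name (the one below is hypothetical), follows the merge_and_unload() route suggested earlier:

from peft import AutoPeftModelForCausalLM
import torch

# Reload the adapter with an fp16 base, then merge and upload the full model.
model = AutoPeftModelForCausalLM.from_pretrained(
    "alignment-handbook/zephyr-7b-sft-lora",
    torch_dtype=torch.float16,
    device_map="auto",
)
merged = model.merge_and_unload()
merged.push_to_hub("your-username/zephyr-7b-sft-merged")  # hypothetical repo name

Push the tokenizer to the same repository (as in the earlier snippet) and point the Inference Endpoint at this merged repo instead of the LoRA repo.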
