Failed to create inference endpoint
Issue:
I cannot start the inference endpoint; the log says:
2023/12/07 10:53:21 ~ Error: ShardCannotStart
2023/12/07 10:53:21 ~ {"timestamp":"2023-12-07T01:53:21.369939Z","level":"ERROR","fields":{"message":"Shard 0 failed to start"},"target":"text_generation_launcher"}
2023/12/07 10:53:21 ~ {"timestamp":"2023-12-07T01:53:21.369962Z","level":"INFO","fields":{"message":"Shutting down shards"},"target":"text_generation_launcher"}
Steps to reproduce:
Deploy > Inference Endpoint > Select A10G AWS instance
Is there a way to use an Inference Endpoint with this LoRA model?
Thanks in advance!
Hi @brekk
I am not sure Inference Endpoints support LoRA, so you should consider using the merged model (which I believe is https://huggingface.co/alignment-handbook/zephyr-7b-sft-full, right @lewtun?). If not, you can merge the model yourself; please have a look at https://huggingface.co/docs/peft/v0.7.0/en/package_reference/lora#peft.LoraModel.merge_and_unload. To merge the LoRA model you can just do:
from peft import AutoPeftModelForCausalLM

peft_model_id = "alignment-handbook/zephyr-7b-sft-lora"  # your LoRA adapter repo
merged_model_id = YOUR_NEW_MODEL_ID  # the repo you want to push the merged model to

# Load the base model with the adapter applied, merge the LoRA weights, and push the result
model = AutoPeftModelForCausalLM.from_pretrained(peft_model_id)
merged_model = model.merge_and_unload()
merged_model.push_to_hub(merged_model_id)
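If you want the new repo to be directly usable by an Inference Endpoint, it may also help to push the tokenizer alongside the merged weights. A minimal sketch, assuming the adapter repo ships tokenizer files (otherwise load the tokenizer from the base model mistralai/Mistral-7B-v0.1 instead):

from transformers import AutoTokenizer

# Sketch: copy the tokenizer into the merged model repo so the repo is self-contained
tokenizer = AutoTokenizer.from_pretrained(peft_model_id)
tokenizer.push_to_hub(merged_model_id)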
It runs on a Colab T4:
!pip install transformers
!pip install peft
!pip install accelerate
!pip install bitsandbytes
import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Create a folder for offloading/caching
!mkdir -p /tmp/model_cache

# Load the base model with memory-saving settings
base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    device_map="auto",
    load_in_8bit=True,  # reduce memory usage
    torch_dtype=torch.float16,
    offload_folder="/tmp/model_cache"  # offload path
)

# Load the fine-tuned (LoRA) adapter on top of the base model
peft_model_id = "alignment-handbook/zephyr-7b-sft-lora"
model = PeftModel.from_pretrained(
    base_model,
    peft_model_id,
    offload_folder="/tmp/model_cache"
)

# Merge the adapter weights into the model
model.merge_adapter()

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer.pad_token = tokenizer.eos_token

# Prepare the question (Arabic: "Who is Napoleon Bonaparte?")
prompt = "من هو نابليون بونابرت؟"

# Tokenize the input
inputs = tokenizer(prompt, return_tensors="pt").to("cuda" if torch.cuda.is_available() else "cpu")

# Generate the answer
with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        max_length=150,  # cap the maximum response length
        num_return_sequences=1,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.pad_token_id
    )

# Decode and print the answer
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

# Free memory
del model
del base_model
torch.cuda.empty_cache()
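Coming back to the original question: merging in memory is not enough for an Inference Endpoint, which needs a standalone repo containing the merged weights. Below is a minimal sketch of exporting the merged model (not part of the Colab snippet above; the model is reloaded in fp16 here because merging weights into an 8-bit-quantized model may not be supported by all peft versions, and the output paths/repo names are placeholders):

import torch
from peft import AutoPeftModelForCausalLM

# Reload base + adapter in half precision, merge the LoRA weights, and save the result
model = AutoPeftModelForCausalLM.from_pretrained(
    "alignment-handbook/zephyr-7b-sft-lora",
    torch_dtype=torch.float16,
)
merged = model.merge_and_unload()
merged.save_pretrained("/tmp/zephyr-7b-sft-merged")  # or merged.push_to_hub("your-username/zephyr-7b-sft-merged")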