Running llama-2-7b-chat locally

#52
opened by ohsa1122

Hi, I am using the llama-2-7b-chat online demo at this link: https://huggingface.co/spaces/huggingface-projects/llama-2-7b-chat to run inference, and the accuracy I am getting is pretty good.

I am trying to achieve the same results locally, but I am unable to. I am using the following setup:

model = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
"text-generation",
model=model,
torch_dtype=torch.float16,
device_map="auto",
)

sequences = pipeline(
prompt,
do_sample=True,
top_k=10,
num_return_sequences=1,
eos_token_id=tokenizer.eos_token_id,
max_length=1000,
)

The accuracy I am getting locally is way lower. My questions are: what type of GPU is the online demo using, and what inputs are being passed in its pipeline call?
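
One thing I suspect, but have not confirmed, is the prompt formatting: I pass plain text, while the demo presumably wraps the conversation in Llama 2's [INST] / <<SYS>> chat format before generating. A rough, untested sketch of what that could look like locally, using the chat template bundled with the tokenizer (this needs a recent transformers release, and the system prompt below is just a placeholder):

messages = [
    # placeholder system prompt; the demo's actual system prompt may differ
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt},
]
chat_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

sequences = pipeline(
    chat_prompt,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=1000,
)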

I did not try this myself, but this blog post has the demo working on Hugging Face and mentions the GPUs it suggests:
https://huggingface.co/blog/llama2#:~:text=For%207B%20models,4x%20Nvidia%20A100%22)
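
It might also be worth checking what your local pipeline actually loaded, since device_map="auto" can place some layers on the CPU when GPU memory is tight. A small diagnostic sketch, assuming the pipeline object from your post:

import torch

# diagnostic only: inspect the device and precision the pipeline ended up with
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))    # local GPU model
print(pipeline.model.dtype)                 # expected torch.float16 with the setup above
print(pipeline.model.hf_device_map)         # per-layer placement chosen by device_map="auto"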
