
GGML Version

#4
by s3nh - opened

Outstanding work! I just converted it to GGML, check it out if you are interested! https://huggingface.co/s3nh/LLaMA-2-7B-32K-GGML

@s3nh Will your converted model run easily on Colab's CPU?

Together org

@deepakkaura26 I think so! By default you get 2 vCPUs on Colab with 13 GB RAM, which should be enough to run the GGML versions.
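If you want to verify what the runtime actually gives you, a quick sanity check (a minimal sketch; I'm assuming psutil is preinstalled on Colab, which it normally is) is:

import os, psutil

# Print the number of vCPUs and the total RAM of the Colab machine.
print(os.cpu_count(), "vCPUs,", round(psutil.virtual_memory().total / 1e9, 1), "GB RAM")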

@mauriceweber Actually I tried it, but whether I choose CPU or GPU, my Colab crashed 5 times.

Which quantization did you try? I tried the 4-bit version on Colab and could run it without problems.

from ctransformers import AutoModelForCausalLM

# Load the 4-bit GGML quantization directly from the Hub; this runs on CPU.
model_file = "LLaMA-2-7B-32K.ggmlv3.q4_0.bin"
model = AutoModelForCausalLM.from_pretrained(
    "s3nh/LLaMA-2-7B-32K-GGML", model_type="llama", model_file=model_file
)

prompt = "Whales have been living in the oceans for millions of years "
print(model(prompt, max_new_tokens=128, temperature=0.9, top_p=0.7))

EDIT: load the model directly from the Hub.
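If you want to watch the output as it is generated instead of waiting for the full completion, a minimal sketch (assuming ctransformers' stream=True option) looks like this:

# Stream the generated text piece by piece (stream=True returns a generator of text chunks).
for text in model(prompt, max_new_tokens=128, temperature=0.9, top_p=0.7, stream=True):
    print(text, end="", flush=True)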

@mauriceweber I have used the same example that is shown on this model's page:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("togethercomputer/LLaMA-2-7B-32K")
model = AutoModelForCausalLM.from_pretrained(
    "togethercomputer/LLaMA-2-7B-32K", trust_remote_code=True, torch_dtype=torch.float16
)

input_context = "Your text here"
input_ids = tokenizer.encode(input_context, return_tensors="pt")
output = model.generate(input_ids, max_length=128, temperature=0.7)
output_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(output_text)

@mauriceweber I tried to run the code you showed and it gives me the following error:


HTTPError Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_errors.py in hf_raise_for_status(response, endpoint_name)
260 try:
--> 261 response.raise_for_status()
262 except HTTPError as e:

11 frames
HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/api/models/LLaMA-2-7B-32K.ggmlv3.q4_0.bin/revision/main

The above exception was the direct cause of the following exception:

RepositoryNotFoundError Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_errors.py in hf_raise_for_status(response, endpoint_name)
291 " make sure you are authenticated."
292 )
--> 293 raise RepositoryNotFoundError(message, response) from e
294
295 elif response.status_code == 400:

RepositoryNotFoundError: 401 Client Error. (Request ID: Root=1-64caab34-5bd826d76686f26a76b02644;7f562443-2822-41e5-bcd0-37c62aef99f9)

Repository Not Found for url: https://huggingface.co/api/models/LLaMA-2-7B-32K.ggmlv3.q4_0.bin/revision/main.
Please make sure you specified the correct repo_id and repo_type.
If you are trying to access a private or gated repo, make sure you are authenticated.
Invalid username or password.

Together org

@mauriceweber I have used the same example that is shown on this model's page:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("togethercomputer/LLaMA-2-7B-32K")
model = AutoModelForCausalLM.from_pretrained(
    "togethercomputer/LLaMA-2-7B-32K", trust_remote_code=True, torch_dtype=torch.float16
)

input_context = "Your text here"
input_ids = tokenizer.encode(input_context, return_tensors="pt")
output = model.generate(input_ids, max_length=128, temperature=0.7)
output_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(output_text)

Here you are not using the quantized (GGML) models, which is why you are running out of memory (you need around 14 GB of RAM for the 7B model in float16).
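As a rough back-of-the-envelope check: 7B parameters × 2 bytes per float16 weight is about 14 GB for the weights alone, while a 4-bit GGML quantization needs roughly 7B × 0.5 bytes ≈ 3.5 GB plus some overhead, which fits in Colab's 13 GB of CPU RAM.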

@mauriceweber I tried to run the code you showed and it gives me the following error

This error is because the model file is not downloaded yet (I was assuming you had it downloaded to Colab) -- I adjusted the code snippet above so that the model file gets pulled directly from the repo. You can check the other model versions here.
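For example, a quick way to list which GGML quantizations the repo contains (a minimal sketch using huggingface_hub; the file names are whatever is currently uploaded) is:

from huggingface_hub import list_repo_files

# List the .bin files in the GGML repo to see the available quantization levels.
files = list_repo_files("s3nh/LLaMA-2-7B-32K-GGML")
print([f for f in files if f.endswith(".bin")])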

Let us know how it goes! :)

Is this model already trained? Running the example code just gives me this:
[Screenshot: Screenshot 2023-08-25 at 20.54.51.png]
