Quantizations of https://huggingface.co/01-ai/Yi-6B-200K
Experiment
Quants ending in "_X" are experimental quants. These quants are the same as normal quants, but their token embedding weights are set to Q8_0 except for Q6_K and Q8_0 which are set to F16. The change will make these experimental quants larger but in theory, should result in improved performance.
List of experimental quants:
- Q2_K_X
- Q4_K_M_X
- Q5_K_M_X
- Q6_K_X
- Q8_0_X
From original readme
Perform inference with Yi chat model
Create a file named
quick_start.py
and copy the following content to it.from transformers import AutoModelForCausalLM, AutoTokenizer model_path = '<your-model-path>' tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False) # Since transformers 4.35.0, the GPT-Q/AWQ model can be loaded using AutoModelForCausalLM. model = AutoModelForCausalLM.from_pretrained( model_path, device_map="auto", torch_dtype='auto' ).eval() # Prompt content: "hi" messages = [ {"role": "user", "content": "hi"} ] input_ids = tokenizer.apply_chat_template(conversation=messages, tokenize=True, add_generation_prompt=True, return_tensors='pt') output_ids = model.generate(input_ids.to('cuda')) response = tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True) # Model response: "Hello! How can I assist you today?" print(response)
Run
quick_start.py
.python quick_start.py
Then you can see an output similar to the one below. 🥳
Hello! How can I assist you today?
Perform inference with Yi base model
Yi-34B
The steps are similar to pip - Perform inference with Yi chat model.
You can use the existing file
text_generation.py
.python demo/text_generation.py --model <your-model-path>
Then you can see an output similar to the one below. 🥳
Output. ⬇️
Prompt: Let me tell you an interesting story about cat Tom and mouse Jerry,
Generation: Let me tell you an interesting story about cat Tom and mouse Jerry, which happened in my childhood. My father had a big house with two cats living inside it to kill mice. One day when I was playing at home alone, I found one of the tomcats lying on his back near our kitchen door, looking very much like he wanted something from us but couldn’t get up because there were too many people around him! He kept trying for several minutes before finally giving up...
Yi-9B
Input
from transformers import AutoModelForCausalLM, AutoTokenizer MODEL_DIR = "01-ai/Yi-9B" model = AutoModelForCausalLM.from_pretrained(MODEL_DIR, torch_dtype="auto") tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR, use_fast=False) input_text = "# write the quick sort algorithm" inputs = tokenizer(input_text, return_tensors="pt").to(model.device) outputs = model.generate(**inputs, max_length=256) print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Output
# write the quick sort algorithm def quick_sort(arr): if len(arr) <= 1: return arr pivot = arr[len(arr) // 2] left = [x for x in arr if x < pivot] middle = [x for x in arr if x == pivot] right = [x for x in arr if x > pivot] return quick_sort(left) + middle + quick_sort(right) # test the quick sort algorithm print(quick_sort([3, 6, 8, 10, 1, 2, 1]))
- Downloads last month
- 264