---
license: other
license_name: yi-license
license_link: LICENSE
base_model: 01-ai/Yi-34B-200K
inference: false
model_creator: 01-ai
model_name: Yi-34B-200K
model_type: yi
prompt_template: '{system}\n\nHuman:\n{prompt}\n\nAssistant:\n'
quantized_by: Sombressoul
---
# Yi-34B-200K - AWQ

- Model creator: [01-ai](https://huggingface.co/01-ai)
- Original model: [`Yi-34B-200K`](https://huggingface.co/01-ai/Yi-34B-200K)

This is a quantized (AWQ) version of [`Yi-34B-200K`](https://huggingface.co/01-ai/Yi-34B-200K).

For more information about the model, see the original page.

## Quantization

Quantization was performed using [casper-hansen/AutoAWQ](https://github.com/casper-hansen/AutoAWQ). Samples from the Open-Orca/OpenOrca dataset were used as calibration data for measuring activations.

**Quantization code**

```python
from random import randint

from awq import AutoAWQForCausalLM
from datasets import load_dataset
from transformers import AutoTokenizer

dataset_size = 20_000  # number of calibration samples drawn from OpenOrca
model_path = '/PATH/TO/ORIGINAL/YI-34B-200K'
quant_path = 'quant'
quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM"
}

model = AutoAWQForCausalLM.from_pretrained(model_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)


def load_orca():
    data = load_dataset('Open-Orca/OpenOrca', split='train')

    def concatenate_data(x):
        # Format each sample with the same template that is used for inference.
        s = str(x['system_prompt']).strip()
        q = str(x['question']).strip().split('Answer:')[0].strip()
        a = str(x['response']).strip()
        return '{}\n\nHuman:\n{}\n\nAssistant:\n{}\n'.format(s, q, a)

    # randint is inclusive on both ends, so the upper bound is len(data) - 1.
    return [concatenate_data(data[randint(0, len(data) - 1)]) for _ in range(dataset_size)]


model.quantize(tokenizer, quant_config=quant_config, calib_data=load_orca())
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```

The model has been successfully tested.

## Inferencing

The model runs well on an RTX 4090 (using more than 95% of VRAM), but be prepared to wait when the context is relatively large.

During testing, the following query template was used:

```python
# ...
template = f'{system_prompt}\n\nHuman:\n{user_prompt}\n\nAssistant:\n'
# ...
```

**Sample output**

An example of the model output. In this case, the model was given a text (instructions, examples, and a user's query) of about 2k tokens in total, hence the long processing time:

```plaintext
User's request:
Prove that there is a connection between nails and hunger.

Model's response:
Task Statement: Argumentatively prove or refute the following logical statement: "There is a connection between nails and hunger."

Logical Reasoning Solution:
1. Analyze the Task Statement: The task statement asks to argueatively prove or refute the statement "There is a connection between nails and hunger."
2. Define the Terms: Nails refer to the keratin-based protective structures on the tips of fingers and toes. Hunger, on the other hand, is a physiological sensation of lacking food and a drive to consume it.
3. Analyze the Relationship: There is no direct physiological or biological connection between nails and hunger. Nails are not directly related to the sensation of hunger or the regulation of food intake.
4. Exploring Potential Connections: It is important to consider indirect or metaphorical connections. However, there is no established indirect connection between nails and hunger.
5. Analyze the Task: The task is to argueatively prove or refute the statement. Since there is no direct or indirect connection between nails and hunger, the statement is refutable.

Therefore, the statement "There is a connection between nails and hunger" is refutable, as there is no direct or indirect relationship between nails and hunger.

(-------------------------------------------------------------------------------------)
(----> Generated in 684.5489482879639 seconds
(-------------------------------------------------------------------------------------)
```

## Notes

Even though it's a base model, when it is used with the right context, it does a surprisingly good job of following instructions.
However, it requires some polishing (I mean fine-tuning for downstream tasks). I hope that the authors of the original model will release fine-tuned versions in the near future.
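
For reference, the query template from the Inferencing section can be wrapped in a small helper. The sketch below is illustrative only: `build_prompt` and `generate` are hypothetical names that do not appear in the original test code, stripping the inputs is an assumption borrowed from the calibration formatting, and the loading call follows AutoAWQ's documented `from_quantized` usage, whose arguments may differ between library versions.

```python
def build_prompt(system_prompt: str, user_prompt: str) -> str:
    """Fill the '{system}\\n\\nHuman:\\n{prompt}\\n\\nAssistant:\\n' template."""
    # Stripping mirrors the calibration formatting; this is an assumption.
    system_prompt = str(system_prompt).strip()
    user_prompt = str(user_prompt).strip()
    return f'{system_prompt}\n\nHuman:\n{user_prompt}\n\nAssistant:\n'


def generate(quant_path: str, prompt: str, max_new_tokens: int = 256) -> str:
    """Hypothetical helper: load the quantized model and complete one prompt."""
    # Imported lazily so that build_prompt works without a GPU stack installed.
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    model = AutoAWQForCausalLM.from_quantized(
        quant_path, fuse_layers=True, trust_remote_code=True
    )
    tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)

    input_ids = tokenizer(prompt, return_tensors='pt').input_ids.cuda()
    output = model.generate(input_ids, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, skipping the prompt itself.
    return tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True)
```

A call would then look like `generate('quant', build_prompt(system_prompt, user_prompt))`, where `'quant'` is the output directory from the quantization code.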