
Yi-34B-200K - AWQ

This is a quantized (AWQ) version of Yi-34B-200K.

For more information about the model, see the original page.

Quantization

Quantization was performed using casper-hansen/AutoAWQ.

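The OpenOrca dataset was used as calibration data, that is, to collect the activation statistics that AWQ uses to decide which weights to protect during quantization.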

Quantization code

from awq            import AutoAWQForCausalLM
from datasets       import load_dataset
from random         import randint
from transformers   import AutoTokenizer

dataset_size    = 20_000
model_path      = '/PATH/TO/ORIGINAL/YI-34B-200K'
quant_path      = 'quant'
quant_config    = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }
model           = AutoAWQForCausalLM.from_pretrained(model_path, trust_remote_code=True)
tokenizer       = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

def load_orca():
    data = load_dataset('Open-Orca/OpenOrca', split='train')
    def concatenate_data(x):
        s   = str(x['system_prompt']).strip()
        q   = str(x['question']).strip().split('Answer:')[0].strip()
        a   = str(x['response']).strip()
        return '{}\n\nHuman:\n{}\n\nAssistant:\n{}\n'.format(s, q, a)
    # randint is inclusive on both ends, so the upper bound must be len(data) - 1
    concatenated = [concatenate_data(data[randint(0, len(data) - 1)]) for _ in range(dataset_size)]
    return concatenated

model.quantize(tokenizer, quant_config=quant_config, calib_data=load_orca())

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
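
The calibration set above is built by picking random row indices, which can repeat. As a hedged variant (not the code used for this quantization), distinct indices can be drawn with `random.sample` instead. The helper names `sample_indices` and `format_example`, the fixed seed, and the toy `rows` list are illustrative assumptions; with the real dataset, `rows` would be the loaded OpenOrca split.

```python
import random

def sample_indices(n_rows, n_samples, seed=0):
    """Return distinct row indices in [0, n_rows), at most n_samples of them."""
    rng = random.Random(seed)  # fixed seed (an assumption) keeps the calibration set reproducible
    # random.sample draws without replacement and can never return n_rows itself
    return rng.sample(range(n_rows), min(n_samples, n_rows))

def format_example(row):
    """Mirror the Human/Assistant layout used for the calibration text."""
    s = str(row['system_prompt']).strip()
    q = str(row['question']).strip().split('Answer:')[0].strip()
    a = str(row['response']).strip()
    return '{}\n\nHuman:\n{}\n\nAssistant:\n{}\n'.format(s, q, a)

# Tiny stand-in for the real dataset (hypothetical rows, for illustration only)
rows = [{'system_prompt': 'S', 'question': f'Q{i}', 'response': f'A{i}'} for i in range(10)]
calib = [format_example(rows[i]) for i in sample_indices(len(rows), 4)]
```

The resulting list of strings has the same shape as what `load_orca()` returns, so it could be passed as `calib_data` in the same way.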

The model has been successfully tested.

Inferencing

The model runs on a single RTX 4090 (using more than 95% of its VRAM), but expect long generation times when the context is relatively large.
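
A rough back-of-the-envelope estimate shows why the weights alone take up most of a 24 GB card. This is a sketch under stated assumptions: a round 34e9 quantized parameters (taken from the model name, not measured), FP16 scales and 4-bit zero points stored once per group of 128 weights, and no allowance for unquantized layers, the KV cache, or activations, all of which add to the real footprint.

```python
def awq_weight_bytes(n_params, w_bit=4, group_size=128, scale_bytes=2):
    """Approximate storage for group-quantized weights, in bytes."""
    packed = n_params * w_bit / 8      # 4-bit weights packed into integers
    n_groups = n_params / group_size   # one scale and one zero point per group
    scales = n_groups * scale_bytes    # FP16 scales (assumed 2 bytes each)
    zeros = n_groups * w_bit / 8       # zero points stored at w_bit each
    return packed + scales + zeros

total = awq_weight_bytes(34e9)         # 34e9 is an assumed round figure
print(f'{total / 1e9:.1f} GB')         # prints "17.7 GB"
```

Whatever remains after the weights has to hold the KV cache and activations, which is consistent with VRAM usage climbing as the context grows.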

During testing, the following query template was used:

# ...
template    = f'{system_prompt}\n\nHuman:\n{user_prompt}\n\nAssistant:\n'
# ...
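
For context, a minimal sketch of how this template might be wired into AutoAWQ inference is shown here. It is not the author's test script: the helper names `build_prompt` and `generate`, the `max_new_tokens=512` default, and the choice of `fuse_layers=False` are assumptions, and `quant_path` is expected to point at the directory written by the quantization code above.

```python
def build_prompt(system_prompt, user_prompt):
    """Fill the query template used during testing."""
    return f'{system_prompt}\n\nHuman:\n{user_prompt}\n\nAssistant:\n'

def generate(quant_path, system_prompt, user_prompt, max_new_tokens=512):
    """Load the quantized model and return the generated continuation (needs a CUDA GPU)."""
    # Imported lazily so build_prompt can be used without the GPU stack installed
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    # fuse_layers=False is a conservative assumption; fused layers may not support every architecture
    model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=False, trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)

    prompt = build_prompt(system_prompt, user_prompt)
    input_ids = tokenizer(prompt, return_tensors='pt').input_ids.cuda()
    output_ids = model.generate(input_ids, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens and decode only the newly generated part
    return tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True)
```

Calling `generate('quant', system_prompt, user_prompt)` would then return only the text that follows `Assistant:`.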

Sample output

An example of the model output. In this case, the model was given a text of about 2k tokens in total (instructions, examples, and the user's query), hence the long processing time of the request:

User's request:
Prove that there is a connection between nails and hunger.


Model's response:
Task Statement: Argumentatively prove or refute the following logical statement: "There is a connection between nails and hunger."

Logical Reasoning Solution:
1. Analyze the Task Statement: The task statement asks to argueatively prove or refute the statement "There is a connection between nails and hunger."
2. Define the Terms: Nails refer to the keratin-based protective structures on the tips of fingers and toes. Hunger, on the other hand, is a physiological sensation of lacking food and a drive to consume it.
3. Analyze the Relationship: There is no direct physiological or biological connection between nails and hunger. Nails are not directly related to the sensation of hunger or the regulation of food intake.
4. Exploring Potential Connections: It is important to consider indirect or metaphorical connections. However, there is no established indirect connection between nails and hunger.
5. Analyze the Task: The task is to argueatively prove or refute the statement. Since there is no direct or indirect connection between nails and hunger, the statement is refutable.

Therefore, the statement "There is a connection between nails and hunger" is refutable, as there is no direct or indirect relationship between nails and hunger.
(-------------------------------------------------------------------------------------)
(----> Generated in 684.5489482879639 seconds
(-------------------------------------------------------------------------------------)

Notes

Even though it's a base model, when it is used with the right context, it does a surprisingly good job of following instructions. However, it needs some polishing (that is, fine-tuning for downstream tasks).

I hope that the authors of the original model will release fine-tuned versions in the near future.
