license: other
Model Card for llama-30b-hf-53q_4bit-128g_WVU
Model Description
llama-30b-hf-53q_4bit-128g_WVU is a model based on the Llama architecture with 30 billion parameters.
The first 53 decoder layers of this model have been quantized with the GPTQ method, using 4-bit precision and a group size of 128.
Then, the last 7 decoder layers (roughly 1/8 of the decoder layers) and lm_head were fine-tuned on the wizard_vicuna_70k_unfiltered dataset for 1 epoch.
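The layer split and the approximate storage cost of the quantized weights can be worked out with a few lines of Python. This is a minimal sketch, not code from this repository: the total of 60 decoder layers is inferred from 53 + 7, and the per-group overhead (one 16-bit scale and one 4-bit zero point per group of 128 weights) is an assumption about a typical GPTQ storage layout rather than a confirmed detail of this checkpoint.

```python
# Sketch only: layer counts follow the description above; the storage
# layout (16-bit scale + 4-bit zero point per group) is an assumption.
NUM_DECODER_LAYERS = 60  # assumed total: 53 quantized + 7 fine-tuned
NUM_QUANTIZED = 53


def split_layers(num_layers, num_quantized):
    """Return (quantized, trainable) decoder layer indices."""
    quantized = list(range(num_quantized))
    trainable = list(range(num_quantized, num_layers))
    return quantized, trainable


def effective_bits_per_weight(bits=4, group_size=128, scale_bits=16, zero_bits=4):
    """Average bits per weight once per-group metadata is included."""
    return bits + (scale_bits + zero_bits) / group_size


quantized, trainable = split_layers(NUM_DECODER_LAYERS, NUM_QUANTIZED)
trainable_fraction = len(trainable) / NUM_DECODER_LAYERS
bits = effective_bits_per_weight()

print(f"quantized layers: {quantized[0]}..{quantized[-1]}")
print(f"trainable layers: {trainable[0]}..{trainable[-1]} (plus lm_head)")
print(f"trainable fraction of decoder layers: {trainable_fraction:.3f}")
print(f"effective bits per quantized weight: {bits}")
```

Under these assumptions, the group size adds only a small fraction of a bit per weight on top of the 4-bit payload, which is why a group size of 128 is a common trade-off between accuracy and size.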
Note
Quantization effectively reduces memory usage; however, it may change the parameter values relative to the original model. Additionally, fine-tuning only the last few layers lowers the memory required for training but could lead to minor performance degradation.
Several alternatives exist for fine-tuning and quantizing the Llama models. The method used here (quantizing most layers, then fine-tuning the last few) is designed to account for errors introduced during quantization, which can sometimes result in unexpected answers. It lets the last few layers be fine-tuned with both the quantization error and the dataset taken into account.
It is worth mentioning that other methods may yield superior performance. For instance:
- Fine-tuning the entire model for X epochs
- Quantizing the first K layers
- Fine-tuning the remaining layers for Y epochs
Nonetheless, because fine-tuning the entire model requires considerable resources (for example, 4 GPUs with 80GB of VRAM are required for the 7B LLaMA), this model omits the first step of the method described above, and it still works.
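The difference between the two recipes can be made explicit with a small planning helper. This is an illustrative sketch only: `build_plan` is a hypothetical function that returns human-readable step descriptions and does not train or quantize anything, the layer count of 60 is inferred from 53 + 7, and the value of 1 epoch for X in the alternative is a placeholder, not a recommendation.

```python
def build_plan(num_layers, k, full_epochs, tail_epochs):
    """Return the ordered steps as strings (hypothetical helper, no training).

    full_epochs corresponds to X, k to K, and tail_epochs to Y in the list above.
    Setting full_epochs to 0 skips the full fine-tuning step.
    """
    if not 0 < k < num_layers:
        raise ValueError("k must leave at least one layer unquantized")
    steps = []
    if full_epochs > 0:
        steps.append(f"fine-tune all {num_layers} layers for {full_epochs} epoch(s)")
    steps.append(f"quantize layers 0..{k - 1}")
    steps.append(
        f"fine-tune layers {k}..{num_layers - 1} and lm_head for {tail_epochs} epoch(s)"
    )
    return steps


# This model: no full fine-tuning step, K = 53, Y = 1.
this_model = build_plan(num_layers=60, k=53, full_epochs=0, tail_epochs=1)

# The alternative, with X = 1 chosen purely as a placeholder.
alternative = build_plan(num_layers=60, k=53, full_epochs=1, tail_epochs=1)

for step in this_model:
    print(step)
```

With `full_epochs=0` the plan reduces to the two steps used for this model, so the only cost saved is the full-model fine-tuning pass.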
Using the Model
To load the model, a custom LlamaForCausalLM is required.
You can find quantized llama here.
References
- Meta - LLaMA
- WizardLM
- GPTQ for LLaMa
- Wizard Vicuna Unfiltered Dataset
- Various unlisted but great works, research, and projects.