Model Card for 4-bit RoLlama3.1-8b-Instruct-DPO

Built from RoLlama3.1-8b-Instruct-DPO, quantized to 4-bit.

This variant of RoLlama3.1-8b-Instruct-DPO provides a reduced footprint through 4-bit quantization, aimed at enabling usage on resource-constrained GPUs while preserving a high fraction of the model’s capabilities.

Model Details

Comparison to 16 bit

Quantization has little effect on most benchmarks, with the largest drops on MMLU, Hellaswag, and GSM8K:

Task	Metric	FP16 Original	4-bit	Absolute Diff.	% Change
ARC Challenge	Avg. Accuracy	44.84	42.74	-2.10	-4.68%
MMLU	Avg. Accuracy	55.06	42.27	-12.79	-23.23%
Winogrande	Avg. Accuracy	65.87	64.94	-0.93	-1.41%
Hellaswag	Avg. Accuracy	58.67	52.39	-6.28	-10.70%
GSM8K	Avg. Accuracy	44.17	38.87	-5.30	-11.99%
TruthfulQA	Avg. Accuracy	47.82	48.67	+0.85	+1.78%
LaRoSeDa (binary)	Macro-F1	96.10	97.47	+1.37	+1.43%
LaRoSeDa (multiclass)	Macro-F1	55.37	64.05	+8.68	+15.68%
WMT EN-RO	BLEU	21.29	20.54	-0.75	-3.52%
WMT RO-EN	BLEU	21.86	21.16	-0.70	-3.20%
XQuAD (avg)	EM / F1	21.58 / 36.54	21.45 / 37.73	-0.13 / +1.19	-0.60% / +3.26%
STS (avg)	Spearman / Pearson	78.01 / 77.98	77.08 / 76.93	-0.93 / -1.05	-1.19% / -1.35%
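
The last two columns of the table above can be reproduced from the two score columns. The sketch below assumes that "% Change" is measured relative to the FP16 score, which matches the published rows; the helper name `quantization_delta` is illustrative and not part of any library.

```python
def quantization_delta(fp16_score: float, int4_score: float) -> tuple[float, float]:
    """Return (absolute difference, percent change) relative to the FP16 score."""
    abs_diff = int4_score - fp16_score
    pct_change = abs_diff / fp16_score * 100
    return round(abs_diff, 2), round(pct_change, 2)

# Scores copied from the MMLU and Winogrande rows of the table.
print(quantization_delta(55.06, 42.27))  # (-12.79, -23.23)
print(quantization_delta(65.87, 64.94))  # (-0.93, -1.41)
```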

Model Description

Developed by: OpenLLM-Ro
Language(s): Romanian
License: cc-by-nc-4.0
Quantized from model: RoLlama3.1-8b-Instruct-DPO
Quantization: 4-bit

Quantization reduces model size and can improve inference speed, but it can also lower benchmark scores. The table in the "Comparison to 16 bit" section compares the original full-precision model with the 4-bit variant on the main benchmarks.
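
As a rough illustration of the size reduction, the sketch below estimates weight storage at 16 and 4 bits per parameter. It assumes about 8 billion parameters (read off the "8b" in the model name) and counts weights only. Actual GPU usage will be higher, since quantization metadata, any layers kept at higher precision, activations, and the KV cache are not included.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GiB, ignoring quantization metadata."""
    return num_params * bits_per_param / 8 / 1024**3

# Assumption: roughly 8 billion parameters, inferred from the "8b" in the model name.
N_PARAMS = 8e9

fp16_gib = weight_memory_gib(N_PARAMS, 16)
int4_gib = weight_memory_gib(N_PARAMS, 4)
print(f"FP16: {fp16_gib:.1f} GiB, 4-bit: {int4_gib:.1f} GiB")  # FP16: 14.9 GiB, 4-bit: 3.7 GiB
```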

How to Use

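from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "OpenLLM-Ro/RoLlama3.1-8b-Instruct-DPO-4bit"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Load the weights in 4-bit through bitsandbytes; device_map="auto" places the layers on the available GPU(s).
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)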

instruction = "Ce jocuri de societate pot juca cu prietenii mei?"
chat = [
    {"role": "system", "content": "Ești un asistent folositor, respectuos și onest. Încearcă să ajuți cât mai mult prin informațiile oferite, excluzând răspunsuri toxice, rasiste, sexiste, periculoase și ilegale."},
    {"role": "user", "content": instruction},
]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, system_message="")

# Move the prompt to the device the model was placed on instead of hard-coding "cuda".
inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt").to(model.device)
outputs = model.generate(input_ids=inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
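
For a decoder-only model, `generate` returns the prompt tokens followed by the newly generated ones, so the decoded text above includes the prompt. A minimal sketch of keeping only the reply is shown below on plain Python lists with made-up token IDs; the helper name `new_tokens_only` is hypothetical.

```python
from typing import Sequence


def new_tokens_only(output_ids: Sequence[int], prompt_length: int) -> Sequence[int]:
    """Drop the first prompt_length IDs, keeping only the generated continuation."""
    return output_ids[prompt_length:]


# Toy IDs for illustration only; they are not real tokenizer output.
prompt_ids = [101, 7, 8, 9]
generated_ids = prompt_ids + [42, 43, 2]
print(new_tokens_only(generated_ids, len(prompt_ids)))  # [42, 43, 2]
```

Applied to the snippet above, the same slice would be `outputs[0][inputs.shape[-1]:]`, passed to `tokenizer.decode` in place of `outputs[0]`.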
