The meta-llama/Llama-3.1-70B-Instruct model has been quantized using AutoRound and serialized in the GPTQ format at 4-bit precision.
The process reduces the model's on-disk size by roughly 70% while retaining about 99% of the original accuracy, balancing efficiency and quality for real-world applications.
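
For reference, a comparable quantization run can be sketched with the open-source auto-round library. The snippet below is an illustrative assumption, not the exact recipe used for this checkpoint: the calibration data, group size, and other hyperparameters were not published, so the values shown are common auto-round defaults.

from auto_round import AutoRound
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = "meta-llama/Llama-3.1-70B-Instruct"
model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(base_model)

# 4-bit weight-only quantization; group_size=128 and sym=True are assumed
# defaults here, not documented settings for this checkpoint.
autoround = AutoRound(model, tokenizer, bits=4, group_size=128, sym=True)
autoround.quantize()

# Export in a GPTQ-compatible layout so the result can be loaded with transformers.
autoround.save_quantized("Meta-Llama-3.1-70B-Instruct-int4-auto-gptq", format="auto_gptq")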

How to run

from transformers import AutoModelForCausalLM, AutoTokenizer

quantized_model = "sofya-ai/Meta-Llama-3.1-70B-Instruct-int4-auto-gptq"

# device_map="auto" shards the quantized weights across the available devices.
model = AutoModelForCausalLM.from_pretrained(quantized_model, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(quantized_model)

# Run a short generation as a sanity check.
text = "The patient was admitted to the hospital"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
output = tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0], skip_special_tokens=True)
print(output)
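
Note: loading a GPTQ-serialized checkpoint through transformers typically requires a GPTQ backend to be installed alongside torch (for example the optimum package together with auto-gptq or gptqmodel); the exact requirements depend on your transformers version.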

This quantization process was conducted by Sofya to make large-scale language models more accessible.
