SuperNova Medius Compressed Model (W4A16)
Model ID: arcee-ai/SuperNova-Medius-CM-w4a16
Table of Contents
- Overview
- Quick Start
- Model Details
- Usage Guide
- Quantization Process
- Performance & Benchmarks
- Technical Details
- Limitations & Biases
- Citations & Acknowledgements
Overview
SuperNova Medius CM W4A16 is a quantized version of the arcee-ai/SuperNova-Medius model, optimized for efficient deployment. It was quantized with GPTQ, a post-training quantization method: the linear-layer weights (all except lm_head) are stored in 4 bits instead of 16, which substantially reduces model size while aiming to preserve near-original performance.
Key Features
- 4-bit weight quantization
- 16-bit activations (activations are not quantized)
- 4096 token context window
- Optimized for deployment on consumer hardware
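To make the W4A16 idea concrete, here is a minimal sketch of how one group of weights can be mapped to signed 4-bit codes plus a shared scale, and then restored at inference time. It uses plain round-to-nearest quantization, not GPTQ: GPTQ additionally adjusts the not-yet-quantized weights to compensate for rounding error. The function names, the example weights, and the group of four values are all hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def quantize_group_int4(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric round-to-nearest quantization of one weight group to int4."""
    max_abs = max(abs(w) for w in weights)
    # Signed 4-bit integers cover [-8, 7]; map the largest magnitude to 7.
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes: List[int], scale: float) -> List[float]:
    """Recover approximate weights from int4 codes and the group scale."""
    return [c * scale for c in codes]


# Hypothetical weights for one small group (real groups are much larger).
group = [0.70, -0.33, 0.12, -0.02]
codes, scale = quantize_group_int4(group)
restored = dequantize_group(codes, scale)

print("int4 codes:", codes)
print("restored  :", restored)
```

Each restored value differs from the original by at most half a quantization step (`scale / 2`), which is the error GPTQ tries to compensate for across the rest of the layer.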
Quick Start
from transformers import AutoTokenizer, AutoModelForCausalLM
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("arcee-ai/SuperNova-Medius-CM-w4a16")
model = AutoModelForCausalLM.from_pretrained("arcee-ai/SuperNova-Medius-CM-w4a16")
# Simple inference
text = "Hello, how are you?"
inputs = tokenizer(text, return_tensors="pt")
# Without an explicit limit, generate() falls back to a short default length
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Model Details
Specifications
- Base Model: arcee-ai/SuperNova-Medius
- Quantization Method: GPTQ
- Maximum Sequence Length: 4096
- Calibration Samples: 1024
Quantization Parameters
Parameter | Value |
---|---|
Weight Bits | 4 |
Activation Bits | 16 |
Ignored Layers | lm_head |
Dampening Fraction | 0.1 |
Calibration Dataset | neuralmagic/LLM_compression_calibration |
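The dampening fraction in the table can be illustrated with a short sketch. In the reference GPTQ formulation, a fraction of the mean diagonal of the layer Hessian is added back onto the diagonal before inversion, which keeps the matrix well conditioned. The code below shows that step in pure Python under that assumption; the function name and the 2x2 matrix are hypothetical, and the actual library works on much larger tensors.

```python
from typing import List


def dampen_hessian(hessian: List[List[float]], dampening_frac: float) -> List[List[float]]:
    """Return a copy of `hessian` with dampening added to its diagonal."""
    n = len(hessian)
    mean_diag = sum(hessian[i][i] for i in range(n)) / n
    damp = dampening_frac * mean_diag
    return [
        [hessian[i][j] + (damp if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical singular Hessian: the second row is twice the first.
H = [[1.0, 2.0], [2.0, 4.0]]
H_damped = dampen_hessian(H, dampening_frac=0.1)

det_before = H[0][0] * H[1][1] - H[0][1] * H[1][0]
det_after = H_damped[0][0] * H_damped[1][1] - H_damped[0][1] * H_damped[1][0]
print("determinant before:", det_before)
print("determinant after :", det_after)
```

With a dampening fraction of 0.1, the singular example becomes invertible. Larger fractions make the inversion more stable at the cost of a less exact error correction.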
Usage Guide
Basic Usage
See Quick Start section above.
Advanced Usage
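# Advanced generation with parameters
# (reuses `model`, `tokenizer`, and `inputs` from the Quick Start section)
output = model.generate(
    **inputs,
    max_length=100,
    num_beams=4,
    temperature=0.7,
    no_repeat_ngram_size=2,
    do_sample=True,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))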
Memory Optimization
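import torch
from transformers import AutoModelForCausalLM

# Load model with device map for multi-GPU setup
# (device_map="auto" requires the accelerate package)
model = AutoModelForCausalLM.from_pretrained(
    "arcee-ai/SuperNova-Medius-CM-w4a16",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)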
Quantization Process
import torch
from datasets import load_dataset
from transformers import AutoTokenizer
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
from llmcompressor.transformers.compression.helpers import calculate_offload_device_map

# Configuration
MODEL_ID = "arcee-ai/SuperNova-Medius"
NUM_SAMPLES = 1024
MAX_LENGTH = 4096
SEED = 42

# Calculate device map
device_map = calculate_offload_device_map(
    MODEL_ID,
    num_gpus=torch.cuda.device_count(),
    reserve_for_hessians=True,
    torch_dtype=torch.bfloat16
)

# Load model and tokenizer
model = SparseAutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map=device_map,
    torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Prepare calibration dataset
ds = load_dataset("neuralmagic/LLM_compression_calibration")
ds = ds["train"].shuffle(seed=SEED).select(range(NUM_SAMPLES))

def preprocess(example):
    # Render each chat transcript into a single prompt string
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}

ds = ds.map(preprocess)

def tokenize(sample):
    # The chat template already inserts special tokens, so none are added here
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_LENGTH,
        truncation=True,
        add_special_tokens=False
    )

ds = ds.map(tokenize)

# Configure quantization
recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head"],
    dampening_frac=0.1
)

# Execute quantization
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    oneshot_device=device_map,
    max_seq_length=MAX_LENGTH,
    num_calibration_samples=NUM_SAMPLES,
    accelerator_config={
        'split_batches': True,
        'dispatch_batches': None,
        'even_batches': True,
        'use_seedable_sampler': True,
        'non_blocking': False,
        'gradient_accumulation_kwargs': None,
        'use_configured_state': False
    }
)

# Save quantized model
model.save_pretrained("./arcee-ai/SuperNova-Medius-CM-w4a16", save_compressed=True)
tokenizer.save_pretrained("./arcee-ai/SuperNova-Medius-CM-w4a16")
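After the script finishes, a quick sanity check is to confirm that the saved `config.json` actually records the quantization settings. The sketch below reads that file with the standard library only. The key names `quantization_config` and `compression_config` are assumptions about how the compressed checkpoint is serialized and may differ between library versions; `read_quantization_config` is a hypothetical helper, not part of any library.

```python
import json
from pathlib import Path
from typing import Any, Dict, Optional


def read_quantization_config(model_dir: str) -> Optional[Dict[str, Any]]:
    """Return the quantization section of config.json, or None if absent."""
    config_path = Path(model_dir) / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    # Assumed key names; adjust if your library version writes a different one.
    for key in ("quantization_config", "compression_config"):
        if key in config:
            return config[key]
    return None


# Same output directory as the save_pretrained calls in the script above.
model_dir = "./arcee-ai/SuperNova-Medius-CM-w4a16"
if (Path(model_dir) / "config.json").exists():
    print(json.dumps(read_quantization_config(model_dir), indent=2))
```

If the helper returns `None`, the checkpoint was probably saved without compression metadata and will load as a regular full-precision model.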
Technical Details
Dependencies
Package | Version |
---|---|
Python | 3.9.x |
torch | 2.5.1 |
transformers | 4.46.2 |
llmcompressor | 0.5.0 |
vllm | 0.6.4 |
datasets | 3.1.0 |
huggingface_hub | 0.24.7 |
compressed-tensors | 0.8.0 |
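The dependency list includes vllm, so serving the checkpoint through vLLM is one plausible deployment path. The following is a minimal sketch under that assumption, using vLLM's offline `LLM` and `SamplingParams` interface. The sampling values and the helper names `sampling_kwargs` and `generate_with_vllm` are illustrative choices, not settings recommended by the model authors. vLLM is imported lazily so the module can be loaded on machines where it is not installed.

```python
from typing import Any, Dict, List

MODEL_ID = "arcee-ai/SuperNova-Medius-CM-w4a16"


def sampling_kwargs() -> Dict[str, Any]:
    """Illustrative sampling settings (assumed values, tune for your use case)."""
    return {"temperature": 0.7, "max_tokens": 100}


def generate_with_vllm(prompts: List[str]) -> List[str]:
    """Generate one completion per prompt with vLLM's offline engine."""
    # Lazy import: vLLM needs a supported GPU environment to be installed.
    from vllm import LLM, SamplingParams

    # max_model_len mirrors the 4096-token limit stated in this model card.
    llm = LLM(model=MODEL_ID, max_model_len=4096)
    outputs = llm.generate(prompts, SamplingParams(**sampling_kwargs()))
    return [out.outputs[0].text for out in outputs]
```

On a machine with vLLM and a suitable GPU, `generate_with_vllm(["Hello, how are you?"])` would return a list with one generated string.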
Hardware Requirements
- Minimum: 8GB VRAM
- Recommended: 16GB VRAM
- Optimal: 24GB VRAM or multiple GPUs
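The VRAM tiers above can be related to the weight footprint with some back-of-the-envelope arithmetic. The sketch below estimates weight storage only. The parameter count of 14 billion and the group size of 128 with 16-bit scales are assumptions, not figures stated in this model card, and the estimate ignores the unquantized lm_head and embeddings, the KV cache, and activation memory.

```python
def weight_memory_gib(
    num_params: float,
    weight_bits: int,
    group_size: int = 0,
    scale_bits: int = 16,
) -> float:
    """Estimate weight storage in GiB, optionally including per-group scales."""
    bits_per_weight = float(weight_bits)
    if group_size > 0:
        # One scale of `scale_bits` is shared by every `group_size` weights.
        bits_per_weight += scale_bits / group_size
    return num_params * bits_per_weight / 8 / 1024**3


NUM_PARAMS = 14e9  # assumption: approximate parameter count of the base model

fp16_gib = weight_memory_gib(NUM_PARAMS, weight_bits=16)
w4_gib = weight_memory_gib(NUM_PARAMS, weight_bits=4, group_size=128)

print(f"16-bit weights: {fp16_gib:.1f} GiB")
print(f"4-bit weights : {w4_gib:.1f} GiB")
```

Under these assumptions the 4-bit weights come to a little under 7 GiB, which suggests why 8GB is listed as a minimum and why longer prompts or larger batches benefit from the higher tiers.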
Limitations & Biases
Known Limitations
- Slight performance degradation compared to full-precision model
- Limited to 4096 token context window
- May require careful memory management on consumer GPUs
Inherited Biases
- Carries over biases from base model
- Users should implement appropriate content filtering
- Regular evaluation recommended for production deployments
Citations & Acknowledgements
Citation
@misc{SuperNovaMediusCMW4A16,
author = {Edward Kim and Jaro Uljanovs},
title = {SuperNova Medius Compressed Model W4A16},
year = {2024},
howpublished = {\url{https://huggingface.co/ConfidentialMind/arcee-ai-SuperNova-Medius-CM-w4a16}},
}
Acknowledgements
- Original Model: arcee-ai/SuperNova-Medius
- Quantization Tools: LLM Compressor
- Contributors: Edward Kim and Jaro Uljanovs
Version History
- v1.0.0 (2024-03): Initial release
- v1.0.1 (2024-03): Documentation updates