# SuperNova Medius Compressed Model (W4A16)
[![Model Size](https://img.shields.io/badge/Size-Compressed-green)]()
[![Quantization](https://img.shields.io/badge/Quantization-W4A16-blue)]()
[![Max Sequence Length](https://img.shields.io/badge/Max%20Length-4096-orange)]()
> **Model ID**: `arcee-ai/SuperNova-Medius-CM-w4a16`
## Table of Contents
- [Overview](#overview)
- [Quick Start](#quick-start)
- [Model Details](#model-details)
- [Usage Guide](#usage-guide)
- [Quantization Process](#quantization-process)
- [Technical Details](#technical-details)
- [Limitations & Biases](#limitations--biases)
- [Citations & Acknowledgements](#citations--acknowledgements)
## Overview
SuperNova Medius CM W4A16 is a quantized version of the `arcee-ai/SuperNova-Medius` model, optimized for efficient deployment. Using GPTQ, an accurate post-training quantization method, we've achieved a significant size reduction while maintaining near-original performance.
### Key Features
- 4-bit weight quantization
- Activations kept in 16-bit precision (only weights are quantized)
- 4096-token context window
- Optimized for deployment on consumer hardware
## Quick Start
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("arcee-ai/SuperNova-Medius-CM-w4a16")
model = AutoModelForCausalLM.from_pretrained("arcee-ai/SuperNova-Medius-CM-w4a16")
# Simple inference
text = "Hello, how are you?"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
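Because the base model is instruction-tuned, prompts generally work better when formatted through the tokenizer's chat template. A minimal sketch, reusing the model and tokenizer loaded above (the prompt text is illustrative):

```python
# Build the prompt with the chat template (recommended for instruct models)
messages = [{"role": "user", "content": "Summarize GPTQ in one sentence."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
outputs = model.generate(input_ids, max_new_tokens=128)
# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```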
## Model Details
### Specifications
- **Base Model**: arcee-ai/SuperNova-Medius
- **Quantization Method**: GPTQ
- **Maximum Sequence Length**: 4096
- **Calibration Samples**: 1024
### Quantization Parameters
| Parameter | Value |
|-----------|--------|
| Weight Bits | 4 |
| Activation Bits | 16 |
| Ignored Layers | lm_head |
| Dampening Fraction | 0.1 |
| Calibration Dataset | neuralmagic/LLM_compression_calibration |
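These settings are also recorded in the checkpoint's `config.json`, so they can be verified without downloading the weights. A quick sketch, assuming the standard `quantization_config` entry written by llmcompressor:

```python
from transformers import AutoConfig

# Fetch only the config and print the embedded quantization settings
config = AutoConfig.from_pretrained("arcee-ai/SuperNova-Medius-CM-w4a16")
print(config.quantization_config)  # scheme, ignored layers, etc.
```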
## Usage Guide
### Basic Usage
See Quick Start section above.
### Advanced Usage
```python
# Advanced generation with custom decoding parameters
inputs = tokenizer("Explain quantization in one paragraph.", return_tensors="pt")
output = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    no_repeat_ngram_size=2
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
### Memory Optimization
```python
import torch
from transformers import AutoModelForCausalLM

# Spread the model across available GPUs (and CPU, if needed)
model = AutoModelForCausalLM.from_pretrained(
    "arcee-ai/SuperNova-Medius-CM-w4a16",
    device_map="auto",
    torch_dtype=torch.bfloat16
)
```
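For higher-throughput serving, the checkpoint can also be loaded with vLLM (see the dependency table below), which supports compressed-tensors checkpoints. A minimal sketch:

```python
from vllm import LLM, SamplingParams

# vLLM loads the compressed checkpoint directly
llm = LLM(model="arcee-ai/SuperNova-Medius-CM-w4a16", max_model_len=4096)
params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Hello, how are you?"], params)
print(outputs[0].outputs[0].text)
```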
## Quantization Process
```python
import torch
from datasets import load_dataset
from transformers import AutoTokenizer
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
from llmcompressor.transformers.compression.helpers import calculate_offload_device_map
# Configuration
MODEL_ID = "arcee-ai/SuperNova-Medius"
NUM_SAMPLES = 1024
MAX_LENGTH = 4096
SEED = 42
# Calculate device map
device_map = calculate_offload_device_map(
MODEL_ID,
num_gpus=torch.cuda.device_count(),
reserve_for_hessians=True,
torch_dtype=torch.bfloat16
)
# Load model and tokenizer
model = SparseAutoModelForCausalLM.from_pretrained(
MODEL_ID,
device_map=device_map,
torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# Prepare calibration dataset
ds = load_dataset("neuralmagic/LLM_compression_calibration")
ds = ds["train"].shuffle(seed=SEED).select(range(NUM_SAMPLES))
def preprocess(example):
return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}
ds = ds.map(preprocess)
def tokenize(sample):
return tokenizer(
sample["text"],
padding=False,
max_length=MAX_LENGTH,
truncation=True,
add_special_tokens=False
)
ds = ds.map(tokenize, remove_columns=ds.column_names)
# Configure quantization
recipe = GPTQModifier(
targets="Linear",
scheme="W4A16",
ignore=["lm_head"],
dampening_frac=0.1
)
# Execute quantization
oneshot(
model=model,
dataset=ds,
recipe=recipe,
oneshot_device=device_map,
max_seq_length=MAX_LENGTH,
num_calibration_samples=NUM_SAMPLES,
accelerator_config={
'split_batches': True,
'dispatch_batches': None,
'even_batches': True,
'use_seedable_sampler': True,
'non_blocking': False,
'gradient_accumulation_kwargs': None,
'use_configured_state': False
}
)
# Save quantized model
model.save_pretrained("./arcee-ai/SuperNova-Medius-CM-w4a16", save_compressed=True)
tokenizer.save_pretrained("./arcee-ai/SuperNova-Medius-CM-w4a16")
```
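After saving, a quick round trip confirms the compressed checkpoint loads and generates as expected. A minimal smoke test:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Reload the compressed checkpoint and run a short generation
SAVE_DIR = "./arcee-ai/SuperNova-Medius-CM-w4a16"
check_model = AutoModelForCausalLM.from_pretrained(SAVE_DIR, device_map="auto")
check_tokenizer = AutoTokenizer.from_pretrained(SAVE_DIR)
ids = check_tokenizer("Sanity check:", return_tensors="pt").input_ids
ids = ids.to(check_model.device)
print(check_tokenizer.decode(check_model.generate(ids, max_new_tokens=16)[0]))
```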
## Technical Details
### Dependencies
| Package | Version |
|---------|---------|
| Python | 3.9.x |
| torch | 2.5.1 |
| transformers | 4.46.2 |
| llmcompressor | 0.5.0 |
| vllm | 0.6.4 |
| datasets | 3.1.0 |
| huggingface_hub | 0.24.7 |
| compressed-tensors | 0.8.0 |
### Hardware Requirements
- **Minimum**: 8GB VRAM
- **Recommended**: 16GB VRAM
- **Optimal**: 24GB VRAM or multiple GPUs
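These figures follow from a back-of-envelope estimate: assuming the base model's roughly 14B parameters, 4-bit weights alone take about 7 GB, and the rest of the budget goes to activations and the KV cache. Illustrative arithmetic:

```python
# Rough weight-memory estimate for W4A16 (assumes ~14B parameters; illustrative)
num_params = 14e9
bytes_per_weight = 4 / 8      # 4-bit weights
weight_gb = num_params * bytes_per_weight / 1e9
print(f"~{weight_gb:.0f} GB for weights alone")  # ~7 GB
```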
## Limitations & Biases
### Known Limitations
- Slight performance degradation compared to the full-precision model
- Limited to a 4096-token context window
- May require careful memory management on consumer GPUs
### Inherited Biases
- Inherits any biases present in the base model
- Users should implement appropriate content filtering
- Regular evaluation is recommended for production deployments
## Citations & Acknowledgements
### Citation
```bibtex
@misc{SuperNovaMediusCMW4A16,
author = {Edward Kim and Jaro Uljanovs},
title = {SuperNova Medius Compressed Model W4A16},
year = {2024},
howpublished = {\url{https://huggingface.co/ConfidentialMind/arcee-ai-SuperNova-Medius-CM-w4a16}},
}
```
### Acknowledgements
- Original Model: arcee-ai/SuperNova-Medius
- Quantization Tools: LLM Compressor
- Contributors: Edward Kim and Jaro Uljanovs
---
## Version History
- v1.0.0 (2024-03): Initial release
- v1.0.1 (2024-03): Documentation updates