# πŸš€ SuperNova Medius Compressed Model (W4A16)

[![Model Size](https://img.shields.io/badge/Size-Compressed-green)]()
[![Quantization](https://img.shields.io/badge/Quantization-W4A16-blue)]()
[![Max Sequence Length](https://img.shields.io/badge/Max%20Length-4096-orange)]()

> **Model ID**: `arcee-ai/SuperNova-Medius-CM-w4a16`

## πŸ“‹ Table of Contents
- [Overview](#overview)
- [Quick Start](#quick-start)
- [Model Details](#model-details)
- [Usage Guide](#usage-guide)
- [Quantization Process](#quantization-process)
- [Technical Details](#technical-details)
- [Limitations & Biases](#limitations--biases)
- [Citations & Acknowledgements](#citations--acknowledgements)

## πŸ” Overview

SuperNova Medius CM W4A16 is a quantized version of the `arcee-ai/SuperNova-Medius` model, optimized for efficient deployment. Using GPTQ post-training quantization, the weights are compressed to 4 bits, roughly a 4x reduction in weight storage compared to the 16-bit original, while maintaining near-original performance.

### ✨ Key Features
- 4-bit weight quantization
- 16-bit activations (kept at full 16-bit precision, per the W4A16 scheme)
- 4096 token context window
- Optimized for deployment on consumer hardware

## πŸš€ Quick Start

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("arcee-ai/SuperNova-Medius-CM-w4a16")
model = AutoModelForCausalLM.from_pretrained("arcee-ai/SuperNova-Medius-CM-w4a16")

# Simple inference
text = "Hello, how are you?"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
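
If a GPU is available, inference is considerably faster. A minimal sketch (assuming a single CUDA device with enough VRAM for the compressed weights) moves the model and inputs over before generating:

```python
import torch

# Move to GPU when one is available (assumes the model fits in VRAM)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```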

## πŸ“Š Model Details

### Specifications
- **Base Model**: arcee-ai/SuperNova-Medius
- **Quantization Method**: GPTQ
- **Maximum Sequence Length**: 4096
- **Calibration Samples**: 1024

### Quantization Parameters
| Parameter | Value |
|-----------|--------|
| Weight Bits | 4 |
| Activation Bits | 16 |
| Ignored Layers | lm_head |
| Dampening Fraction | 0.1 |
| Calibration Dataset | neuralmagic/LLM_compression_calibration |
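
These parameters travel with the checkpoint in its `config.json`, so they can be inspected without downloading the weights. A quick sketch, assuming the standard compressed-tensors layout written by llmcompressor:

```python
from transformers import AutoConfig

# The quantization scheme is recorded in the model config
config = AutoConfig.from_pretrained("arcee-ai/SuperNova-Medius-CM-w4a16")
print(config.quantization_config)  # weight bits, ignored layers, etc.
```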

## πŸ’» Usage Guide

### Basic Usage
See Quick Start section above.

### Advanced Usage

```python
# Advanced generation with sampling parameters
inputs = tokenizer("Hello, how are you?", return_tensors="pt")
output = model.generate(
    **inputs,
    max_length=100,
    num_beams=4,
    temperature=0.7,
    no_repeat_ngram_size=2,
    do_sample=True
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
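
Note that `do_sample=True` together with `num_beams=4` selects beam-search multinomial sampling in `transformers`; dropping `num_beams` gives plain temperature sampling, which is usually cheaper.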

### Memory Optimization

```python
import torch

# Load model with device map for multi-GPU setup
model = AutoModelForCausalLM.from_pretrained(
    "arcee-ai/SuperNova-Medius-CM-w4a16",
    device_map="auto",
    torch_dtype=torch.bfloat16
)
```
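
Since `vllm` appears in the dependency table below, serving the checkpoint with vLLM is another option. A minimal sketch, assuming a vLLM build that reads compressed-tensors W4A16 checkpoints:

```python
from vllm import LLM, SamplingParams

# vLLM detects the quantization scheme from the model config
llm = LLM(model="arcee-ai/SuperNova-Medius-CM-w4a16", max_model_len=4096)
sampling = SamplingParams(temperature=0.7, max_tokens=100)

outputs = llm.generate(["Hello, how are you?"], sampling)
print(outputs[0].outputs[0].text)
```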

## βš™οΈ Quantization Process

```python
import torch
from datasets import load_dataset
from transformers import AutoTokenizer
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
from llmcompressor.transformers.compression.helpers import calculate_offload_device_map

# Configuration
MODEL_ID = "arcee-ai/SuperNova-Medius"
NUM_SAMPLES = 1024
MAX_LENGTH = 4096
SEED = 42

# Calculate device map
device_map = calculate_offload_device_map(
    MODEL_ID,
    num_gpus=torch.cuda.device_count(),
    reserve_for_hessians=True,
    torch_dtype=torch.bfloat16
)

# Load model and tokenizer
model = SparseAutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map=device_map,
    torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Prepare calibration dataset
ds = load_dataset("neuralmagic/LLM_compression_calibration")
ds = ds["train"].shuffle(seed=SEED).select(range(NUM_SAMPLES))

def preprocess(example):
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}

ds = ds.map(preprocess)

def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_LENGTH,
        truncation=True,
        add_special_tokens=False
    )

ds = ds.map(tokenize)

# Configure quantization
recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head"],
    dampening_frac=0.1
)

# Execute quantization
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    oneshot_device=device_map,
    max_seq_length=MAX_LENGTH,
    num_calibration_samples=NUM_SAMPLES,
    accelerator_config={
        'split_batches': True,
        'dispatch_batches': None,
        'even_batches': True,
        'use_seedable_sampler': True,
        'non_blocking': False,
        'gradient_accumulation_kwargs': None,
        'use_configured_state': False
    }
)

# Save quantized model
model.save_pretrained("./arcee-ai/SuperNova-Medius-CM-w4a16", save_compressed=True)
tokenizer.save_pretrained("./arcee-ai/SuperNova-Medius-CM-w4a16")
```
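
A quick sanity check after saving is to reload the compressed checkpoint with plain `transformers` and generate a few tokens. A sketch, assuming `compressed-tensors` is installed so the weights can be decompressed at load time:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Reload the compressed checkpoint to confirm it round-trips
path = "./arcee-ai/SuperNova-Medius-CM-w4a16"
check_model = AutoModelForCausalLM.from_pretrained(path, device_map="auto")
check_tokenizer = AutoTokenizer.from_pretrained(path)

inputs = check_tokenizer("Quantization check:", return_tensors="pt").to(check_model.device)
print(check_tokenizer.decode(
    check_model.generate(**inputs, max_new_tokens=20)[0],
    skip_special_tokens=True,
))
```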

## πŸ› οΈ Technical Details

### Dependencies
| Package | Version |
|---------|---------|
| Python | 3.9.x |
| torch | 2.5.1 |
| transformers | 4.46.2 |
| llmcompressor | 0.5.0 |
| vllm | 0.6.4 |
| datasets | 3.1.0 |
| huggingface_hub | 0.24.7 |
| compressed-tensors | 0.8.0 |

### Hardware Requirements
- **Minimum**: 8GB VRAM
- **Recommended**: 16GB VRAM
- **Optimal**: 24GB VRAM or multiple GPUs
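
The 8GB floor follows from simple arithmetic on the weights alone. A back-of-the-envelope sketch, assuming the roughly 14B-parameter size of the SuperNova-Medius base model:

```python
# Rough VRAM estimate for 4-bit weights; KV cache, activations, and
# framework overhead come on top, so treat 8 GB as a hard floor.
params = 14e9                      # assumed base model parameter count
weight_gb = params * 4 / 8 / 1e9   # 4 bits per weight -> bytes -> GB
print(f"~{weight_gb:.1f} GB for weights alone")  # ~7.0 GB
```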

## ⚠️ Limitations & Biases

### Known Limitations
- Slight performance degradation compared to full-precision model
- Limited to 4096 token context window
- May require careful memory management on consumer GPUs

### Inherited Biases
- Carries over biases from base model
- Users should implement appropriate content filtering
- Regular evaluation recommended for production deployments

## πŸ“š Citations & Acknowledgements

### Citation

```bibtex
@misc{SuperNovaMediusCMW4A16,
  author = {Edward Kim and Jaro Uljanovs},
  title = {SuperNova Medius Compressed Model W4A16},
  year = {2024},
  howpublished = {\url{https://huggingface.co/ConfidentialMind/arcee-ai-SuperNova-Medius-CM-w4a16}},
}
```

### πŸ‘ Acknowledgements
- Original Model: arcee-ai/SuperNova-Medius
- Quantization Tools: LLM Compressor
- Contributors: Edward Kim and Jaro Uljanovs

---

## πŸ“ Version History

- v1.0.0 (2024-03): Initial release
- v1.0.1 (2024-03): Documentation updates