🚀 SuperNova Medius Compressed Model (W4A16)

Model ID: arcee-ai/SuperNova-Medius-CM-w4a16

🔍 Overview

SuperNova Medius CM W4A16 is a quantized version of the arcee-ai/SuperNova-Medius model, optimized for efficient deployment. Using GPTQ, a post-training quantization method, we reduced the linear-layer weights from 16-bit to 4-bit precision while keeping performance close to the original model.

✨ Key Features

  • 4-bit weight quantization
  • 16-bit activations (activations are not quantized)
  • 4096 token context window
  • Optimized for deployment on consumer hardware
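To make the W4A16 label concrete: each weight is stored as a 4-bit signed integer plus a shared scale, and is expanded back to 16-bit at inference time, while activations stay in 16-bit throughout. The sketch below uses plain round-to-nearest with one symmetric scale per group, which is simpler than GPTQ (GPTQ additionally uses calibration data to compensate for rounding error). The function names and the toy group of four weights are illustrative assumptions, not taken from this model's configuration.

```python
def quantize_group(weights, bits=4):
    """Symmetric round-to-nearest quantization of one group of weights."""
    qmax = 2 ** (bits - 1) - 1      # largest signed value, 7 for 4-bit
    qmin = -(2 ** (bits - 1))       # smallest signed value, -8 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round each weight to the nearest integer step and clamp to the 4-bit range.
    q = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return q, scale


def dequantize_group(q, scale):
    """Recover approximate weights from integers and the shared scale."""
    return [v * scale for v in q]


weights = [0.7, -0.3, 0.12, 0.0]
q, scale = quantize_group(weights)
restored = dequantize_group(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_error)
```

The reconstruction error of any weight is bounded by half a quantization step (`scale / 2`), which is why smaller groups with their own scales tend to lose less precision.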

🚀 Quick Start

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("arcee-ai/SuperNova-Medius-CM-w4a16")
model = AutoModelForCausalLM.from_pretrained("arcee-ai/SuperNova-Medius-CM-w4a16")

# Simple inference
text = "Hello, how are you?"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

📊 Model Details

Specifications

  • Base Model: arcee-ai/SuperNova-Medius
  • Quantization Method: GPTQ
  • Maximum Sequence Length: 4096
  • Calibration Samples: 1024

Quantization Parameters

| Parameter | Value |
|---|---|
| Weight Bits | 4 |
| Activation Bits | 16 |
| Ignored Layers | lm_head |
| Dampening Fraction | 0.1 |
| Calibration Dataset | neuralmagic/LLM_compression_calibration |
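The dampening fraction may be less familiar than the other parameters. In the GPTQ reference implementation, this fraction is multiplied by the mean of the Hessian's diagonal and the result is added to every diagonal entry before the matrix is inverted, which keeps the inversion numerically stable. The pure-Python sketch below illustrates that step on a toy 2x2 matrix; the function name and the matrix values are illustrative assumptions.

```python
def dampen_hessian(hessian, dampening_frac=0.1):
    """Add dampening_frac * mean(diag(H)) to each diagonal entry of H."""
    n = len(hessian)
    mean_diag = sum(hessian[i][i] for i in range(n)) / n
    damp = dampening_frac * mean_diag
    return [
        [hessian[i][j] + (damp if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]


# Toy symmetric matrix standing in for a layer's Hessian estimate.
H = [[4.0, 1.0], [1.0, 2.0]]
H_damped = dampen_hessian(H, dampening_frac=0.1)
print(H_damped)
```

A larger fraction makes the inversion more robust at the cost of a less exact error correction, so the value is a trade-off rather than a fixed constant.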

💻 Usage Guide

Basic Usage

See Quick Start section above.

Advanced Usage

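# Advanced generation with parameters
# Reuses `tokenizer` and `model` from the Quick Start section
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
output = model.generate(
    **inputs,
    max_length=100,           # total length in tokens, prompt included
    num_beams=4,
    temperature=0.7,
    no_repeat_ngram_size=2,
    do_sample=True            # with num_beams > 1 this samples within beam search
)
print(tokenizer.decode(output[0], skip_special_tokens=True))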
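The dependency list in this card includes vllm, but no vLLM usage is shown. The following is a minimal sketch of offline generation with vLLM's `LLM` and `SamplingParams` classes, assuming a GPU with enough memory and a vLLM build that can load this checkpoint format. The helper name `generate_with_vllm` is hypothetical, and `max_model_len=4096` simply mirrors the maximum sequence length stated in this card.

```python
MODEL_ID = "arcee-ai/SuperNova-Medius-CM-w4a16"


def generate_with_vllm(prompt, max_tokens=100):
    """Generate a completion with vLLM (hypothetical helper, not from the card)."""
    # Imported lazily so that defining this helper does not require vllm.
    from vllm import LLM, SamplingParams

    llm = LLM(model=MODEL_ID, max_model_len=4096)
    params = SamplingParams(temperature=0.7, max_tokens=max_tokens)
    outputs = llm.generate([prompt], params)
    # Each request yields one RequestOutput; take its first completion.
    return outputs[0].outputs[0].text
```

Calling `generate_with_vllm("Hello, how are you?")` downloads the checkpoint on first use. In a long-running service the `LLM` object would normally be created once and reused rather than rebuilt per call.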

Memory Optimization

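# Load model with device map for multi-GPU setup
# (device_map="auto" requires the accelerate package)
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "arcee-ai/SuperNova-Medius-CM-w4a16",
    device_map="auto",
    torch_dtype=torch.bfloat16
)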

⚙️ Quantization Process

import torch
from datasets import load_dataset
from transformers import AutoTokenizer
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
from llmcompressor.transformers.compression.helpers import calculate_offload_device_map

# Configuration
MODEL_ID = "arcee-ai/SuperNova-Medius"
NUM_SAMPLES = 1024
MAX_LENGTH = 4096
SEED = 42

# Calculate device map
device_map = calculate_offload_device_map(
    MODEL_ID,
    num_gpus=torch.cuda.device_count(),
    reserve_for_hessians=True,
    torch_dtype=torch.bfloat16
)

# Load model and tokenizer
model = SparseAutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map=device_map,
    torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Prepare calibration dataset
ds = load_dataset("neuralmagic/LLM_compression_calibration")
ds = ds["train"].shuffle(seed=SEED).select(range(NUM_SAMPLES))

def preprocess(example):
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}

ds = ds.map(preprocess)

def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_LENGTH,
        truncation=True,
        add_special_tokens=False
    )

ds = ds.map(tokenize)

# Configure quantization
recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head"],
    dampening_frac=0.1
)

# Execute quantization
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    oneshot_device=device_map,
    max_seq_length=MAX_LENGTH,
    num_calibration_samples=NUM_SAMPLES,
    accelerator_config={
        'split_batches': True,
        'dispatch_batches': None,
        'even_batches': True,
        'use_seedable_sampler': True,
        'non_blocking': False,
        'gradient_accumulation_kwargs': None,
        'use_configured_state': False
    }
)

# Save quantized model
model.save_pretrained("./arcee-ai/SuperNova-Medius-CM-w4a16", save_compressed=True)
tokenizer.save_pretrained("./arcee-ai/SuperNova-Medius-CM-w4a16")

🛠️ Technical Details

Dependencies

| Package | Version |
|---|---|
| Python | 3.9.x |
| torch | 2.5.1 |
| transformers | 4.46.2 |
| llmcompressor | 0.5.0 |
| vllm | 0.6.4 |
| datasets | 3.1.0 |
| huggingface_hub | 0.24.7 |
| compressed-tensors | 0.8.0 |

Hardware Requirements

  • Minimum: 8GB VRAM
  • Recommended: 16GB VRAM
  • Optimal: 24GB VRAM or multiple GPUs
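As a rough sanity check on these figures, weight memory can be estimated from the parameter count and the bits per weight. The sketch below assumes roughly 14 billion parameters, which is an assumption (the count is not stated in this card). It also ignores quantization scales, the unquantized lm_head and embeddings, the KV cache, and activation memory, so real usage will be higher.

```python
def weight_memory_gib(num_params, bits_per_weight):
    """Approximate weight storage in GiB for a given precision."""
    return num_params * bits_per_weight / 8 / 1024 ** 3


# Assumption: parameter count is not given in this card; 14e9 is illustrative.
ASSUMED_PARAMS = 14_000_000_000

bf16_gib = weight_memory_gib(ASSUMED_PARAMS, 16)
w4_gib = weight_memory_gib(ASSUMED_PARAMS, 4)
print(f"16-bit weights: {bf16_gib:.1f} GiB, 4-bit weights: {w4_gib:.1f} GiB")
```

Under these assumptions the 4-bit weights take a quarter of the 16-bit footprint, which leaves little headroom on the smallest configuration listed above once the cache and activations are added.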

⚠️ Limitations & Biases

Known Limitations

  • Slight performance degradation compared to full-precision model
  • Limited to 4096 token context window
  • May require careful memory management on consumer GPUs
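Because the prompt and the generated tokens share the same 4096-token window, it helps to budget generation length before calling the model. The following is a small sketch of that arithmetic; the helper name is hypothetical, and the prompt token count would come from the tokenizer (for example `len(inputs["input_ids"][0])`).

```python
MAX_SEQUENCE_LENGTH = 4096  # maximum sequence length stated in this card


def remaining_token_budget(prompt_token_count, limit=MAX_SEQUENCE_LENGTH):
    """Return how many new tokens fit after a prompt of the given length."""
    if prompt_token_count >= limit:
        raise ValueError(
            f"Prompt uses {prompt_token_count} tokens; the limit is {limit}."
        )
    return limit - prompt_token_count


budget = remaining_token_budget(3000)
print(budget)
```

The returned value can be passed as an upper bound for `max_new_tokens` so that generation never runs past the window.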

Inherited Biases

  • Carries over biases from base model
  • Users should implement appropriate content filtering
  • Regular evaluation recommended for production deployments

📚 Citations & Acknowledgements

Citation

@misc{SuperNovaMediusCMW4A16,
  author = {Edward Kim and Jaro Uljanovs},
  title = {SuperNova Medius Compressed Model W4A16},
  year = {2024},
  howpublished = {\url{https://huggingface.co/ConfidentialMind/arcee-ai-SuperNova-Medius-CM-w4a16}},
}

Acknowledgements

  • Original Model: arcee-ai/SuperNova-Medius
  • Quantization Tools: LLM Compressor
  • Contributors: Edward Kim and Jaro Uljanovs

📝 Version History

  • v1.0.0 (2024-03): Initial release
  • v1.0.1 (2024-03): Documentation updates