# 🚀 SuperNova Medius Compressed Model (W4A16)

[![Model Size](https://img.shields.io/badge/Size-Compressed-green)]() [![Quantization](https://img.shields.io/badge/Quantization-W4A16-blue)]() [![Max Sequence Length](https://img.shields.io/badge/Max%20Length-4096-orange)]()

> **Model ID**: `arcee-ai/SuperNova-Medius-CM-w4a16`

## 📋 Table of Contents

- [Overview](#overview)
- [Quick Start](#quick-start)
- [Model Details](#model-details)
- [Usage Guide](#usage-guide)
- [Quantization Process](#quantization-process)
- [Technical Details](#technical-details)
- [Limitations & Biases](#limitations--biases)
- [Citations & Acknowledgements](#citations--acknowledgements)

## 🔍 Overview

SuperNova Medius CM W4A16 is a quantized version of the `arcee-ai/SuperNova-Medius` model, optimized for efficient deployment. Using GPTQ post-training quantization, we've achieved a significant size reduction while maintaining near-original performance.

### ✨ Key Features

- 4-bit quantized weights
- 16-bit activations
- 4096-token context window
- Optimized for deployment on consumer hardware

## 🚀 Quick Start

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("arcee-ai/SuperNova-Medius-CM-w4a16")
model = AutoModelForCausalLM.from_pretrained("arcee-ai/SuperNova-Medius-CM-w4a16")

# Simple inference
text = "Hello, how are you?"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## 📊 Model Details

### Specifications

- **Base Model**: arcee-ai/SuperNova-Medius
- **Quantization Method**: GPTQ
- **Maximum Sequence Length**: 4096
- **Calibration Samples**: 1024

### Quantization Parameters

| Parameter | Value |
|-----------|-------|
| Weight Bits | 4 |
| Activation Bits | 16 |
| Ignored Layers | lm_head |
| Dampening Fraction | 0.1 |
| Calibration Dataset | neuralmagic/LLM_compression_calibration |

## 💻 Usage Guide

### Basic Usage

See the Quick Start section above.
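Because the calibration step in the Quantization Process below renders samples with the tokenizer's chat template, chat-style prompting is the expected way to query the model. The following is a minimal sketch that reuses the `tokenizer` and `model` from the Quick Start; the message content and generation settings are illustrative, and it assumes the base model's chat template is bundled with the quantized tokenizer.

```python
# Chat-style generation via the tokenizer's chat template
# (assumes the chat template ships with the quantized tokenizer)
messages = [{"role": "user", "content": "Explain W4A16 quantization in one sentence."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
outputs = model.generate(input_ids, max_new_tokens=128)

# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```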
### Advanced Usage

```python
# Advanced generation with sampling parameters
# (reuses `inputs` from the Quick Start example)
output = model.generate(
    **inputs,
    max_length=100,
    num_beams=4,
    temperature=0.7,
    no_repeat_ngram_size=2,
    do_sample=True
)
```

### Memory Optimization

```python
import torch
from transformers import AutoModelForCausalLM

# Load model with a device map for multi-GPU setups
model = AutoModelForCausalLM.from_pretrained(
    "arcee-ai/SuperNova-Medius-CM-w4a16",
    device_map="auto",
    torch_dtype=torch.bfloat16
)
```

## ⚙️ Quantization Process

```python
import torch
from datasets import load_dataset
from transformers import AutoTokenizer
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
from llmcompressor.transformers.compression.helpers import calculate_offload_device_map

# Configuration
MODEL_ID = "arcee-ai/SuperNova-Medius"
NUM_SAMPLES = 1024
MAX_LENGTH = 4096
SEED = 42

# Calculate a device map that reserves room for GPTQ Hessians
device_map = calculate_offload_device_map(
    MODEL_ID,
    num_gpus=torch.cuda.device_count(),
    reserve_for_hessians=True,
    torch_dtype=torch.bfloat16
)

# Load model and tokenizer
model = SparseAutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map=device_map,
    torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Prepare calibration dataset
ds = load_dataset("neuralmagic/LLM_compression_calibration")
ds = ds["train"].shuffle(seed=SEED).select(range(NUM_SAMPLES))

def preprocess(example):
    # Render each conversation with the model's chat template
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}

ds = ds.map(preprocess)

def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_LENGTH,
        truncation=True,
        add_special_tokens=False
    )

ds = ds.map(tokenize)

# Configure quantization: 4-bit weights, 16-bit activations, lm_head left untouched
recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head"],
    dampening_frac=0.1
)

# Execute one-shot quantization
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    oneshot_device=device_map,
    max_seq_length=MAX_LENGTH,
    num_calibration_samples=NUM_SAMPLES,
    accelerator_config={
        'split_batches': True,
        'dispatch_batches': None,
        'even_batches': True,
        'use_seedable_sampler': True,
        'non_blocking': False,
        'gradient_accumulation_kwargs': None,
        'use_configured_state': False
    }
)

# Save the compressed model and tokenizer
model.save_pretrained("./arcee-ai/SuperNova-Medius-CM-w4a16", save_compressed=True)
tokenizer.save_pretrained("./arcee-ai/SuperNova-Medius-CM-w4a16")
```

## 🛠️ Technical Details

### Dependencies

| Package | Version |
|---------|---------|
| Python | 3.9.x |
| torch | 2.5.1 |
| transformers | 4.46.2 |
| llmcompressor | 0.5.0 |
| vllm | 0.6.4 |
| datasets | 3.1.0 |
| huggingface_hub | 0.24.7 |
| compressed-tensors | 0.8.0 |

### Hardware Requirements

- **Minimum**: 8GB VRAM
- **Recommended**: 16GB VRAM
- **Optimal**: 24GB VRAM or multiple GPUs
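### Serving with vLLM

Since vLLM appears in the dependency list and the checkpoint is saved in compressed-tensors format, it can typically be served with vLLM directly. The snippet below is a minimal offline-inference sketch; the sampling settings are illustrative, and it assumes the pinned vLLM release recognizes the W4A16 compressed-tensors checkpoint.

```python
from vllm import LLM, SamplingParams

# vLLM detects the compressed-tensors W4A16 scheme from the model config
llm = LLM(model="arcee-ai/SuperNova-Medius-CM-w4a16", max_model_len=4096)
sampling_params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Hello, how are you?"], sampling_params)
print(outputs[0].outputs[0].text)
```

For an OpenAI-compatible endpoint, the same checkpoint can typically be exposed with `vllm serve arcee-ai/SuperNova-Medius-CM-w4a16 --max-model-len 4096`.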
## ⚠️ Limitations & Biases

### Known Limitations

- Slight performance degradation compared to the full-precision model
- Limited to a 4096-token context window
- May require careful memory management on consumer GPUs

### Inherited Biases

- Carries over biases from the base model
- Users should implement appropriate content filtering
- Regular evaluation recommended for production deployments

## 📚 Citations & Acknowledgements

### Citation

```bibtex
@misc{SuperNovaMediusCMW4A16,
  author = {Edward Kim and Jaro Uljanovs},
  title = {SuperNova Medius Compressed Model W4A16},
  year = {2024},
  howpublished = {\url{https://huggingface.co/ConfidentialMind/arcee-ai-SuperNova-Medius-CM-w4a16}},
}
```

### 👏 Acknowledgements

- Original Model: arcee-ai/SuperNova-Medius
- Quantization Tools: LLM Compressor
- Contributors: Edward Kim and Jaro Uljanovs

---

## 📝 Version History

- v1.0.0 (2024-03): Initial release
- v1.0.1 (2024-03): Documentation updates