---
license: apache-2.0
---

## Overview

This model is optimized for use with [vLLM](https://github.com/vllm-project/vllm) on NVIDIA GPUs with compute capability 8.0 or higher (Ampere and later, e.g., A100, A10, 3090). It uses a weight-only FP8 Marlin kernel, giving an efficient W8A16 configuration: 8-bit weights with 16-bit activations.
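
As a rough illustration of what W8A16 means for weight memory, the back-of-the-envelope sketch below compares 16-bit and 8-bit weight storage. The 32-billion parameter count is an assumption read off the "32B" in the model name, and the estimate deliberately ignores quantization scales, the KV cache, and activations.

```python
def weight_memory_gib(num_params: int, bits_per_weight: int) -> float:
    """Approximate weight storage in GiB (weights only, no scales or KV cache)."""
    return num_params * bits_per_weight / 8 / 2**30

# Assumption: roughly 32e9 parameters, taken from the "32B" in the model name.
params = 32_000_000_000

fp16_gib = weight_memory_gib(params, 16)  # unquantized 16-bit weights
fp8_gib = weight_memory_gib(params, 8)    # W8A16: 8-bit weights

print(f"FP16 weights: {fp16_gib:.1f} GiB")
print(f"FP8 weights:  {fp8_gib:.1f} GiB")
```

Under these assumptions the weights shrink by half, which is the main reason a 32B model becomes practical on a pair of 24 GB cards.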

### Key Features of FP8 Marlin

The Neural Magic FP8 Marlin kernel achieves its efficiency by packing four 8-bit values into an int32 and performing a 4xFP8 to 4xFP16/BF16 dequantization using bit arithmetic and SIMT operations. This approach yields nearly a **2x speedup** over FP16 on most models while maintaining **near-lossless quality**.
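
To make the packing and bit-arithmetic dequantization concrete, here is a minimal pure-Python sketch. It assumes the E4M3 FP8 layout (1 sign, 4 exponent, 3 mantissa bits, bias 7) and little-endian byte order within the word, and the helper names are hypothetical. It illustrates the general bit trick rather than reproducing the actual CUDA kernel, and it ignores NaN encodings.

```python
import struct

def pack4_fp8(b0: int, b1: int, b2: int, b3: int) -> int:
    """Pack four FP8 bytes into one 32-bit word (b0 in the lowest byte)."""
    return b0 | (b1 << 8) | (b2 << 16) | (b3 << 24)

def fp8_e4m3_to_fp16(byte: int) -> float:
    """Dequantize one E4M3 byte using only shifts, masks, and one multiply."""
    # Sign goes to bit 15. Exponent and mantissa are shifted so the 4-bit FP8
    # exponent lands in the low 4 bits of FP16's 5-bit exponent field and the
    # 3 mantissa bits land at the top of FP16's 10-bit mantissa.
    half_bits = ((byte & 0x80) << 8) | ((byte & 0x7F) << 7)
    # Reinterpret the 16 bits as an IEEE half-precision float.
    (half,) = struct.unpack("<e", struct.pack("<H", half_bits))
    # The exponent biases differ (15 for FP16, 7 for E4M3), so rescale by 2**8.
    return half * 256.0

def unpack4_fp8(word: int) -> list:
    """Extract and dequantize the four FP8 values held in a 32-bit word."""
    return [fp8_e4m3_to_fp16((word >> (8 * i)) & 0xFF) for i in range(4)]

# 0x38 = 1.0, 0xB8 = -1.0, 0x7E = 448.0 (E4M3 max), 0x01 = 2**-9 (smallest subnormal)
word = pack4_fp8(0x38, 0xB8, 0x7E, 0x01)
values = unpack4_fp8(word)
print(hex(word), values)
```

The same shift-and-rescale step handles normal and subnormal inputs alike, which is what makes the conversion cheap enough to do on the fly.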

#### FP8 Advantages on NVIDIA GPUs

On newer NVIDIA GPUs (4090/H100 or later), dedicated FP8 tensor cores and hardware support allow fast conversion from FP8 to BF16/FP16, maximizing performance. Older GPUs lack this hardware, so FP8 activation quantization is not available on them. The Marlin kernel fills this gap by quantizing only the weights, enabling performance gains on Ampere cards (e.g., 3090, A100) without needing FP8 tensor core support.

Traditional int8 quantization methods often incur significant overhead converting between int8 and fp16, making them less efficient for inference. Marlin’s FP8 kernel sidesteps this limitation by staying predominantly in FP16, which removes the need for such conversions at runtime.

### Optimizations in the Marlin Kernel

The Marlin kernel is tuned for performance using several techniques:

- **Asynchronous Global Weight Loads**: Uses non-blocking `cuda::memcpy_async` (hardware-accelerated since Ampere) to load weights directly into shared memory. This hides latency by overlapping data transfers with computation.
- **Circular Shared Memory Queue**: A cyclic buffer lets loading and processing proceed continuously, so computation does not stall waiting on data.
- **Optimized Task Scheduling and Synchronization**: Uses Stream-K parallelization with non-uniform partitioning, which improves GPU utilization by minimizing idle time and balancing work across Streaming Multiprocessors (SMs).
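
The circular shared memory queue can be pictured with a small host-side simulation. This is a conceptual sketch in plain Python, not CUDA: the function name, the `depth` parameter, and the use of a list sum as stand-in "compute" are all hypothetical, and the loads are ordinary assignments standing in for asynchronous copies.

```python
def pipelined_sum(tiles, depth=4):
    """Consume tiles through a ring of `depth` slots, prefetching ahead."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    slots = [None] * depth

    # Prologue: fill depth-1 slots before any compute starts.
    for i in range(min(depth - 1, len(tiles))):
        slots[i % depth] = tiles[i]

    total = 0
    for i in range(len(tiles)):
        # Issue the load for the tile depth-1 steps ahead. It reuses the slot
        # that was consumed in the previous iteration, so nothing live is lost.
        ahead = i + depth - 1
        if ahead < len(tiles):
            slots[ahead % depth] = tiles[ahead]
        # "Compute" on the current slot while later slots are already filled.
        total += sum(slots[i % depth])
    return total

tiles = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
result = pipelined_sum(tiles, depth=3)
print(result)
```

On a GPU the point of the ring is that the load for a later tile is in flight while the current tile is being multiplied; the simulation only shows the slot bookkeeping that makes this safe.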

These optimizations enable GPUs like the 3090 and A100 to approach native FP8 performance with minimal sacrifices, making the Marlin kernel highly effective on cards without native FP8 support.
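
The Stream-K idea mentioned above can be sketched as follows. This shows the generic scheme only (split the total number of inner-loop iterations evenly across SMs, letting an SM's share cross output-tile boundaries); it is not Marlin's actual scheduling code, and the function and parameter names are hypothetical.

```python
def stream_k_partition(num_tiles, iters_per_tile, num_sms):
    """Return, per SM, a list of (tile, first_iter, end_iter) work segments."""
    total = num_tiles * iters_per_tile
    base, extra = divmod(total, num_sms)
    assignments = []
    start = 0
    for sm in range(num_sms):
        # Every SM gets `base` iterations; the first `extra` SMs get one more.
        end = start + base + (1 if sm < extra else 0)
        segments = []
        pos = start
        while pos < end:
            tile, first_iter = divmod(pos, iters_per_tile)
            # A segment stops at the tile boundary or at this SM's quota.
            seg_end = min(end, (tile + 1) * iters_per_tile)
            segments.append((tile, first_iter, seg_end - tile * iters_per_tile))
            pos = seg_end
        assignments.append(segments)
        start = end
    return assignments

# 3 output tiles of 4 iterations each, shared by 2 SMs: 6 iterations per SM.
plan = stream_k_partition(num_tiles=3, iters_per_tile=4, num_sms=2)
for sm, segments in enumerate(plan):
    print(f"SM {sm}: {segments}")
```

In this example the middle tile is split between the two SMs, so both do equal work even though three tiles do not divide evenly by two. A real kernel then needs a reduction step to combine the partial results for any tile that was split.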

### FP8 Marlin Details

- Developed by [Michael Goin and the Neural Magic team](https://github.com/vllm-project/vllm/pull/5975), FP8 Marlin is specifically designed for NVIDIA’s GPU architecture, providing a compact and high-performance format.
- FP8 achieves nearly lossless compression, making it suitable for scenarios where quantization errors in traditional int4 or int8 formats might degrade output quality.

### Why FP8?

This FP8-quantized model was uploaded to explore high-precision quantization. Traditional int4 quantization, as seen in models like `Qwen/Qwen2.5-Coder-32B-Instruct-int4`, can sometimes produce poor outputs with repeated tokens due to quantization errors. In contrast, FP8 does not require calibration data and achieves robust, near-lossless compression.

As shown in Neural Magic's recent paper ([arXiv:2411.02355](https://arxiv.org/pdf/2411.02355)), int4 recovers FP16 accuracy only to a limited degree without careful calibration. FP8, especially in the W8A16 format, maintains high-quality outputs without calibration, making it ideal for high-precision applications such as code generation.

### How to Quantize Your Own Models to FP8 W8A16

Included in this repo is a script that converts the weights of any HF model to W8A16. (TBH it's kinda glitched and writes two duplicate copies to disk. If anyone wants to fix it, feel free to submit a PR, but if it ain't broke, I'm not gonna fix it.)

How to use the script:

Have vLLM installed and run `pip install llmcompressor==0.1.0`.

Then just run the script. It will ask you for the model name; enter it and it will do the rest. **NOTE:** this uses CPU RAM to avoid OOM errors. If you somehow, on God's green earth, have more GPU VRAM than CPU RAM, edit the script to load to the GPU.

## How to Run

To launch the API server for this model, use the following command:

```bash
python3 -m vllm.entrypoints.openai.api_server \
    --model Vezora/QwQ-32B-Preview-fp8-W8A16 \
    --dtype auto \
    --api-key token-abc123 \
    --quantization compressed-tensors \
    --max-num-batched-tokens 16384 \
    --max-model-len 16384 \
    --tensor-parallel-size 2 \
    --gpu-memory-utilization 0.99