alexmarques's picture
Create README.md
0dfc961 verified
|
raw
history blame
2.53 kB
metadata
tags:
  - vllm
  - sparsity
pipeline_tag: text-generation
license: llama3.1
base_model: neuralmagic/Sparse-Llama-3.1-8B-2of4

Sparse-Llama-3.1-8B-ultrachat_200k-2of4

Model Overview

  • Model Architecture: Llama-3.1-8B
    • Input: Text
    • Output: Text
  • Model Optimizations:
    • Sparsity: 2:4
  • Release Date: 11/21/2024
  • Version: 1.0
  • License(s): llama3.1
  • Model Developers: Neural Magic

This is a multi-turn conversational AI model obtained by fine-tuning the 2:4 sparse Sparse-Llama-3.1-8B-2of4 on the ultrachat_200k dataset. On the AlpacaEval benchmark (version 1), it achieves a score of 61.1, compared to 62.0 for the fine-tuned dense model Llama-3.1-8B-ultrachat_200k — demonstrating a 99.4% accuracy recovery.

Model Optimizations

This inherits the optimizations from its parent, Sparse-Llama-3.1-8B-2of4. Namely, all linear operators within transformer blocks were pruned to the 2:4 sparsity pattern: in each group of four weights, two are retained while two are pruned.

Deployment with vLLM

This model can be deployed efficiently using the vLLM backend. vLLM aslo supports OpenAI-compatible serving. See the documentation for more details.

Evaluation

This model was evaluated on Neural Magic's fork of AlpacaEval benchmark. We adopt the same setup as in Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment, using version 1 of the benchmark and Llama-2-70b-chat as the annotator.

Accuracy

AlpacaEval Benchmark

Metric Llama-3.1-8B-ultrachat_200k Sparse-Llama-3.1-8B-ultrachat_200k-2of4
Win rate 62.0 61.1