neuralmagic
/

Sparse-Llama-3.1-8B-ultrachat_200k-2of4

Text Generation

Model card Files Files and versions Community

Sparse-Llama-3.1-8B-ultrachat_200k-2of4 / README.md

alexmarques's picture

Create README.md

0dfc961 verified about 1 month ago

|

2.53 kB

	---
	tags:
	- vllm
	- sparsity
	pipeline_tag: text-generation
	license: llama3.1
	base_model: neuralmagic/Sparse-Llama-3.1-8B-2of4
	---

	# Sparse-Llama-3.1-8B-ultrachat_200k-2of4

	## Model Overview
	- Model Architecture: Llama-3.1-8B
	- Input: Text
	- Output: Text
	- Model Optimizations:
	- Sparsity: 2:4
	- Release Date: 11/21/2024
	- Version: 1.0
	- License(s): [llama3.1](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B/blob/main/LICENSE)
	- Model Developers: Neural Magic

	This is a multi-turn conversational AI model obtained by fine-tuning the 2:4 sparse [Sparse-Llama-3.1-8B-2of4](https://huggingface.co/neuralmagic/Sparse-Llama-3.1-8B-2of4) on the [ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) dataset.
	On the [AlpacaEval](https://github.com/tatsu-lab/alpaca_eval) benchmark (version 1), it achieves a score of 61.1, compared to 62.0 for the fine-tuned dense model [Llama-3.1-8B-ultrachat_200k](https://huggingface.co/neuralmagic/Llama-3.1-8B-ultrachat_200k) — demonstrating a 99.4% accuracy recovery.


	### Model Optimizations

	This inherits the optimizations from its parent, [Sparse-Llama-3.1-8B-2of4](https://huggingface.co/neuralmagic/Sparse-Llama-3.1-8B-2of4).
	Namely, all linear operators within transformer blocks were pruned to the 2:4 sparsity pattern: in each group of four weights, two are retained while two are pruned.


	## Deployment with vLLM

	This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend. vLLM aslo supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.


	## Evaluation

	This model was evaluated on Neural Magic's fork of [AlpacaEval](https://github.com/neuralmagic/alpaca_eval) benchmark.
	We adopt the same setup as in [Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment](https://arxiv.org/abs/2405.03594), using version 1 of the benchmark and [Llama-2-70b-chat](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf) as the annotator.

	### Accuracy
	#### AlpacaEval Benchmark
	<table>
	<tr>
	<td><strong>Metric</strong></td>
	<td style="text-align: center"><strong>Llama-3.1-8B-ultrachat_200k</strong></td>
	<td style="text-align: center"><strong>Sparse-Llama-3.1-8B-ultrachat_200k-2of4</strong></td>
	</tr>
	<tr>
	<td>Win rate</td>
	<td style="text-align: center">62.0</td>
	<td style="text-align: center">61.1</td>
	</tr>
	</table>