MaskLLM: Learnable Semi-structured Sparsity for Large Language Models

This work introduces MaskLLM, a learnable pruning method that establishes semi-structured (or "N:M") sparsity in LLMs, aimed at reducing computational overhead during inference. The proposed method is scalable and stands to benefit from larger training datasets.

Requirements

We provide pre-computed masks for Hugging Face models such as LLaMA-2 7B and LLaMA-3 8B with minimal requirements. Using them does not involve Docker, Megatron, or data preprocessing.

pip install transformers accelerate datasets SentencePiece 

Pre-computed Masks

The following masks were trained and provided by @VainF. We use huggingface_hub to download these masks automatically and apply them to the official LLMs for evaluation. The mask files were compressed with numpy.savez_compressed; a short sketch for inspecting them follows the table below. More results for baselines (SparseGPT, Wanda) can be found in the appendix.

| Model | Pattern | Training Data | Training/Eval SeqLen | PPL (Dense) | PPL (SparseGPT) | PPL (MaskLLM) | Link |
|---|---|---|---|---|---|---|---|
| LLaMA-2 7B | 2:4 | C4 (2B Tokens) | 4096 | 5.12 | 10.42 | 6.78 | HuggingFace |
| LLaMA-3 8B | 2:4 | C4 (2B Tokens) | 4096 | 5.75 | 17.64 | 8.49 | HuggingFace |
| LLaMA-3.1 8B | 2:4 | C4 (2B Tokens) | 4096 | 5.89 | 18.65 | 8.58 | HuggingFace |
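
Because each mask file is a compressed NumPy archive (numpy.savez_compressed), it can be inspected directly once downloaded. The sketch below is illustrative only: the file name mask.npz and the assumption that each array key corresponds to one linear-layer weight mask are hypothetical; the actual layout is defined in NVlabs/MaskLLM.

```python
# Minimal sketch: inspect a downloaded .npz mask file and verify the 2:4 pattern,
# i.e. exactly 2 non-zeros in every contiguous group of 4 weights along a row.
# "mask.npz" and the per-layer key names are assumptions for illustration.
import numpy as np

masks = np.load("mask.npz")
for key in masks.files:
    m = masks[key].astype(bool)
    groups = m.reshape(-1, 4)  # group 4 consecutive weights along the input dim
    assert (groups.sum(axis=1) == 2).all(), f"{key} is not 2:4 sparse"
    print(key, m.shape, f"density={m.mean():.2f}")
```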

How to use it

Please see the official repository, NVlabs/MaskLLM, for detailed instructions.
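
For a quick start, the sketch below downloads a pre-computed mask with huggingface_hub and multiplies it into the dense checkpoint's weights. It is a minimal sketch under stated assumptions: the file name "mask.npz" and the assumption that array keys match the model's parameter names are hypothetical, and the base model id is the standard Hugging Face LLaMA-3 8B checkpoint; consult NVlabs/MaskLLM for the actual format and loading code.

```python
# Minimal sketch (assumptions noted above): download a pre-computed 2:4 mask and
# apply it to the dense Hugging Face checkpoint by zeroing the pruned weights.
import numpy as np
import torch
from huggingface_hub import hf_hub_download
from transformers import AutoModelForCausalLM

mask_path = hf_hub_download("Vinnnf/LLaMA-3-8B-MaskLLM-C4", "mask.npz")  # hypothetical file name
masks = np.load(mask_path)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", torch_dtype=torch.bfloat16
)

with torch.no_grad():
    for name, param in model.named_parameters():
        if name in masks.files:  # assumes keys mirror parameter names
            mask = torch.from_numpy(masks[name]).to(param.device, dtype=param.dtype)
            param.mul_(mask)  # zero out 2 of every 4 weights
```

The masked model can then be evaluated with the usual transformers generation or perplexity pipelines; on suitable GPUs, the 2:4 pattern can additionally be converted to hardware-accelerated sparse kernels.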
