MaskLLM: Learnable Semi-structured Sparsity for Large Language Models

This work introduces MaskLLM, a learnable pruning method that establishes semi-structured (or "N:M") sparsity in LLMs, aimed at reducing computational overhead during inference. The proposed method is scalable and stands to benefit from larger training datasets.

Requirements

We provide pre-computed masks for Hugging Face models such as LLaMA-2 7B and LLaMA-3 8B with minimal requirements. Using them does not involve Docker, Megatron, or data preprocessing.

pip install transformers accelerate datasets SentencePiece 

Pre-computed Masks

The following masks were trained and provided by @VainF. We use huggingface_hub to download these masks automatically and apply them to the official LLMs for evaluation. The mask files were compressed with numpy.savez_compressed; a short sketch for inspecting them follows the table below. More results for baselines (SparseGPT, Wanda) can be found in the appendix.

| Model | Pattern | Training Data | Training/Eval SeqLen | PPL (Dense) | PPL (SparseGPT) | PPL (MaskLLM) | Link |
|---|---|---|---|---|---|---|---|
| LLaMA-2 7B | 2:4 | C4 (2B Tokens) | 4096 | 5.12 | 10.42 | 6.78 | HuggingFace |
| LLaMA-3 8B | 2:4 | C4 (2B Tokens) | 4096 | 5.75 | 17.64 | 8.49 | HuggingFace |
| LLaMA-3.1 8B | 2:4 | C4 (2B Tokens) | 4096 | 5.89 | 18.65 | 8.58 | HuggingFace |
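
Because each mask file is a compressed NumPy archive (numpy.savez_compressed), it can be inspected directly once downloaded. The sketch below is illustrative only: the file name mask.npz and the assumption that each array key corresponds to one linear-layer weight mask are hypothetical; the actual layout is defined in NVlabs/MaskLLM.

```python
# Minimal sketch: inspect a downloaded .npz mask file and verify the 2:4 pattern,
# i.e. exactly 2 non-zeros in every contiguous group of 4 weights along a row.
# "mask.npz" and the per-layer key names are assumptions for illustration.
import numpy as np

masks = np.load("mask.npz")
for key in masks.files:
    m = masks[key].astype(bool)
    groups = m.reshape(-1, 4)  # group 4 consecutive weights along the input dim
    assert (groups.sum(axis=1) == 2).all(), f"{key} is not 2:4 sparse"
    print(key, m.shape, f"density={m.mean():.2f}")
```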

How to use it

Please see the official repository, NVlabs/MaskLLM, for detailed instructions.
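
For a quick start, the sketch below downloads a pre-computed mask with huggingface_hub and multiplies it into the dense checkpoint's weights. It is a minimal sketch under stated assumptions: the file name "mask.npz" and the assumption that array keys match the model's parameter names are hypothetical, and the base model id is the standard Hugging Face LLaMA-3 8B checkpoint; consult NVlabs/MaskLLM for the actual format and loading code.

```python
# Minimal sketch (assumptions noted above): download a pre-computed 2:4 mask and
# apply it to the dense Hugging Face checkpoint by zeroing the pruned weights.
import numpy as np
import torch
from huggingface_hub import hf_hub_download
from transformers import AutoModelForCausalLM

mask_path = hf_hub_download("Vinnnf/LLaMA-3-8B-MaskLLM-C4", "mask.npz")  # hypothetical file name
masks = np.load(mask_path)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", torch_dtype=torch.bfloat16
)

with torch.no_grad():
    for name, param in model.named_parameters():
        if name in masks.files:  # assumes keys mirror parameter names
            mask = torch.from_numpy(masks[name]).to(param.device, dtype=param.dtype)
            param.mul_(mask)  # zero out 2 of every 4 weights
```

The masked model can then be evaluated with the usual transformers generation or perplexity pipelines; on suitable GPUs, the 2:4 pattern can additionally be converted to hardware-accelerated sparse kernels.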
