Sparse Finetuning for Inference Acceleration of Large Language Models Paper • 2310.06927 • Published Oct 10, 2023 • 14
Towards End-to-end 4-Bit Inference on Generative Large Language Models Paper • 2310.09259 • Published Oct 13, 2023 • 1
SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression Paper • 2306.03078 • Published Jun 5, 2023 • 3
RoSA: Accurate Parameter-Efficient Fine-Tuning via Robust Adaptation Paper • 2401.04679 • Published Jan 9, 2024 • 2
Extreme Compression of Large Language Models via Additive Quantization Paper • 2401.06118 • Published Jan 11, 2024 • 12
Accurate Neural Network Pruning Requires Rethinking Sparse Optimization Paper • 2308.02060 • Published Aug 3, 2023 • 1
Mitigating the Impact of Outlier Channels for Language Model Quantization with Activation Regularization Paper • 2404.03605 • Published Apr 4, 2024 • 1
Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning Paper • 2208.11580 • Published Aug 24, 2022
GMP*: Well-Tuned Gradual Magnitude Pruning Can Outperform Most BERT-Pruning Methods Paper • 2210.06384 • Published Oct 12, 2022 • 1
L-GreCo: Layerwise-Adaptive Gradient Compression for Efficient and Accurate Deep Learning Paper • 2210.17357 • Published Oct 31, 2022 • 1
Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment Paper • 2405.03594 • Published May 6, 2024 • 7
PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression Paper • 2405.14852 • Published May 23, 2024 • 1
MicroAdam: Accurate Adaptive Optimization with Low Space Overhead and Provable Convergence Paper • 2405.15593 • Published May 24, 2024 • 1
SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot Paper • 2301.00774 • Published Jan 2, 2023 • 3
Quantized Distributed Training of Large Models with Convergence Guarantees Paper • 2302.02390 • Published Feb 5, 2023
SparseProp: Efficient Sparse Backpropagation for Faster Training of Neural Networks Paper • 2302.04852 • Published Feb 9, 2023