SparseLLM: Towards Global Pruning for Pre-trained Language Models
Abstract
The transformative impact of large language models (LLMs) such as LLaMA and GPT on natural language processing is countered by their prohibitive computational demands. Pruning has emerged as a pivotal compression strategy, introducing sparsity to improve both memory and computational efficiency. Yet traditional global pruning is impractical for LLMs due to scalability issues, while local pruning, despite its efficiency, leads to suboptimal solutions. To address these challenges, we propose SparseLLM, a novel framework that decomposes the global pruning process into manageable, coordinated subproblems, enabling resource-efficient optimization with global optimality. By conceptualizing LLMs as a chain of modular functions and leveraging auxiliary variables for problem decomposition, SparseLLM not only makes global pruning practical for LLMs but also delivers significant performance improvements, particularly in high-sparsity regimes, where it surpasses current state-of-the-art methods.
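To make the decomposition concrete, below is a minimal sketch of the general idea: express the network as a chain of functions, introduce an auxiliary variable for each intermediate activation, and alternate between per-layer pruning subproblems and updates to the auxiliary variables that keep the layers mutually consistent. The two-layer linear chain, the coupling weight `rho`, and the one-shot magnitude masks are illustrative assumptions, not SparseLLM's exact subproblems or solver.

```python
# Hypothetical sketch of auxiliary-variable decomposition for global pruning.
# The toy linear chain, quadratic coupling, and magnitude masks below are
# assumptions for illustration, not the paper's exact formulation.
import numpy as np

rng = np.random.default_rng(0)

# Toy "chain of modular functions": two dense linear layers.
d = 16
W1 = rng.standard_normal((d, d))
W2 = rng.standard_normal((d, d))
X = rng.standard_normal((64, d))   # calibration inputs
Y = X @ W1.T @ W2.T                # dense outputs we want to preserve


def prune_layer(Z_in, Z_out, sparsity=0.7):
    """Local subproblem: least-squares fit Z_out ~= Z_in @ W.T, then apply
    a magnitude mask (a simplification: no mask-aware refit)."""
    M, *_ = np.linalg.lstsq(Z_in, Z_out, rcond=None)
    W = M.T
    k = int(sparsity * W.size)
    thresh = np.partition(np.abs(W).ravel(), k)[k]
    return W * (np.abs(W) >= thresh)


# The auxiliary variable Z1 stands for layer 1's output and decouples the
# two layers; alternating per-layer pruning with a Z1 update against both
# neighbors keeps the subproblems globally coordinated.
Z1 = X @ W1.T   # initialize with the dense activations
rho = 1.0       # coupling weight between the subproblems (assumed)
for _ in range(5):
    W1p = prune_layer(X, Z1)   # fit pruned layer 1 to the current Z1
    W2p = prune_layer(Z1, Y)   # fit pruned layer 2 from Z1 to the target
    # Closed-form Z1 update for
    #   min_Z ||Z - X @ W1p.T||^2 + rho * ||Z @ W2p.T - Y||^2
    A = np.eye(d) + rho * (W2p.T @ W2p)
    B = X @ W1p.T + rho * (Y @ W2p)
    Z1 = np.linalg.solve(A, B.T).T   # A is symmetric positive definite

err = np.linalg.norm(X @ W1p.T @ W2p.T - Y) / np.linalg.norm(Y)
print(f"relative output error of the pruned chain: {err:.3f}")
```

Running the sketch shows both layers being pruned against a shared, jointly updated activation target rather than each layer's frozen dense input, which is the key difference from purely local pruning.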