Poor long-context performance?
I extended the context to 40,000 tokens and ran the model with llama.cpp, but unfortunately the output was gibberish. The model ran fine with smaller contexts. Is this an artifact of the original model? Also, do you have any recommendations on how to make the model work better for longer contexts?
As far as IQ3_XXS is concerned, it works for me at -c 40000. Which gguf did you try? Did it work for you at lower context?
./build/bin/llama-cli -m /Llama-3_1-Nemotron-51B-Instruct-GGUF/Llama-3_1-Nemotron-51B-Instruct.imatrix.IQ3_XXS.gguf -p 'You are a World History Professor called Niall Ferguson.' -c 40000 -cnv -ngl 58
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 4377 (643e5e8a) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 3090) - 23982 MiB free
llama_model_loader: loaded meta data with 38 key-value pairs and 630 tensors from /home/user/Llama-3_1-Nemotron-51B-Instruct-GGUF/Llama-3_1-Nemotron-51B-Instruct.imatrix.IQ3_XXS.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = deci
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama 3_1 Nemotron 51B Instruct
llama_model_loader: - kv 3: general.finetune str = 3_1-Nemotron-Instruct
llama_model_loader: - kv 4: general.basename str = Llama
llama_model_loader: - kv 5: general.size_label str = 51B
llama_model_loader: - kv 6: general.license str = other
llama_model_loader: - kv 7: general.license.name str = nvidia-open-model-license
llama_model_loader: - kv 8: general.license.link str = https://developer.download.nvidia.com...
llama_model_loader: - kv 9: general.tags arr[str,4] = ["nvidia", "llama-3", "pytorch", "tex...
llama_model_loader: - kv 10: general.languages arr[str,1] = ["en"]
llama_model_loader: - kv 11: deci.attention.head_count_kv arr[i32,80] = [8, 4, 8, 8, 8, 2, 2, 1, 1, 2, 2, 0, ...
llama_model_loader: - kv 12: deci.attention.head_count arr[i32,80] = [64, 64, 64, 64, 64, 64, 64, 64, 64, ...
llama_model_loader: - kv 13: deci.feed_forward_length arr[i32,80] = [7168, 14336, 28672, 28672, 28672, 14...
llama_model_loader: - kv 14: deci.block_count u32 = 80
llama_model_loader: - kv 15: deci.context_length u32 = 131072
llama_model_loader: - kv 16: deci.embedding_length u32 = 8192
llama_model_loader: - kv 17: deci.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 18: deci.attention.key_length u32 = 128
llama_model_loader: - kv 19: deci.attention.value_length u32 = 128
llama_model_loader: - kv 20: general.file_type u32 = 23
llama_model_loader: - kv 21: deci.vocab_size u32 = 128256
llama_model_loader: - kv 22: deci.rope.dimension_count u32 = 128
llama_model_loader: - kv 23: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 24: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 25: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 26: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 27: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 28: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 128001
llama_model_loader: - kv 30: tokenizer.ggml.eom_token_id u32 = 128008
llama_model_loader: - kv 31: tokenizer.ggml.eot_token_id u32 = 128009
llama_model_loader: - kv 32: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 33: general.quantization_version u32 = 2
llama_model_loader: - kv 34: quantize.imatrix.file str = /home/user/Llama-3_1-Nemotron-51B-Ins...
llama_model_loader: - kv 35: quantize.imatrix.dataset str = /tank/ai/langchain/calibration_datav3...
llama_model_loader: - kv 36: quantize.imatrix.entries_count i32 = 474
llama_model_loader: - kv 37: quantize.imatrix.chunks_count i32 = 32
llama_model_loader: - type f32: 154 tensors
llama_model_loader: - type q5_K: 55 tensors
llama_model_loader: - type iq3_xxs: 240 tensors
llama_model_loader: - type iq3_s: 73 tensors
llama_model_loader: - type iq2_s: 108 tensors
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = deci
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_layer = 80
llm_load_print_meta: n_head = [64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 0, 64, 64, 0, 64, 0, 64, 64, 0, 64, 64, 0, 0, 64, 0, 0, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64]
llm_load_print_meta: n_head_kv = [8, 4, 8, 8, 8, 2, 2, 1, 1, 2, 2, 0, 1, 2, 2, 0, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 0, 8, 8, 0, 0, 0, 0, 0, 0, 0, 8, 0, 0, 0, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8]
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = [8, 16, 8, 8, 8, 32, 32, 64, 64, 32, 32, 0, 64, 32, 32, 0, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 0, 8, 8, 0, 0, 0, 0, 0, 0, 0, 8, 0, 0, 0, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8]
llm_load_print_meta: n_embd_k_gqa = [1024, 512, 1024, 1024, 1024, 256, 256, 128, 128, 256, 256, 0, 128, 256, 256, 0, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 0, 1024, 1024, 0, 0, 0, 0, 0, 0, 0, 1024, 0, 0, 0, 1024, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024]
llm_load_print_meta: n_embd_v_gqa = [1024, 512, 1024, 1024, 1024, 256, 256, 128, 128, 256, 256, 0, 128, 256, 256, 0, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 0, 1024, 1024, 0, 0, 0, 0, 0, 0, 0, 1024, 0, 0, 0, 1024, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024]
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = [7168, 14336, 28672, 28672, 28672, 14336, 14336, 14336, 14336, 14336, 14336, 14336, 14336, 14336, 14336, 7168, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 14336, 28672, 28672, 14336, 28672, 14336, 14336, 14336, 7168, 7168, 28672, 7168, 7168, 7168, 28672, 7168, 7168, 7168, 7168, 7168, 7168, 7168, 7168, 7168, 7168, 7168, 7168, 7168, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672]
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 70B
llm_load_print_meta: model ftype = IQ3_XXS - 3.0625 bpw
llm_load_print_meta: model params = 51.50 B
llm_load_print_meta: model size = 18.80 GiB (3.14 BPW)
llm_load_print_meta: general.name = Llama 3_1 Nemotron 51B Instruct
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128001 '<|end_of_text|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128001 '<|end_of_text|>'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: offloading 58 repeating layers to GPU
llm_load_tensors: offloaded 58/81 layers to GPU
llm_load_tensors: CUDA0 model buffer size = 13154.00 MiB
llm_load_tensors: CPU_Mapped model buffer size = 6094.35 MiB
................................................................................................
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 40000
llama_new_context_with_model: n_ctx_per_seq = 40000
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (40000) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 40000, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 80
llama_kv_cache_init: CUDA0 KV buffer size = 5312.50 MiB
llama_kv_cache_init: CPU KV buffer size = 1933.59 MiB
llama_new_context_with_model: KV self size = 7246.09 MiB, K (f16): 3623.05 MiB, V (f16): 3623.05 MiB
llama_new_context_with_model: CPU output buffer size = 0.49 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 5360.50 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 94.13 MiB
llama_new_context_with_model: graph nodes = 2014
llama_new_context_with_model: graph splits = 237 (with bs=512), 3 (with bs=1)
common_init_from_params: setting dry_penalty_last_n to ctx_size = 40000
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 6
main: chat template example:
<|start_header_id|>system<|end_header_id|>
You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>
Hello<|eot_id|><|start_header_id|>assistant<|end_header_id|>
Hi there<|eot_id|><|start_header_id|>user<|end_header_id|>
How are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
system_info: n_threads = 6 (n_threads_batch = 6) / 6 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | F16C = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
main: interactive mode on.
sampler seed: 72485766
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 40000
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 40000, n_batch = 2048, n_predict = -1, n_keep = 1
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to the AI.
- To return control without starting a new line, end your input with '/'.
- If you want to submit another line, end your input with '\'.
system
You are a World History Professor called Niall Ferguson.
Why was Duke Vladivoj enfeoffed Duchy of Bohemia with the Holy Roman Empire in 1002? Does that mean Duchy of Bohemia was part of the Holy Roman Empire already? If so, when did the Holy Roman Empire acquired Bohemia?
The history of Eastern Europe in the 10th and 11th centuries is quite complex, and I'll do my best to clarify things for you.
The Duchy of Bohemia was indeed a part of the Holy Roman Empire, but it wasn't acquired by the Empire in the classical sense.
In 1002, Duke Vladivoj (also known as Vladivoj of Bohemia) was enfeoffed with the Duchy of Bohemia by the Holy Roman Emperor Otto III. However, this wasn't a straightforward case of the Empire acquiring new territory.
Instead, the Duchy of Bohemia was already a de facto part of the Holy Roman Empire, thanks to a combination of historical circumstances, dynastic alliances, and imperial patronage.
In the 9th and 10th centuries, the Duchy of Bohemia was a part of the Kingdom of Moravia, and later the Kingdom of Bohemia, which was a client state of the Holy Roman Empire.
Over time, the Duchy of Bohemia became increasingly tied to the Holy Roman Empire through dynastic marriages, imperial patronage, and military alliances.
In 1355, the Duchy of Bohemia was elevated to a kingdom by the Holy Roman Emperor Charles IV, who was also the King of Bohemia.
So, to answer your original question, Duke Vladivoj was enfeoffed with the Duchy of Bohemia in 1002 because the Duchy was already a de facto part of the Holy Roman Empire, and the Empire's rulers saw fit to grant Vladivoj control over the Duchy as a vassal of the Empire.
llama_perf_sampler_print: sampling time = 33.76 ms / 396 runs ( 0.09 ms per token, 11728.82 tokens per second)
llama_perf_context_print: load time = 6571.90 ms
llama_perf_context_print: prompt eval time = 10895.87 ms / 82 tokens ( 132.88 ms per token, 7.53 tokens per second)
llama_perf_context_print: eval time = 162951.38 ms / 331 runs ( 492.30 ms per token, 2.03 tokens per second)
llama_perf_context_print: total time = 175857.06 ms / 413 tokens
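As a sanity check on the memory side, the KV buffer sizes in the log above can be reproduced by hand from the per-layer n_head_kv values. The sketch below is only my own arithmetic, not anything taken from llama.cpp: the constant and function names are made up for illustration, and the GPU/CPU split assumes that with -ngl 58 the last 58 of the 80 layers are the ones offloaded (an assumption, although the result matches the log).

```python
# Per-layer KV head counts, copied from the n_head_kv line of the log above.
N_HEAD_KV = (
    [8, 4, 8, 8, 8, 2, 2, 1, 1, 2, 2, 0, 1, 2, 2, 0]
    + [8] * 26
    + [0, 8, 8]
    + [0] * 7
    + [8, 0, 0, 0, 8]
    + [0] * 13
    + [8] * 10
)
HEAD_DIM = 128   # n_embd_head_k = n_embd_head_v = 128
N_CTX = 40000    # -c 40000
BYTES_F16 = 2    # type_k = type_v = 'f16'
MIB = 1024 * 1024


def kv_cache_mib(head_counts, n_ctx=N_CTX):
    """Size in MiB of K plus V for the given layers at context length n_ctx."""
    elems_per_token = sum(head_counts) * HEAD_DIM
    one_side = elems_per_token * n_ctx * BYTES_F16  # K alone; V is the same size
    return 2 * one_side / MIB


assert len(N_HEAD_KV) == 80  # n_layer = 80

total = kv_cache_mib(N_HEAD_KV)
# Assumption: with -ngl 58 the last 58 layers are the ones placed on the GPU.
gpu = kv_cache_mib(N_HEAD_KV[-58:])
cpu = kv_cache_mib(N_HEAD_KV[:-58])

print(f"KV self size: {total:.2f} MiB (K: {total / 2:.2f}, V: {total / 2:.2f})")
# -> KV self size: 7246.09 MiB (K: 3623.05, V: 3623.05)
print(f"CUDA0 KV buffer: {gpu:.2f} MiB, CPU KV buffer: {cpu:.2f} MiB")
# -> CUDA0 KV buffer: 5312.50 MiB, CPU KV buffer: 1933.59 MiB
```

The same function with n_ctx=131072 gives 23744 MiB for an f16 cache, so the full trained context would not fit next to the weights on a single 24 GB card.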
Thank you for looking into it. I was using Llama-3_1-Nemotron-51B-Instruct.imatrix.Q5_K_M.gguf. Your prompt was very small (82 tokens). Can you try a prompt with many more tokens? Here's an example which outputs gibberish for me (over 16,400 prompt tokens):
Given the following readme file with papers, please rank the top ten most interesting sounding papers:
Papers
Survey
A Survey on Model Compression for Large Language Models
TACL [Paper]
The Cost of Compression: Investigating the Impact of Compression on Parametric Knowledge in Language Models
EMNLP 2023 [Paper] [Code]
The Efficiency Spectrum of Large Language Models: An Algorithmic Survey
Arxiv 2023 [Paper]
Efficient Large Language Models: A Survey
TMLR [Paper] [GitHub Page]
Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems
ICML 2024 Tutorial [Paper] [Tutorial]
Understanding LLMs: A Comprehensive Overview from Training to Inference
Arxiv 2024 [Paper]
Faster and Lighter LLMs: A Survey on Current Challenges and Way Forward
IJCAI 2024 (Survey Track) [Paper] [GitHub Page]
A Survey of Resource-efficient LLM and Multimodal Foundation Models
Arxiv 2024 [Paper]
A Survey on Hardware Accelerators for Large Language Models
Arxiv 2024 [Paper]
A Comprehensive Survey of Compression Algorithms for Language Models
Arxiv 2024 [Paper]
A Survey on Transformer Compression
Arxiv 2024 [Paper]
Model Compression and Efficient Inference for Large Language Models: A Survey
Arxiv 2024 [Paper]
LLM Inference Unveiled: Survey and Roofline Model Insights
Arxiv 2024 [Paper]
A Survey on Knowledge Distillation of Large Language Models
Arxiv 2024 [Paper] [GitHub Page]
Efficient Prompting Methods for Large Language Models: A Survey
Arxiv 2024 [Paper]
Survey on Knowledge Distillation for Large Language Models: Methods, Evaluation, and Application
Arxiv 2024 [Paper]
On-Device Language Models: A Comprehensive Review
Arxiv 2024 [Paper] [GitHub Page] [Download On-device LLMs]
A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms
Arxiv 2024 [Paper]
Contextual Compression in Retrieval-Augmented Generation for Large Language Models: A Survey
Arxiv 2024 [Paper]
Prompt Compression for Large Language Models: A Survey
Arxiv 2024 [Paper]
A Comprehensive Study on Quantization Techniques for Large Language Models
Arxiv 2024 [Paper]
Quantization
ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers
NeurIPS 2022 [Paper] [Code (DeepSpeed)]
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
NeurIPS 2022 [Paper] [Code]
Outlier Suppression: Pushing the Limit of Low-bit Transformer Language Models
NeurIPS 2022 [Paper] [Code]
LUT-GEMM: Quantized Matrix Multiplication based on LUTs for Efficient Inference in Large-Scale Generative Language Models
Arxiv 2022 [Paper]
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
ICML 2023 [Paper] [Code]
FlexRound: Learnable Rounding based on Element-wise Division for Post-Training Quantization
ICML 2023 [Paper] [Code (DeepSpeed)]
Understanding INT4 Quantization for Transformer Models: Latency Speedup, Composability, and Failure Cases
ICML 2023 [Paper] [Code]
The case for 4-bit precision: k-bit Inference Scaling Laws
ICML 2023 [Paper]
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
ICLR 2023 [Paper] [Code]
PreQuant: A Task-agnostic Quantization Approach for Pre-trained Language Models
ACL 2023 [Paper]
Boost Transformer-based Language Models with GPU-Friendly Sparsity and Quantization
ACL 2023 [Paper]
QLoRA: Efficient Finetuning of Quantized LLMs
NeurIPS 2023 [Paper] [Code]
The Quantization Model of Neural Scaling
NeurIPS 2023 [Paper]
Quantized Distributed Training of Large Models with Convergence Guarantees
ICML 2023 [Paper]
RPTQ: Reorder-based Post-training Quantization for Large Language Models
Arxiv 2023 [Paper] [Code]
ZeroQuant-V2: Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation
AAAI 2024 [Paper] [Code]
Integer or Floating Point? New Outlooks for Low-Bit Quantization on Large Language Models
Arxiv 2023 [Paper]
Memory-Efficient Fine-Tuning of Compressed Large Language Models via sub-4-bit Integer Quantization
NeurIPS 2023 [Paper]
Compress, Then Prompt: Improving Accuracy-Efficiency Trade-off of LLM Inference with Transferable Prompt
Arxiv 2023 [Paper]
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
MLSys 2024 (Best Paper 🏆) [Paper] [Code]
LLM-QAT: Data-Free Quantization Aware Training for Large Language Models
ACL Findings 2024 [Paper] [Code]
SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression
ICLR 2024 [Paper] [Code]
OWQ: Lessons learned from activation outliers for weight quantization in large language models
AAAI 2024 [Paper]
SqueezeLLM: Dense-and-Sparse Quantization
ICML 2024 [Paper] [Code]
INT2.1: Towards Fine-Tunable Quantized Large Language Models with Error Correction through Low-Rank Adaptation
Arxiv 2023 [Paper]
LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning
ICLR 2024 [Paper]
INT-FP-QSim: Mixed Precision and Formats For Large Language Models and Vision Transformers
Arxiv 2023 [Paper] [Code]
QIGen: Generating Efficient Kernels for Quantized Inference on Large Language Models
Arxiv 2023 [Paper] [Code]
Do Emergent Abilities Exist in Quantized Large Language Models: An Empirical Study
COLING 2024 [Paper]
ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats
Arxiv 2023 [Paper] [Code (DeepSpeed)]
OliVe: Accelerating Large Language Models via Hardware-friendly Outlier-Victim Pair Quantization
ISCA 2023 [Paper]
NUPES : Non-Uniform Post-Training Quantization via Power Exponent Search
Arxiv 2023 [Paper]
GPT-Zip: Deep Compression of Finetuned Large Language Models
ICML 2023 Workshop ES-FoMO [Paper]
Generating Efficient Kernels for Quantized Inference on Large Language Models
ICML 2023 Workshop ES-FoMO [Paper]
Gradient-Based Post-Training Quantization: Challenging the Status Quo
Arxiv 2023 [Paper]
FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only Quantization for LLMs
Arxiv 2023 [Paper]
OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models
ICLR 2024 [Paper] [Code]
FPTQ: Fine-grained Post-Training Quantization for Large Language Models
Arxiv 2023 [Paper]
eDKM: An Efficient and Accurate Train-time Weight Clustering for Large Language Models
IEEE Computer Architecture Letters 2023 [Paper]
QuantEase: Optimization-based Quantization for Language Models -- An Efficient and Intuitive Algorithm
Arxiv 2023 [Paper]
Norm Tweaking: High-performance Low-bit Quantization of Large Language Models
AAAI 2024 [Paper]
Understanding the Impact of Post-Training Quantization on Large-scale Language Models
Arxiv 2023 [Paper]
MEMORY-VQ: Compression for Tractable Internet-Scale Memory
NAACL 2024 [Paper]
Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs
EMNLP Findings 2024 [Paper] [Code]
Efficient Post-training Quantization with FP8 Formats
MLSys 2024 [Paper] [Code (Intel® Neural Compressor)]
QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models
ICLR 2024 [Paper] [Code]
Rethinking Channel Dimensions to Isolate Outliers for Low-bit Weight Quantization of Large Language Models
ICLR 2024 [Paper] [Code]
ModuLoRA: Finetuning 3-Bit LLMs on Consumer GPUs by Integrating with Modular Quantizers
TMLR (Featured Certification 🌟) [Paper]
PB-LLM: Partially Binarized Large Language Models
ICLR 2024 [Paper] [Code]
Dual Grained Quantization: Efficient Fine-Grained Quantization for LLM
Arxiv 2023 [Paper]
QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models
ICLR 2024 [Paper] [Code]
LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models
ICLR 2024 [Paper] [Code]
QFT: Quantized Full-parameter Tuning of LLMs with Affordable Resources
Arxiv 2023 [Paper]
TEQ: Trainable Equivalent Transformation for Quantization of LLMs
Arxiv 2023 [Paper] [Code (Intel® Neural Compressor)]
BitNet: Scaling 1-bit Transformers for Large Language Models
Arxiv 2023 [Paper] [Code]
FP8-LM: Training FP8 Large Language Models
Arxiv 2023 [Paper] [Code]
QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models
EMNLP 2024 [Paper] [Code]
AFPQ: Asymmetric Floating Point Quantization for LLMs
ACL Findings 2024 [Paper] [Code]
AWEQ: Post-Training Quantization with Activation-Weight Equalization for Large Language Models
Arxiv 2023 [Paper]
Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
MLSys 2024 [Paper] [Code]
QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models
Arxiv 2023 [Paper]
Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models
Arxiv 2023 [Paper]
How Does Calibration Data Affect the Post-training Pruning and Quantization of Large Language Models?
Arxiv 2023 [Paper]
A Speed Odyssey for Deployable Quantization of LLMs
Arxiv 2023 [Paper]
Enabling Fast 2-bit LLM on GPUs: Memory Alignment, Sparse Outlier, and Asynchronous Dequantization
Arxiv 2023 [Paper]
Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing
NeurIPS 2023 [Paper] [Code]
Efficient LLM Inference on CPUs
NeurIPS 2023 on Efficient Natural Language and Speech Processing [Paper] [Code]
The Cost of Compression: Investigating the Impact of Compression on Parametric Knowledge in Language Models
EMNLP Findings 2023 [Paper]
Zero-Shot Sharpness-Aware Quantization for Pre-trained Language Models
EMNLP 2023 [Paper]
Revisiting Block-based Quantisation: What is Important for Sub-8-bit LLM Inference?
EMNLP 2023 [Paper] [Code]
Outlier Suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling
EMNLP 2023 [Paper]
Watermarking LLMs with Weight Quantization
EMNLP 2023 [Paper] [Code]
Enhancing Computation Efficiency in Large Language Models through Weight and Activation Quantization
EMNLP 2023 [Paper]
LLM-FP4: 4-Bit Floating-Point Quantized Transformers
EMNLP 2023 [Paper] [Code]
Agile-Quant: Activation-Guided Quantization for Faster Inference of LLMs on the Edge
AAAI 2024 [Paper]
SmoothQuant+: Accurate and Efficient 4-bit Post-Training WeightQuantization for LLM
Arxiv 2023 [Paper]
CBQ: Cross-Block Quantization for Large Language Models
Arxiv 2023 [Paper]
ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric Strategy for Diverse Generative Tasks
Arxiv 2023 [Paper]
QuIP: 2-Bit Quantization of Large Language Models With Guarantees
NeurIPS 2023 [Paper] [Code]
A Performance Evaluation of a Quantized Large Language Model on Various Smartphones
Arxiv 2023 [Paper]
DeltaZip: Multi-Tenant Language Model Serving via Delta Compression
Arxiv 2023 [Paper] [Code]
FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGA
FPGA 2024 [Paper]
Extreme Compression of Large Language Models via Additive Quantization
ICML 2024 [Paper]
Quantized Side Tuning: Fast and Memory-Efficient Tuning of Quantized Large Language Models
Arxiv 2024 [Paper]
Inferflow: an Efficient and Highly Configurable Inference Engine for Large Language Models
Arxiv 2024 [Paper]
FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design
USENIX ATC 2024 [Paper]
Can Large Language Models Understand Context?
Arxiv 2024 [Paper]
EdgeQAT: Entropy and Distribution Guided Quantization-Aware Training for the Acceleration of Lightweight LLMs on the Edge
Arxiv 2024 [Paper] [Code]
Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs
Arxiv 2024 [Paper]
LQER: Low-Rank Quantization Error Reconstruction for LLMs
ICML 2024 [Paper]
BiLLM: Pushing the Limit of Post-Training Quantization for LLMs
Arxiv 2024 [Paper] [Code]
QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks
ICML 2024 [Paper] [Code]
L4Q: Parameter Efficient Quantization-Aware Training on Large Language Models via LoRA-wise LSQ
Arxiv 2024 [Paper]
TP-Aware Dequantization
Arxiv 2024 [Paper]
ApiQ: Finetuning of 2-Bit Quantized Large Language Model
EMNLP 2024 [Paper]
Accurate LoRA-Finetuning Quantization of LLMs via Information Retention
Arxiv 2024 [Paper] [Code]
BitDelta: Your Fine-Tune May Only Be Worth One Bit
Arxiv 2024 [Paper] [Code]
QDyLoRA: Quantized Dynamic Low-Rank Adaptation for Efficient Large Language Model Tuning
EMNLP 2024 Industry Track [Paper]
Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs
ICML 2024 [Paper]
BitDistiller: Unleashing the Potential of Sub-4-Bit LLMs via Self-Distillation
ACL 2024 [Paper] [Code]
OneBit: Towards Extremely Low-bit Large Language Models
Arxiv 2024 [Paper]
DB-LLM: Accurate Dual-Binarization for Efficient LLMs
ACL Findings 2024 [Paper]
WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More
Arxiv 2024 [Paper]
GPTVQ: The Blessing of Dimensionality for LLM Quantization
Arxiv 2024 [Paper] [Code]
APTQ: Attention-aware Post-Training Mixed-Precision Quantization for Large Language Models
DAC 2024 [Paper]
A Comprehensive Evaluation of Quantization Strategies for Large Language Models
DAC 2024 [Paper]
Evaluating Quantized Large Language Models
Arxiv 2024 [Paper]
FlattenQuant: Breaking Through the Inference Compute-bound for Large Language Models with Per-tensor Quantization
Arxiv 2024 [Paper]
LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization
Arxiv 2024 [Paper]
IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact
ACL Findings 2024 [Paper] [Code]
On the Compressibility of Quantized Large Language Models
Arxiv 2024 [Paper]
EasyQuant: An Efficient Data-free Quantization Algorithm for LLMs
Arxiv 2024 [Paper]
What Makes Quantization for Large Language Models Hard? An Empirical Study from the Lens of Perturbation
Arxiv 2024 [Paper]
SVD-LLM: Truncation-aware Singular Value Decomposition for Large Language Model Compression
Arxiv 2024 [Paper] [Code]
AffineQuant: Affine Transformation Quantization for Large Language Models
ICLR 2024 [Paper] [Code]
Oh! We Freeze: Improving Quantized Knowledge Distillation via Signal Propagation Analysis for Large Language Models
ICLR Practical ML for Low Resource Settings Workshop 2024 [Paper]
Accurate Block Quantization in LLMs with Outliers
Arxiv 2024 [Paper]
QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs
Arxiv 2024 [Paper] [Code]
Minimize Quantization Output Error with Bias Compensation
Arxiv 2024 [Paper] [Code]
Cherry on Top: Parameter Heterogeneity and Quantization in Large Language Models
Arxiv 2024 [Paper]
Increased LLM Vulnerabilities from Fine-tuning and Quantization
Arxiv 2024 [Paper]
Quantization of Large Language Models with an Overdetermined Basis
Arxiv 2024 [Paper]
How Good Are Low-bit Quantized LLaMA3 Models? An Empirical Study
Arxiv 2024 [Paper] [Code] [Model]
How to Parameterize Asymmetric Quantization Ranges for Quantization-Aware Training
Arxiv 2024 [Paper]
Mitigating the Impact of Outlier Channels for Language Model Quantization with Activation Regularization
Arxiv 2024 [Paper] [Code]
When Quantization Affects Confidence of Large Language Models?
NAACL 2024 [Paper]
QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving
Arxiv 2024 [Paper] [Code]
Learning from Students: Applying t-Distributions to Explore Accurate and Efficient Formats for LLMs
ICML 2024 [Paper]
LLM-QBench: A Benchmark Towards the Best Practice for Post-training Quantization of Large Language Models
Arxiv 2024 [Paper] [Code]
SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models
Arxiv 2024 [Paper]
Combining multiple post-training techniques to achieve most efficient quantized LLMs
Arxiv 2024 [Paper]
Edge Intelligence Optimization for Large Language Model Inference with Batching and Quantization
Arxiv 2024 [Paper]
SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models
Arxiv 2024 [Paper] [Code]
OAC: Output-adaptive Calibration for Accurate Post-training Quantization
Arxiv 2024 [Paper]
PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression
Arxiv 2024 [Paper]
SpinQuant -- LLM quantization with learned rotations
Arxiv 2024 [Paper]
Compressing Large Language Models using Low Rank and Low Precision Decomposition
Arxiv 2024 [Paper] [Code]
Athena: Efficient Block-Wise Post-Training Quantization for Large Language Models Using Second-Order Matrix Derivative Information
Arxiv 2024 [Paper]
Exploiting LLM Quantization
Arxiv 2024 [Paper]
One QuantLLM for ALL: Fine-tuning Quantized LLMs Once for Efficient Deployments
Arxiv 2024 [Paper]
LCQ: Low-Rank Codebook based Quantization for Large Language Models
Arxiv 2024 [Paper]
LoQT: Low Rank Adapters for Quantized Training
Arxiv 2024 [Paper] [Code]
CLAQ: Pushing the Limits of Low-Bit Post-Training Quantization for LLMs
Arxiv 2024 [Paper] [Code]
I-LLM: Efficient Integer-Only Inference for Fully-Quantized Low-Bit Large Language Models
Arxiv 2024 [Paper]
Outliers and Calibration Sets have Diminishing Effect on Quantization of Modern LLMs
Arxiv 2024 [Paper]
DuQuant: Distributing Outliers via Dual Transformation Makes Stronger Quantized LLMs
NeurIPS 2024 [Paper] [Code]
ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization
Arxiv 2024 [Paper] [Code]
Low-Rank Quantization-Aware Training for LLMs
Arxiv 2024 [Paper]
TernaryLLM: Ternarized Large Language Model
Arxiv 2024 [Paper]
Examining Post-Training Quantization for Mixture-of-Experts: A Benchmark
Arxiv 2024 [Paper] [Code]
Delta-CoMe: Training-Free Delta-Compression with Mixed-Precision for Large Language Models
Arxiv 2024 [Paper]
QQQ: Quality Quattuor-Bit Quantization for Large Language Models
Arxiv 2024 [Paper] [Code]
QTIP: Quantization with Trellises and Incoherence Processing
NeurIPS 2024 [Paper] [Code]
Prefixing Attention Sinks can Mitigate Activation Outliers for Large Language Model Quantization
EMNLP 2024 [Paper]
Mixture of Scales: Memory-Efficient Token-Adaptive Binarization for Large Language Models
Arxiv 2024 [Paper]
Tender: Accelerating Large Language Models via Tensor Decomposition and Runtime Requantization
ISCA 2024 [Paper]
SDQ: Sparse Decomposed Quantization for LLM Inference
Arxiv 2024 [Paper]
Attention-aware Post-training Quantization without Backpropagation
Arxiv 2024 [Paper]
EDGE-LLM: Enabling Efficient Large Language Model Adaptation on Edge Devices via Layerwise Unified Compression and Adaptive Layer Tuning and Voting
Arxiv 2024 [Paper] [Code]
Compensate Quantization Errors: Make Weights Hierarchical to Compensate Each Other
Arxiv 2024 [Paper]
Layer-Wise Quantization: A Pragmatic and Effective Method for Quantizing LLMs Beyond Integer Bit-Levels
Arxiv 2024 [Paper] [Code]
CDQuant: Accurate Post-training Weight Quantization of Large Pre-trained Models using Greedy Coordinate Descent
Arxiv 2024 [Paper]
OutlierTune: Efficient Channel-Wise Quantization for Large Language Models
Arxiv 2024 [Paper]
T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge
Arxiv 2024 [Paper] [Code]
GPTQT: Quantize Large Language Models Twice to Push the Efficiency
ICORIS 2024 [Paper]
Improving Conversational Abilities of Quantized Large Language Models via Direct Preference Alignment
ACL 2024 [Paper]
How Does Quantization Affect Multilingual LLMs?
EMNLP Findings 2024 [Paper]
RoLoRA: Fine-tuning Rotated Outlier-free LLMs for Effective Weight-Activation Quantization
EMNLP Findings 2024 [Paper] [Code]
Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients
Arxiv 2024 [Paper] [Code]
FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision
Arxiv 2024 [Paper] [Code]
Accuracy is Not All You Need
Arxiv 2024 [Paper]
BitNet b1.58 Reloaded: State-of-the-art Performance Also on Smaller Networks
Arxiv 2024 [Paper]
LeanQuant: Accurate Large Language Model Quantization with Loss-Error-Aware Grid
Arxiv 2024 [Paper]
Fast Matrix Multiplications for Lookup Table-Quantized LLMs
EMNLP Findings 2024 [Paper] [Code]
EfficientQAT: Efficient Quantization-Aware Training for Large Language Models
Arxiv 2024 [Paper] [Code]
LRQ: Optimizing Post-Training Quantization for Large Language Models by Learning Low-Rank Weight-Scaling Matrices
Arxiv 2024 [Paper] [Code]
Exploring Quantization for Efficient Pre-Training of Transformer Language Models
EMNLP Findings 2024 [Paper] [Code]
Spectra: A Comprehensive Study of Ternary, Quantized, and FP16 Language Models
Arxiv 2024 [Paper] [Code]
Mamba-PTQ: Outlier Channels in Recurrent Large Language Models
Efficient Systems for Foundation Models Workshop @ ICML 2024 [Paper]
Compensate Quantization Errors+: Quantized Models Are Inquisitive Learners
Arxiv 2024 [Paper]
Accurate and Efficient Fine-Tuning of Quantized Large Language Models Through Optimal Balance
Arxiv 2024 [Paper] [Code]
STBLLM: Breaking the 1-Bit Barrier with Structured Binary LLMs
Arxiv 2024 [Paper]
Advancing Multimodal Large Language Models with Quantization-Aware Scale Learning for Efficient Adaptation
ACM MM 2024 [Paper]
ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models
Arxiv 2024 [Paper]
MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models
Arxiv 2024 [Paper] [Code (Marlin)] [Code (Sparse Marlin)]
Matmul or No Matmul in the Era of 1-bit LLMs
Arxiv 2024 [Paper]
MobileQuant: Mobile-friendly Quantization for On-device Language Models
EMNLP Findings 2024 [Paper] [Code]
GIFT-SW: Gaussian noise Injected Fine-Tuning of Salient Weights for LLMs
Arxiv 2024 [Paper] [Code]
Foundations of Large Language Model Compression -- Part 1: Weight Quantization
Arxiv 2024 [Paper] [Code]
OPAL: Outlier-Preserved Microscaling Quantization Accelerator for Generative Large Language Models
DAC 2024 [Paper]
VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models
EMNLP 2024 [Paper] [Code]
Scaling FP8 training to trillion-token LLMs
Arxiv 2024 [Paper]
Accumulator-Aware Post-Training Quantization
Arxiv 2024 [Paper]
Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores
Arxiv 2024 [Paper]
Rotated Runtime Smooth: Training-Free Activation Smoother for accurate INT4 inference
Arxiv 2024 [Paper] [Code]
EXAQ: Exponent Aware Quantization For LLMs Acceleration
Arxiv 2024 [Paper]
ARB-LLM: Alternating Refined Binarizations for Large Language Models
Arxiv 2024 [Paper] [Code]
PrefixQuant: Static Quantization Beats Dynamic through Prefixed Outliers in LLMs
Arxiv 2024 [Paper] [Code]
SpaLLM: Unified Compressive Adaptation of Large Language Models with Sketching
Arxiv 2024 [Paper]
Scaling Laws for Mixed quantization in Large Language Models
Arxiv 2024 [Paper]
Q-VLM: Post-training Quantization for Large Vision-Language Models
NeurIPS 2024 [Paper] [Code]
CrossQuant: A Post-Training Quantization Method with Smaller Quantization Kernel for Precise Large Language Model Compression
Arxiv 2024 [Paper]
FlatQuant: Flatness Matters for LLM Quantization
Arxiv 2024 [Paper] [Code]
DeltaDQ: Ultra-High Delta Compression for Fine-Tuned LLMs via Group-wise Dropout and Separate Quantization
Arxiv 2024 [Paper]
QEFT: Quantization for Efficient Fine-Tuning of LLMs
EMNLP Findings 2024 [Paper] [Code]
Continuous Approximations for Improving Quantization Aware Training of LLMs
Arxiv 2024 [Paper]
DAQ: Density-Aware Post-Training Weight-Only Quantization For LLMs
Arxiv 2024 [Paper]
COMET: Towards Practical W4A4KV4 LLMs Serving
Arxiv 2024 [Paper]
Scaling laws for post-training quantized large language models
Arxiv 2024 [Paper]
Channel-Wise Mixed-Precision Quantization for Large Language Models
Arxiv 2024 [Paper]
Understanding the difficulty of low-precision post-training quantization of large language models
Arxiv 2024 [Paper]
QuAILoRA: Quantization-Aware Initialization for LoRA
NeurIPS Workshop on Efficient Natural Language and Speech Processing (ENLSP-IV) 2024 [Paper]
SDP4Bit: Toward 4-bit Communication Quantization in Sharded Data Parallelism for LLM Training
NeurIPS 2024 [Paper]
Pyramid Vector Quantization for LLMs
Arxiv 2024 [Paper]
TesseraQ: Ultra Low-Bit LLM Post-Training Quantization with Block Reconstruction
Arxiv 2024 [Paper] [Code]
COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training
Arxiv 2024 [Paper] [Code]
BitStack: Fine-Grained Size Control for Compressed Large Language Models in Variable Memory Environments
Arxiv 2024 [Paper] [Code]
GWQ: Gradient-Aware Weight Quantization for Large Language Models
Arxiv 2024 [Paper]
"Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization
Arxiv 2024 [Paper]
Interactions Across Blocks in Post-Training Quantization of Large Language Models
Arxiv 2024 [Paper]
BitNet a4.8: 4-bit Activations for 1-bit LLMs
Arxiv 2024 [Paper]
The Super Weight in Large Language Models
Arxiv 2024 [Paper] [Code]
ASER: Activation Smoothing and Error Reconstruction for Large Language Model Quantization
Arxiv 2024 [Paper]
Towards Low-bit Communication for Tensor Parallel LLM Inference
Arxiv 2024 [Paper]
AMXFP4: Taming Activation Outliers with Asymmetric Microscaling Floating-Point for 4-bit LLM Inference
Arxiv 2024 [Paper] [Code]
Scaling Laws for Precision
Arxiv 2024 [Paper]
BitMoD: Bit-serial Mixture-of-Datatype LLM Acceleration
HPCA 2025 [Paper] [Code]
SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration
Arxiv 2024 [Paper] [Code]
AutoMixQ: Self-Adjusting Quantization for High Performance Memory-Efficient Fine-Tuning
Arxiv 2024 [Paper]
Anda: Unlocking Efficient LLM Inference with a Variable-Length Grouped Activation Data Format
HPCA 2025 [Paper]
MixPE: Quantization and Hardware Co-design for Efficient LLM Inference
Arxiv 2024 [Paper]
Pushing the Limits of Large Language Model Quantization via the Linearity Theorem
Arxiv 2024 [Paper]
Low-Bit Quantization Favors Undertrained LLMs: Scaling Laws for Quantized LLMs with 100T Training Tokens
Arxiv 2024 [Paper] [Models]
DFRot: Achieving Outlier-Free and Massive Activation-Free for Rotated LLMs with Refined Rotation
Arxiv 2024 [Paper] [Code]
RILQ: Rank-Insensitive LoRA-based Quantization Error Compensation for Boosting 2-bit Large Language Model Accuracy
Arxiv 2024 [Paper]
CPTQuant -- A Novel Mixed Precision Post-Training Quantization Techniques for Large Language Models
Arxiv 2024 [Paper]
SKIM: Any-bit Quantization Pushing The Limits of Post-Training Quantization
Arxiv 2024 [Paper]
Direct Quantized Training of Language Models with Stochastic Rounding
Arxiv 2024 [Paper] [Code]
Taming Sensitive Weights: Noise Perturbation Fine-tuning for Robust LLM Quantization
Arxiv 2024 [Paper]
Low-Rank Correction for Quantized LLMs
Arxiv 2024 [Paper]
CRVQ: Channel-relaxed Vector Quantization for Extreme Compression of LLMs
Arxiv 2024 [Paper]
ResQ: Mixed-Precision Quantization of Large Language Models with Low-Rank Residuals
Arxiv 2024 [Paper] [Code]
MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design
Arxiv 2024 [Paper]
GQSA: Group Quantization and Sparsity for Accelerating Large Language Model Inference
Arxiv 2024 [Paper]
Pruning and Sparsity
The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers
ICLR 2023 [Paper]
Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time
ICML 2023 [Paper] [Code]
LoSparse: Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation
ICML 2023 [Paper] [Code]
LLM-Pruner: On the Structural Pruning of Large Language Models
NeurIPS 2023 [Paper] [Code]
ZipLM: Inference-Aware Structured Pruning of Language Models
NeurIPS 2023 [Paper] [Code]
H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
NeurIPS 2023 [Paper] [Code]
The Emergence of Essential Sparsity in Large Pre-trained Models: The Weights that Matter
NeurIPS 2023 [Paper] [Code]
Learning to Compress Prompts with Gist Tokens
NeurIPS 2023 [Paper]
Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers
NeurIPS 2023 [Paper]
Prune and Tune: Improving Efficient Pruning Techniques for Massive Language Models
ICLR 2023 TinyPapers [Paper]
SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot
ICML 2023 [Paper] [Code]
AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning
ICLR 2023 [Paper]
Rethinking the Role of Scale for In-Context Learning: An Interpretability-based Case Study at 66 Billion Scale
ACL 2023 [Paper] [Code]
Structured Pruning for Efficient Generative Pre-trained Language Models
ACL 2023 [Paper]
A Simple and Effective Pruning Approach for Large Language Models
ICLR 2024 [Paper] [Code]
Pruning Meets Low-Rank Parameter-Efficient Fine-Tuning
ACL Findings 2024 [Paper]
Structural pruning of large language models via neural architecture search
AutoML 2023 [Paper]
Pruning Large Language Models via Accuracy Predictor
ICASSP 2024 [Paper]
Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity
VLDB 2024 [Paper] [Code]
Compressing LLMs: The Truth is Rarely Pure and Never Simple
ICLR 2024 [Paper]
Pruning Small Pre-Trained Weights Irreversibly and Monotonically Impairs "Difficult" Downstream Tasks in LLMs
ICML 2024 [Paper] [Code]
Compresso: Structured Pruning with Collaborative Prompting Learns Compact Large Language Models
Arxiv 2023 [Paper] [Code]
Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity
Arxiv 2023 [Paper] [Code]
Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning
Arxiv 2023 [Paper] [Code]
Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs
ICLR 2024 [Paper] [Code]
One-Shot Sensitivity-Aware Mixed Sparsity Pruning for Large Language Models
ICASSP 2024 [Paper]
Survival of the Most Influential Prompts: Efficient Black-Box Prompt Search via Clustering and Pruning
EMNLP Findings 2023 [Paper]
The Cost of Compression: Investigating the Impact of Compression on Parametric Knowledge in Language Models
EMNLP Findings 2023 [Paper]
Divergent Token Metrics: Measuring degradation to prune away LLM components -- and optimize quantization
Arxiv 2023 [Paper]
LoRAShear: Efficient Large Language Model Structured Pruning and Knowledge Recovery
Arxiv 2023 [Paper]
ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models
Arxiv 2023 [Paper]
E-Sparse: Boosting the Large Language Model Inference through Entropy-based N:M Sparsity
Arxiv 2023 [Paper]
Beyond Size: How Gradients Shape Pruning Decisions in Large Language Models
Arxiv 2023 [Paper] [Code]
On the Impact of Calibration Data in Post-training Quantization and Pruning
ACL 2024 [Paper]
BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation
OpenReview [Paper] [Code]
Pushing Gradient towards Zero: A Novel Pruning Method for Large Language Models
OpenReview 2023 [Paper]
Plug-and-Play: An Efficient Post-training Pruning Method for Large Language Models
ICLR 2024 [Paper] [Code]
Lighter, yet More Faithful: Investigating Hallucinations in Pruned Large Language Models for Abstractive Summarization
Arxiv 2023 [Paper] [Code]
LoRAPrune: Pruning Meets Low-Rank Parameter-Efficient Fine-Tuning
Arxiv 2023 [Paper]
Mini-GPTs: Efficient Large Language Models through Contextual Pruning
Arxiv 2023 [Paper] [Code]
The LLM Surgeon
Arxiv 2023 [Paper]
Fluctuation-based Adaptive Structured Pruning for Large Language Models
AAAI 2024 [Paper]
How to Prune Your Language Model: Recovering Accuracy on the "Sparsity May Cry" Benchmark
CPAL 2024 [Paper]
PERP: Rethinking the Prune-Retrain Paradigm in the Era of LLMs
Arxiv 2023 [Paper]
Fast and Optimal Weight Update for Pruned Large Language Models
Arxiv 2024 [Paper]
APT: Adaptive Pruning and Tuning Pretrained Language Models for Efficient Training and Inference
Arxiv 2024 [Paper]
Scaling Sparse Fine-Tuning to Large Language Models
Arxiv 2024 [Paper]
SliceGPT: Compress Large Language Models by Deleting Rows and Columns
ICLR 2024 [Paper] [Code]
Shortened LLaMA: A Simple Depth Pruning for Large Language Models
Arxiv 2024 [Paper]
Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes
Arxiv 2024 [Paper] [Code]
NutePrune: Efficient Progressive Pruning with Numerous Teachers for Large Language Models
Arxiv 2024 [Paper]
LaCo: Large Language Model Pruning via Layer Collapse
EMNLP Findings 2024 [Paper]
Why Lift so Heavy? Slimming Large Language Models by Cutting Off the Layers
Arxiv 2024 [Paper]
EBFT: Effective and Block-Wise Fine-Tuning for Sparse LLMs
Arxiv 2024 [Paper] [Code]
Data-free Weight Compress and Denoise for Large Language Models
Arxiv 2024 [Paper]
Gradient-Free Adaptive Global Pruning for Pre-trained Language Models
Arxiv 2024 [Paper]
ShortGPT: Layers in Large Language Models are More Redundant Than You Expect
Arxiv 2024 [Paper]
LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models
Arxiv 2024 [Paper] [Code]
Compressing Large Language Models by Streamlining the Unimportant Layer
Arxiv 2024 [Paper]
LoRAP: Transformer Sub-Layers Deserve Differentiated Structured Compression for Large Language Models
Arxiv 2024 [Paper]
LoNAS: Elastic Low-Rank Adapters for Efficient Large Language Models
COLING 2024 [Paper] [Code]
Shears: Unstructured Sparsity with Neural Low-rank Adapter Search
NAACL 2024 [Paper] [Code]
Eigenpruning
NAACL 2024 Abstract [Paper]
OpenBA-V2: Reaching 77.3% High Compression Ratio with Fast Multi-Stage Pruning
Arxiv 2024 [Paper]
Pruning as a Domain-specific LLM Extractor
NAACL Findings 2024 [Paper] [Code]
Differentiable Model Scaling using Differentiable Topk
ICML 2024 [Paper]
COPAL: Continual Pruning in Large Language Generative Models
ICML 2024 [Paper]
Pruner-Zero: Evolving Symbolic Pruning Metric from scratch for Large Language Models
ICML 2024 [Paper] [Code]
Feature-based Low-Rank Compression of Large Language Models via Bayesian Optimization
ACL Findings 2024 [Paper]
Surgical Feature-Space Decomposition of LLMs: Why, When and How?
ACL 2024 [Paper]
Pruning Large Language Models to Intra-module Low-rank Architecture with Transitional Activations
ACL Findings 2024 [Paper]
Light-PEFT: Lightening Parameter-Efficient Fine-Tuning via Early Pruning
ACL Findings 2024 [Paper] [Code]
Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
ICML 2024 [Paper] [Code]
MoreauPruner: Robust Pruning of Large Language Models against Weight Perturbations
Arxiv 2024 [Paper] [Code]
ALPS: Improved Optimization for Highly Sparse One-Shot Pruning for Large Language Models
Arxiv 2024 [Paper]
HiP Attention: Sparse Sub-Quadratic Attention with Hierarchical Attention Pruning
Arxiv 2024 [Paper]
Optimization-based Structural Pruning for Large Language Models without Back-Propagation
Arxiv 2024 [Paper]
BlockPruner: Fine-grained Pruning for Large Language Models
Arxiv 2024 [Paper] [Code]
Rethinking Pruning Large Language Models: Benefits and Pitfalls of Reconstruction Error Minimization
Arxiv 2024 [Paper]
RankAdaptor: Hierarchical Dynamic Low-Rank Adaptation for Structural Pruned LLMs
Arxiv 2024 [Paper]
What Matters in Transformers? Not All Attention is Needed
Arxiv 2024 [Paper] [Code]
Pruning via Merging: Compressing LLMs via Manifold Alignment Based Layer Merging
EMNLP 2024 [Paper]
ShadowLLM: Predictor-based Contextual Sparsity for Large Language Models
Arxiv 2024 [Paper] [Code]
Finding Transformer Circuits with Edge Pruning
Arxiv 2024 [Paper] [Code]
Efficient Expert Pruning for Sparse Mixture-of-Experts Language Models: Enhancing Performance and Reducing Inference Costs
Arxiv 2024 [Paper] [Code]
MINI-LLM: Memory-Efficient Structured Pruning for Large Language Models
Arxiv 2024 [Paper]
Reconstruct the Pruned Model without Any Retraining
Arxiv 2024 [Paper]
A deeper look at depth pruning of LLMs
ICML TF2M Workshop 2024 [Paper] [Code]
Greedy Output Approximation: Towards Efficient Structured Pruning for LLMs Without Retraining
Arxiv 2024 [Paper]
Pruning Large Language Models with Semi-Structural Adaptive Sparse Training
Arxiv 2024 [Paper]
A Convex-optimization-based Layer-wise Post-training Pruner for Large Language Models
Arxiv 2024 [Paper]
ThinK: Thinner Key Cache by Query-Driven Pruning
Arxiv 2024 [Paper]
MoDeGPT: Modular Decomposition for Large Language Model Compression
Arxiv 2024 [Paper]
LLM-Barber: Block-Aware Rebuilder for Sparsity Mask in One-Shot for Large Language Models
Arxiv 2024 [Paper] [Code]
LLM Pruning and Distillation in Practice: The Minitron Approach
Arxiv 2024 [Paper] [Models]
Training-Free Activation Sparsity in Large Language Models
Arxiv 2024 [Paper]
PAT: Pruning-Aware Tuning for Large Language Models
Arxiv 2024 [Paper] [Code]
Sirius: Contextual Sparsity with Correction for Efficient LLMs
Arxiv 2024 [Paper] [Code]
STUN: Structured-Then-Unstructured Pruning for Scalable MoE Pruning
Arxiv 2024 [Paper]
DISP-LLM: Dimension-Independent Structural Pruning for Large Language Models
NeurIPS 2024 [Paper]
Search for Efficient Large Language Models
NeurIPS 2024 [Paper]
SlimGPT: Layer-wise Structured Pruning for Large Language Models
NeurIPS 2024 [Paper]
Learn To be Efficient: Build Structured Sparsity in Large Language Models
NeurIPS 2024 [Paper]
ALS: Adaptive Layer Sparsity for Large Language Models via Activation Correlation Assessment
NeurIPS 2024 [Paper]
Getting Free Bits Back from Rotational Symmetries in LLMs
Arxiv 2024 [Paper]
SLiM: One-shot Quantized Sparse Plus Low-rank Approximation of LLMs
Arxiv 2024 [Paper] [Code]
Self-Data Distillation for Recovering Quality in Pruned Large Language Models
NeurIPS 2024 Machine Learning and Compression Workshop [Paper]
EvoPress: Towards Optimal Dynamic Model Compression via Evolutionary Search
Arxiv 2024 [Paper] [Code]
Pruning Foundation Models for High Accuracy without Retraining
EMNLP Findings 2024 [Paper] [Code]
Beware of Calibration Data for Pruning Large Language Models
Arxiv 2024 [Paper]
SQFT: Low-cost Model Adaptation in Low-precision Sparse Foundation Models
EMNLP Findings 2024 [Paper] [Code]
Change Is the Only Constant: Dynamic LLM Slicing based on Layer Redundancy
EMNLP Findings 2024 [Paper] [Code]
Scaling Law for Post-training after Model Pruning
Arxiv 2024 [Paper]
LEMON: Reviving Stronger and Smaller LMs from Larger LMs with Linear Parameter Fusion
ACL 2024 [Paper]
Distillation
Lifting the Curse of Capacity Gap in Distilling Language Models
ACL 2023 [Paper] [Code]
Symbolic Chain-of-Thought Distillation: Small Models Can Also "Think" Step-by-Step
ACL 2023 [Paper]
Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes
ACL 2023 [Paper]
SCOTT: Self-Consistent Chain-of-Thought Distillation
ACL 2023 [Paper]
DISCO: Distilling Counterfactuals with Large Language Models
ACL 2023 [Paper] [Code]
LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions
Arxiv 2023 [Paper] [Code]
How To Train Your (Compressed) Large Language Model
Arxiv 2023 [Paper]
The False Promise of Imitating Proprietary LLMs
Arxiv 2023 [Paper]
GPT4All: Training an Assistant-style Chatbot with Large Scale Data Distillation from GPT-3.5-Turbo
Arxiv 2023 [Paper] [Code]
PaD: Program-aided Distillation Specializes Large Models in Reasoning
Arxiv 2023 [Paper]
MiniLLM: Knowledge Distillation of Large Language Models
ICLR 2024 [Paper] [Code]
On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes
ICLR 2024 [Paper]
GKD: Generalized Knowledge Distillation for Auto-regressive Sequence Models
ICLR 2024 [Paper]
Chain-of-Thought Prompt Distillation for Multimodal Named Entity and Multimodal Relation Extraction
Arxiv 2023 [Paper]
Task-agnostic Distillation of Encoder-Decoder Language Models
Arxiv 2023 [Paper]
Sci-CoT: Leveraging Large Language Models for Enhanced Knowledge Distillation in Small Models for Scientific QA
Arxiv 2023 [Paper]
Baby Llama: knowledge distillation from an ensemble of teachers trained on a small dataset with no performance penalty
CoNLL 2023 [Paper] [Code]
Can a student Large Language Model perform as well as its teacher?
Arxiv 2023 [Paper]
Multistage Collaborative Knowledge Distillation from Large Language Models
ACL 2024 [Paper] [Code]
Lion: Adversarial Distillation of Closed-Source Large Language Model
EMNLP 2023 [Paper] [Code]
MCC-KD: Multi-CoT Consistent Knowledge Distillation
EMNLP 2023 [Paper]
PromptMix: A Class Boundary Augmentation Method for Large Language Model Distillation
EMNLP 2023 [Paper]
YODA: Teacher-Student Progressive Learning for Language Models
Arxiv 2023 [Paper]
Knowledge Fusion of Large Language Models
ICLR 2024 [Paper] [Code]
Knowledge Distillation for Closed-Source Language Models
Arxiv 2024 [Paper]
TinyLLM: Learning a Small Student from Multiple Large Language Models
Arxiv 2024 [Paper]
Towards Cross-Tokenizer Distillation: the Universal Logit Distillation Loss for LLMs
Arxiv 2024 [Paper]
Revisiting Knowledge Distillation for Autoregressive Language Models
ACL 2024 [Paper]
Sinkhorn Distance Minimization for Knowledge Distillation
COLING 2024 [Paper]
Divide-or-Conquer? Which Part Should You Distill Your LLM?
Arxiv 2024 [Paper]
Learning to Maximize Mutual Information for Chain-of-Thought Distillation
ACL Findings 2024 [Paper]
DistiLLM: Towards Streamlined Distillation for Large Language Models
ICML 2024 [Paper] [Code]
Efficiently Distilling LLMs for Edge Applications
NAACL 2024 [Paper]
Rethinking Kullback-Leibler Divergence in Knowledge Distillation for Large Language Models
Arxiv 2024 [Paper]
Distilling Algorithmic Reasoning from LLMs via Explaining Solution Programs
Arxiv 2024 [Paper]
Direct Preference Knowledge Distillation for Large Language Models
Arxiv 2024 [Paper] [Code]
Dual-Space Knowledge Distillation for Large Language Models
Arxiv 2024 [Paper] [Code]
DDK: Distilling Domain Knowledge for Efficient Large Language Models
Arxiv 2024 [Paper]
Compact Language Models via Pruning and Knowledge Distillation
Arxiv 2024 [Paper] [Code]
LLM Pruning and Distillation in Practice: The Minitron Approach
Arxiv 2024 [Paper] [Models]
The Mamba in the Llama: Distilling and Accelerating Hybrid Models
Arxiv 2024 [Paper]
DocKD: Knowledge Distillation from LLMs for Open-World Document Understanding Models
EMNLP 2024 [Paper]
SWITCH: Studying with Teacher for Knowledge Distillation of Large Language Models
Arxiv 2024 [Paper]
Mentor-KD: Making Small Language Models Better Multi-step Reasoners
EMNLP 2024 [Paper] [Code]
Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models
Arxiv 2024 [Paper]
LLM-Neo: Parameter Efficient Knowledge Distillation for Large Language Models
Arxiv 2024 [Paper] [Code]
Efficient Prompting
Did You Read the Instructions? Rethinking the Effectiveness of Task Definitions in Instruction Learning
ACL 2023 [Paper] [Code]
Batch Prompting: Efficient Inference with Large Language Model APIs
EMNLP 2023 [Paper] [Code]
Adapting Language Models to Compress Contexts
EMNLP 2023 [Paper] [Code]
Compressing Context to Enhance Inference Efficiency of Large Language Models
EMNLP 2023 [Paper] [Code]
LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models
EMNLP 2023 [Paper] [Code]
Vector-Quantized Prompt Learning for Paraphrase Generation
EMNLP Findings 2023 [Paper]
Efficient Prompting via Dynamic In-Context Learning
Arxiv 2023 [Paper]
Learning to Compress Prompts with Gist Tokens
NeurIPS 2023 [Paper] [Code]
In-context Autoencoder for Context Compression in a Large Language Model
ICLR 2024 [Paper]
Discrete Prompt Compression with Reinforcement Learning
Arxiv 2023 [Paper] [Code]
BatchPrompt: Accomplish more with less
Arxiv 2023 [Paper]
(Dynamic) Prompting might be all you need to repair Compressed LLMs
Arxiv 2023 [Paper]
RECOMP: Improving Retrieval-Augmented LMs with Compression and Selective Augmentation
Arxiv 2023 [Paper] [Code]
LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression
ACL 2024 [Paper] [Code]
Extending Context Window of Large Language Models via Semantic Compression
Arxiv 2023 [Paper]
Fewer is More: Boosting LLM Reasoning with Reinforced Context Pruning
EMNLP 2024 [Paper] [Code]
The Impact of Reasoning Step Length on Large Language Models
ACL Findings 2024 [Paper]
Compressed Context Memory For Online Language Model Interaction
ICLR 2024 [Paper] [Code]
Learning to Compress Prompt in Natural Language Formats
Arxiv 2024 [Paper]
Say More with Less: Understanding Prompt Learning Behaviors through Gist Compression
Arxiv 2024 [Paper] [Code]
StreamingDialogue: Prolonged Dialogue Learning via Long Context Compression with Minimal Losses
Arxiv 2024 [Paper]
LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression
Arxiv 2024 [Paper] [Code]
PCToolkit: A Unified Plug-and-Play Prompt Compression Toolkit of Large Language Models
Arxiv 2024 [Paper] [Code]
PROMPT-SAW: Leveraging Relation-Aware Graphs for Textual Prompt Compression
Arxiv 2024 [Paper]
Prompts As Programs: A Structure-Aware Approach to Efficient Compile-Time Prompt Optimization
Arxiv 2024 [Paper] [Code]
Adapting LLMs for Efficient Context Processing through Soft Prompt Compression
IPCA 2024 [Paper]
Compressing Long Context for Enhancing RAG with AMR-based Concept Distillation
Arxiv 2024 [Paper]
Unifying Demonstration Selection and Compression for In-Context Learning
Arxiv 2024 [Paper]
SelfCP: Compressing Long Prompt to 1/12 Using the Frozen Large Language Model Itself
Arxiv 2024 [Paper]
Fundamental Limits of Prompt Compression: A Rate-Distortion Framework for Black-Box Language Models
Arxiv 2024 [Paper]
QUITO: Accelerating Long-Context Reasoning through Query-Guided Context Compression
Arxiv 2024 [Paper] [Code]
500xCompressor: Generalized Prompt Compression for Large Language Models
Arxiv 2024 [Paper]
Enhancing and Accelerating Large Language Models via Instruction-Aware Contextual Compression
Arxiv 2024 [Paper]
Prompt Compression with Context-Aware Sentence Encoding for Fast and Improved LLM Inference
Arxiv 2024 [Paper] [Code]
Learning to Compress Contexts for Efficient Knowledge-based Visual Question Answering
Arxiv 2024 [Paper]
Parse Trees Guided LLM Prompt Compression
Arxiv 2024 [Paper]
AlphaZip: Neural Network-Enhanced Lossless Text Compression
Arxiv 2024 [Paper]
Discovering the Gems in Early Layers: Accelerating Long-Context LLMs with 1000x Input Token Reduction
Arxiv 2024 [Paper] [Code]
Perception Compressor: A training-free prompt compression method in long context scenarios
Arxiv 2024 [Paper]
From Reading to Compressing: Exploring the Multi-document Reader for Prompt Compression
EMNLP Findings 2024 [Paper]
Selection-p: Self-Supervised Task-Agnostic Prompt Compression for Faithfulness and Transferability
EMNLP Findings 2024 [Paper]
Style-Compress: An LLM-Based Prompt Compression Framework Considering Task-Specific Styles
EMNLP Findings 2024 [Paper]
KV Cache Compression
Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time
NeurIPS 2023 [Paper]
Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs
ICLR 2024 [Paper]
KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
NeurIPS 2024 [Paper]
KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
ICML 2024 [Paper] [Code]
No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization
Arxiv 2024 [Paper]
Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference
MLSys 2024 [Paper]
GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM
Arxiv 2024 [Paper]
QAQ: Quality Adaptive Quantization for LLM KV Cache
Arxiv 2024 [Paper] [Code]
KV Cache is 1 Bit Per Channel: Efficient Large Language Model Inference with Coupled Quantization
Arxiv 2024 [Paper]
PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference
ACL 2024 [Paper]
Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression
Arxiv 2024 [Paper]
ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification
Arxiv 2024 [Paper]
MiniCache: KV Cache Compression in Depth Dimension for Large Language Models
Arxiv 2024 [Paper]
PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling
Arxiv 2024 [Paper]
QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead
Arxiv 2024 [Paper] [Code]
Effectively Compress KV Heads for LLM
Arxiv 2024 [Paper]
A Simple and Effective L2 Norm-Based Strategy for KV Cache Compression
EMNLP 2024 [Paper]
PQCache: Product Quantization-based KVCache for Long Context LLM Inference
Arxiv 2024 [Paper]
Palu: Compressing KV-Cache with Low-Rank Projection
Arxiv 2024 [Paper] [Code]
RazorAttention: Efficient KV Cache Compression Through Retrieval Heads
Arxiv 2024 [Paper]
Finch: Prompt-guided Key-Value Cache Compression
Arxiv 2024 [Paper]
Zero-Delay QKV Compression for Mitigating KV Cache and Network Bottlenecks in LLM Inference
Arxiv 2024 [Paper]
Eigen Attention: Attention in Low-Rank Space for KV Cache Compression
EMNLP Findings 2024 [Paper] [Code]
CSKV: Training-Efficient Channel Shrinking for KV Cache in Long-Context Scenarios
Arxiv 2024 [Paper] [Code]
LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy
Arxiv 2024 [Paper]
SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction
Arxiv 2024 [Paper] [Code]
MatryoshkaKV: Adaptive KV Compression via Trainable Orthogonal Projection
Arxiv 2024 [Paper]
AsymKV: Enabling 1-Bit Quantization of KV Cache with Layer-Wise Asymmetric Quantization Configurations
Arxiv 2024 [Paper]
Residual vector quantization for KV cache compression in large language model
Arxiv 2024 [Paper] [Code]
Lossless KV Cache Compression to 2%
Arxiv 2024 [Paper]
KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing
Arxiv 2024 [Paper] [Code]
Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning
Arxiv 2024 [Paper] [Code]
NACL: A General and Effective KV Cache Eviction Framework for LLMs at Inference Time
ACL 2024 [Paper] [Code]
DHA: Learning Decoupled-Head Attention from Transformer Checkpoints via Adaptive Heads Fusion
NeurIPS 2024 [Paper]
MiniKV: Pushing the Limits of LLM Inference via 2-Bit Layer-Discriminative KV Cache
Arxiv 2024 [Paper]
Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity
Arxiv 2024 [Paper]
Unifying KV Cache Compression for Large Language Models with LeanKV
Arxiv 2024 [Paper]
ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression
Arxiv 2024 [Paper]
Lexico: Extreme KV Cache Compression via Sparse Coding over Universal Dictionaries
Arxiv 2024 [Paper] [Code]
ZigZagkv: Dynamic KV Cache Compression for Long-context Modeling based on Layer Uncertainty
Arxiv 2024 [Paper]
SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator
Arxiv 2024 [Paper] [Code]
More Tokens, Lower Precision: Towards the Optimal Token-Precision Trade-off in KV Cache Compression
Arxiv 2024 [Paper]
SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation
Arxiv 2024 [Paper] [Code]
DynamicKV: Task-Aware Adaptive KV Cache Compression for Long Context LLMs
Arxiv 2024 [Paper]
Other
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
NeurIPS 2022 [Paper] [Code]
TensorGPT: Efficient Compression of the Embedding Layer in LLMs based on the Tensor-Train Decomposition
Arxiv 2023 [Paper]
Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers
NeurIPS 2023 [Paper]
SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference
Arxiv 2023 [Paper]
Scaling In-Context Demonstrations with Structured Attention
Arxiv 2023 [Paper]
Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline
Arxiv 2023 [Paper] [Code]
CPET: Effective Parameter-Efficient Tuning for Compressed Large Language Models
Arxiv 2023 [Paper]
Ternary Singular Value Decomposition as a Better Parameterized Form in Linear Mapping
Arxiv 2023 [Paper]
LLMCad: Fast and Scalable On-device Large Language Model Inference
Arxiv 2023 [Paper]
vLLM: Efficient Memory Management for Large Language Model Serving with PagedAttention
Arxiv 2023 [Paper]
LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models
Arxiv 2023 [Paper] [Code]
LORD: Low Rank Decomposition Of Monolingual Code LLMs For One-Shot Compression
Arxiv 2023 [Paper] [Code]
Mixture of Tokens: Efficient LLMs through Cross-Example Aggregation
Arxiv 2023 [Paper]
Efficient Streaming Language Models with Attention Sinks
Arxiv 2023 [Paper] [Code]
Efficient Large Language Models Fine-Tuning On Graphs
Arxiv 2023 [Paper]
SparQ Attention: Bandwidth-Efficient LLM Inference
Arxiv 2023 [Paper]
Rethinking Compression: Reduced Order Modelling of Latent Features in Large Language Models
Arxiv 2023 [Paper]
PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU
Arxiv 2023 [Paper] [Code]
Dataset Quantization
ICCV 2023 [Paper] [Code]
Text Alignment Is An Efficient Unified Model for Massive NLP Tasks
NeurIPS 2023 [Paper] [Code]
Context Compression for Auto-regressive Transformers with Sentinel Tokens
EMNLP 2023 [Paper] [Code]
TCRA-LLM: Token Compression Retrieval Augmented Large Language Model for Inference Cost Reduction
EMNLP Findings 2023 [Paper]
Retrieval-based Knowledge Transfer: An Effective Approach for Extreme Large Language Model Compression
EMNLP Findings 2023 [Paper]
FFSplit: Split Feed-Forward Network For Optimizing Accuracy-Efficiency Trade-off in Language Model Inference
Arxiv 2024 [Paper]
LoMA: Lossless Compressed Memory Attention
Arxiv 2024 [Paper]
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Arxiv 2024 [Paper] [Code]
BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models
Arxiv 2024 [Paper] [Code]
CompactifAI: Extreme Compression of Large Language Models using Quantum-Inspired Tensor Networks
Arxiv 2024 [Paper]
MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases
ICML 2024 [Paper] [Code]
BAdam: A Memory Efficient Full Parameter Training Method for Large Language Models
Arxiv 2024 [Paper] [Code]
NoMAD-Attention: Efficient LLM Inference on CPUs Through Multiply-add-free Attention
Arxiv 2024 [Paper]
Not all Layers of LLMs are Necessary during Inference
Arxiv 2024 [Paper]
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
Arxiv 2024 [Paper]
Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference
Arxiv 2024 [Paper]
Smart-Infinity: Fast Large Language Model Training using Near-Storage Processing on a Real System
HPCA 2024 [Paper]
ALoRA: Allocating Low-Rank Adaptation for Fine-tuning Large Language Models
Arxiv 2024 [Paper]
Parameter Efficient Quasi-Orthogonal Fine-Tuning via Givens Rotation
Arxiv 2024 [Paper]
Training LLMs over Neurally Compressed Text
Arxiv 2024 [Paper]
TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
Arxiv 2024 [Paper] [Code]
SnapKV: LLM Knows What You are Looking for Before Generation
Arxiv 2024 [Paper] [Code]
Characterizing the Accuracy - Efficiency Trade-off of Low-rank Decomposition in Language Models
Arxiv 2024 [Paper]
KV-Runahead: Scalable Causal LLM Inference by Parallel Key-Value Cache Generation
ICML 2024 [Paper]
Token-wise Influential Training Data Retrieval for Large Language Models
ACL 2024 [Paper] [Code]
Basis Selection: Low-Rank Decomposition of Pretrained Large Language Models for Target Applications
Arxiv 2024 [Paper]
Demystifying the Compression of Mixture-of-Experts Through a Unified Framework
Arxiv 2024 [Paper] [Code]
LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference
Arxiv 2024 [Paper]
AdaCoder: Adaptive Prompt Compression for Programmatic Visual Question Answering
Arxiv 2024 [Paper]
CaM: Cache Merging for Memory-efficient LLMs Inference
ICML 2024 [Paper] [Code]
CLLMs: Consistency Large Language Models
ICML 2024 [Paper] [Code]
MoDeGPT: Modular Decomposition for Large Language Model Compression
Arxiv 2024 [Paper]
Accelerating Large Language Model Training with Hybrid GPU-based Compression
Arxiv 2024 [Paper]
Language Models as Zero-shot Lossless Gradient Compressors: Towards General Neural Parameter Prior Models
NeurIPS 2024 [Paper]
KV-Compress: Paged KV-Cache Compression with Variable Compression Rates per Attention Head
Arxiv 2024 [Paper]
InfiniPot: Infinite Context Processing on Memory-Constrained LLMs
EMNLP 2024 [Paper]
SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration
Arxiv 2024 [Paper] [Code]
UNComp: Uncertainty-Aware Long-Context Compressor for Efficient Large Language Model Inference
Arxiv 2024 [Paper]
Basis Sharing: Cross-Layer Parameter Sharing for Large Language Model Compression
Arxiv 2024 [Paper] [Code]
Rodimus*: Breaking the Accuracy-Efficiency Trade-Off with Efficient Attentions
Arxiv 2024 [Paper]
DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
Arxiv 2024 [Paper] [Code]
Progressive Mixed-Precision Decoding for Efficient LLM Inference
Arxiv 2024 [Paper]
EoRA: Training-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation
Arxiv 2024 [Paper]
LLMCBench: Benchmarking Large Language Model Compression for Efficient Deployment
NeurIPS 2024 Datasets and Benchmarks Track [Paper] [Code]
NeuZip: Memory-Efficient Training and Inference with Dynamic Compression of Neural Networks
Arxiv 2024 [Paper] [Code]
LLM Vocabulary Compression for Low-Compute Environments
Machine Learning and Compression Workshop @ NeurIPS 2024 [Paper]
Tools
BMCook: Model Compression for Big Models [Code]
llama.cpp: Inference of LLaMA model in pure C/C++ [Code]
LangChain: Building applications with LLMs through composability [Code]
GPTQ-for-LLaMA: 4 bits quantization of LLaMA using GPTQ [Code]
Alpaca-CoT: An Instruction Fine-Tuning Platform with Instruction Data Collection and Unified Large Language Models Interface [Code]
Also, if you could recommend any small models that are compatible with the Nemotron family, that would be very helpful! I would like to do speculative sampling using llama.cpp's draft model feature, but all the Llama 3.1/3.2 models apparently have a slightly different vocabulary from this one.
I can reproduce your problem with several GGUFs. To pinpoint it, I gradually reduced the input and found that the problem appears once the prompt is near or above 4K tokens.
I have reported this to llama.cpp as a bug. It would be great if someone there could give me clues as to what went wrong.
https://github.com/ggerganov/llama.cpp/issues/11002
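The "reduce the input gradually" step above can be automated as a bisection over prompt length. This is only a minimal sketch: `smallest_failing_length` and `fake_fails` are hypothetical names, the predicate is a stand-in for actually running the model on a truncated prompt, and it assumes that once a length fails, every longer length fails too.

```python
from typing import Callable


def smallest_failing_length(fails: Callable[[int], bool], lo: int, hi: int) -> int:
    """Return the smallest n in (lo, hi] for which fails(n) is True.

    Assumes fails(lo) is False, fails(hi) is True, and that failure is
    monotonic in the prompt length.
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fails(mid):
            hi = mid  # failure already appears at mid; look at shorter prompts
        else:
            lo = mid  # mid still works; look at longer prompts
    return hi


# Stand-in predicate for illustration only. In practice this would run the
# model on a prompt truncated to n tokens and judge whether the output is
# gibberish (by eye or with some heuristic).
def fake_fails(n: int) -> bool:
    return n >= 4096


print(smallest_failing_length(fake_fails, lo=512, hi=40000))  # prints 4096
```

With a 40,000-token prompt this needs about 16 model runs instead of a linear sweep.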
Thanks a lot for reporting this bug.
I think I fixed the bug. The cause is that convert_hf_to_gguf.py doesn't read rope_theta from config.json, so it was set to 10000.0 instead of 500000.0. I fixed that and generated a Q4_K_M that works with your prompt.
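To illustrate why the wrong rope_theta matters, here is a rough sketch of the failure mode. It is not the actual convert_hf_to_gguf.py code: `rope_theta_from_config` and `rope_wavelengths` are hypothetical helpers, and the head dimension of 128 is an assumption. Each rotary pair i turns by theta^(-2i/d) radians per token, so a smaller base makes the position encoding rotate faster than the model was trained with, and the angle error grows linearly with the token position.

```python
import math


def rope_theta_from_config(config: dict, default: float = 10000.0) -> float:
    """Read rope_theta, silently falling back to a default if the key is absent."""
    return float(config.get("rope_theta", default))


def rope_wavelengths(theta: float, head_dim: int = 128) -> list:
    """Wavelength in tokens of each rotary frequency pair.

    Pair i rotates by theta ** (-2i / head_dim) radians per token, so one full
    turn takes 2*pi * theta ** (2i / head_dim) tokens.
    head_dim=128 is an assumption made for this illustration.
    """
    return [2 * math.pi * theta ** (2 * i / head_dim) for i in range(head_dim // 2)]


correct = rope_theta_from_config({"rope_theta": 500000.0})
wrong = rope_theta_from_config({})  # key never read -> silent fallback to 10000.0

for name, theta in (("correct", correct), ("wrong", wrong)):
    longest = rope_wavelengths(theta)[-1]
    print(f"{name}: rope_theta={theta:.0f}, longest wavelength ~ {longest:,.0f} tokens")
```

The sketch only shows that the two bases give very different rotation speeds. It does not by itself explain why the breakage became visible at around 4K tokens.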
The updated Q4_K_M is already uploaded here. Please download it to verify. There is no need to re-compile llama.cpp.
I will open a pull request at llama.cpp to fix this bug. Once I have regenerated all the GGUFs, I will post an update on LocalLlama.
Thanks again for pointing out the bug. I believe this model should now work with prompts up to 64K, according to Nvidia's RULER results. Let me know if you find other problems.
Wow, thank you so much for really looking into the changes and fixing the bug. I was worried that this was an issue with the original model, but fortunately that's not the case as you pointed out. I've downloaded the Q4_K_M, and it's working just fine for large context sizes (testing with 20k+, it looks good).
This is really amazing. I can now double my token throughput on the same hardware (2.1 tok/s on 70b -> 4.2 tok/s on 51b). Thank you again!
Also, thank you for fixing the vocabulary settings. Now this model works correctly with Llama-compatible draft models.
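For context on why the vocabulary matters: speculative decoding verifies the draft model's proposals by token id, so the draft and target models need the same id-to-token mapping. The sketch below is a generic way to spot differences between two token lists. It is not the check llama.cpp itself performs, `vocab_mismatches` is a hypothetical helper, and the toy vocabularies are made up for illustration.

```python
from typing import List, Optional, Tuple


def vocab_mismatches(
    target: List[str], draft: List[str]
) -> List[Tuple[int, Optional[str], Optional[str]]]:
    """List (token_id, target_token, draft_token) wherever the vocabularies differ.

    An id that exists in only one vocabulary is reported with None on the
    other side.
    """
    out = []
    for i in range(max(len(target), len(draft))):
        t = target[i] if i < len(target) else None
        d = draft[i] if i < len(draft) else None
        if t != d:
            out.append((i, t, d))
    return out


# Toy vocabularies, made up for illustration.
target_vocab = ["<s>", "</s>", "hello", "world", "<pad>"]
draft_vocab = ["<s>", "</s>", "hello", "world", "<extra>"]

print(vocab_mismatches(target_vocab, draft_vocab))  # prints [(4, '<pad>', '<extra>')]
```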
I also thought it was a model issue, but the original model had no problems with long context when I tried it on the NVIDIA website. Thank you for your fix.