Mix-LN: Unleashing the Power of Deeper Layers by Combining Pre-LN and Post-LN • arXiv:2412.13795 • Published Dec 18, 2024
Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers • arXiv:2303.01610 • Published Mar 2, 2023
From GaLore to WeLore: How Low-Rank Weights Non-uniformly Emerge from Low-Rank Gradients • arXiv:2407.11239 • Published Jul 15, 2024
Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients • arXiv:2407.08296 • Published Jul 11, 2024
Found in the Middle: How Language Models Use Long Contexts Better via Plug-and-Play Positional Encoding • arXiv:2403.04797 • Published Mar 5, 2024
JoMA: Demystifying Multilayer Transformers via JOint Dynamics of MLP and Attention • arXiv:2310.00535 • Published Oct 1, 2023
Robust Weight Signatures: Gaining Robustness as Easy as Patching Weights? • arXiv:2302.12480 • Published Feb 24, 2023
You are caught stealing my winning lottery ticket! Making a lottery ticket claim its ownership • arXiv:2111.00162 • Published Oct 30, 2021
Merge, Then Compress: Demystify Efficient SMoE with Hints from Its Routing Policy • arXiv:2310.01334 • Published Oct 2, 2023
Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference • arXiv:2402.09398 • Published Feb 14, 2024
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection • arXiv:2403.03507 • Published Mar 6, 2024
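
The last entry names its core mechanism directly in the title: gradient low-rank projection. As a reading aid, here is a minimal sketch of that idea. Everything below is an illustrative assumption, not the paper's reference implementation: the GaLoreAdam class name, the hyperparameter defaults (rank, update_proj_gap, scale), and the toy training loop are all made up for this example.

```python
import torch

class GaLoreAdam:
    """Minimal sketch of gradient low-rank projection (GaLore, arXiv:2403.03507).

    Adam moments are stored in a rank-r subspace of the gradient, so optimizer
    state shrinks from (m, n) to roughly (r, n) per weight matrix. Simplified
    illustration under stated assumptions: a real implementation handles
    per-layer shapes, projecting along the smaller dimension, and moment
    carry-over across projector refreshes.
    """

    def __init__(self, rank=8, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
                 update_proj_gap=200, scale=0.25):
        self.rank, self.lr, self.eps = rank, lr, eps
        self.b1, self.b2 = betas
        self.update_proj_gap, self.scale = update_proj_gap, scale
        self.step_count = 0
        self.local_step = 0     # steps since the projector was last refreshed
        self.P = None           # (m, r) projector: top-r left singular vectors
        self.m = self.v = None  # Adam moments, kept in the low-rank space

    @torch.no_grad()
    def step(self, weight: torch.Tensor, grad: torch.Tensor):
        self.step_count += 1
        # Periodically refresh the projector from the current gradient's SVD.
        if self.P is None or (self.step_count - 1) % self.update_proj_gap == 0:
            U, _, _ = torch.linalg.svd(grad, full_matrices=False)
            self.P = U[:, : self.rank]                       # (m, r)
            self.m = torch.zeros(self.rank, grad.shape[1])   # reset moments
            self.v = torch.zeros_like(self.m)
            self.local_step = 0
        self.local_step += 1
        # Project the full gradient into the rank-r subspace.
        R = self.P.T @ grad                                  # (r, n)
        # Standard Adam update, but on the projected gradient.
        self.m = self.b1 * self.m + (1 - self.b1) * R
        self.v = self.b2 * self.v + (1 - self.b2) * R * R
        m_hat = self.m / (1 - self.b1 ** self.local_step)
        v_hat = self.v / (1 - self.b2 ** self.local_step)
        step_dir = m_hat / (v_hat.sqrt() + self.eps)
        # Project the update back to full size and apply it.
        weight -= self.lr * self.scale * (self.P @ step_dir)


# Toy usage: fit one weight matrix to a random linear-regression target.
torch.manual_seed(0)
W = torch.randn(64, 32)
opt = GaLoreAdam(rank=8)
x, y = torch.randn(16, 32), torch.randn(16, 64)
for _ in range(100):
    resid = x @ W.T - y                  # (16, 64)
    grad = 2 * resid.T @ x / x.shape[0]  # dMSE/dW, shape (64, 32)
    opt.step(W, grad)
print("final loss:", (x @ W.T - y).pow(2).mean().item())
```

The memory saving comes from the moments `m` and `v` living in an (r, n) space instead of the full (m, n) space, while the SVD cost of refreshing the projector amortizes over `update_proj_gap` steps.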