lgaalves's Collections: mixture-of-experts
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer (arXiv:1701.06538)
Sparse Networks from Scratch: Faster Training without Losing Performance (arXiv:1907.04840)
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models (arXiv:1910.02054)
A Mixture of h-1 Heads is Better than h Heads (arXiv:2005.06537)
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding (arXiv:2006.16668)
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity (arXiv:2101.03961)
FastMoE: A Fast Mixture-of-Expert Training System (arXiv:2103.13262)
BASE Layers: Simplifying Training of Large, Sparse Models (arXiv:2103.16716)
SpeechMoE: Scaling to Large Acoustic Models with Dynamic Routing Mixture of Experts (arXiv:2105.03036)
DSelect-k: Differentiable Selection in the Mixture of Experts with Applications to Multi-Task Learning (arXiv:2106.03760)
Scaling Vision with Sparse Mixture of Experts (arXiv:2106.05974)
Hash Layers For Large Sparse Models (arXiv:2106.04426)
DEMix Layers: Disentangling Domains for Modular Language Modeling (arXiv:2108.05036)
A Machine Learning Perspective on Predictive Coding with PAQ (arXiv:1108.3298)
Efficient Large Scale Language Modeling with Mixtures of Experts (arXiv:2112.10684)
Unified Scaling Laws for Routed Language Models (arXiv:2202.01169)
ST-MoE: Designing Stable and Transferable Sparse Expert Models (arXiv:2202.08906)
Mixture-of-Experts with Expert Choice Routing (arXiv:2202.09368)
Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts (arXiv:2206.02770)
Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models (arXiv:2208.03306)
A Review of Sparse Expert Models in Deep Learning (arXiv:2209.01667)
Sparsity-Constrained Optimal Transport (arXiv:2209.15466)
Mixture of Attention Heads: Selecting Attention Heads Per Token (arXiv:2210.05144)
MegaBlocks: Efficient Sparse Training with Mixture-of-Experts (arXiv:2211.15841)
Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints (arXiv:2212.05055)
Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models (arXiv:2305.14705)
From Sparse to Soft Mixtures of Experts (arXiv:2308.00951)
Approximating Two-Layer Feedforward Networks for Efficient Transformers (arXiv:2310.10837)
QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models (arXiv:2310.16795)
SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention (arXiv:2312.07987)
Mixture of Cluster-conditional LoRA Experts for Vision-language Instruction Tuning (arXiv:2312.12379)
Fast Inference of Mixture-of-Experts Language Models with Offloading (arXiv:2312.17238)
Mixtral of Experts (arXiv:2401.04088)
MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts (arXiv:2401.04081)
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models (arXiv:2401.06066)