matlok's Collections
Papers - MoE
Non-asymptotic oracle inequalities for the Lasso in high-dimensional mixture of experts
Paper • 2009.10622 • Published • 1
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
Paper • 2401.15947 • Published • 49
MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts
Paper • 2401.04081 • Published • 71
MoE-Infinity: Activation-Aware Expert Offloading for Efficient MoE Serving
Paper • 2401.14361 • Published • 2
Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference
Paper • 2308.12066 • Published • 4
EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models
Paper • 2308.14352 • Published
Experts Weights Averaging: A New General Training Scheme for Vision Transformers
Paper • 2308.06093 • Published • 2
Enhancing the "Immunity" of Mixture-of-Experts Networks for Adversarial Defense
Paper • 2402.18787 • Published • 2
CompeteSMoE -- Effective Training of Sparse Mixture of Experts via Competition
Paper • 2402.02526 • Published • 3
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
Paper • 2006.16668 • Published • 3
Scaling Vision with Sparse Mixture of Experts
Paper • 2106.05974 • Published • 3
MegaBlocks: Efficient Sparse Training with Mixture-of-Experts
Paper • 2211.15841 • Published • 7
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer (a minimal top-k gating sketch follows at the end of this list)
Paper • 1701.06538 • Published • 5
OpenMoE: An Early Effort on Open Mixture-of-Experts Language Models
Paper • 2402.01739 • Published • 26
ST-MoE: Designing Stable and Transferable Sparse Expert Models
Paper • 2202.08906 • Published • 2
LocMoE: A Low-overhead MoE for Large Language Model Training
Paper • 2401.13920 • Published • 2
DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale
Paper • 2201.05596 • Published • 2
Pipeline MoE: A Flexible MoE Implementation with Pipeline Parallelism
Paper • 2304.11414 • Published • 2
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
Paper • 2401.06066 • Published • 43
AMEND: A Mixture of Experts Framework for Long-tailed Trajectory Prediction
Paper • 2402.08698 • Published • 2
Routers in Vision Mixture of Experts: An Empirical Study
Paper • 2401.15969 • Published • 2
BASE Layers: Simplifying Training of Large, Sparse Models
Paper • 2103.16716 • Published • 3
DSelect-k: Differentiable Selection in the Mixture of Experts with Applications to Multi-Task Learning
Paper • 2106.03760 • Published • 3
Hash Layers For Large Sparse Models
Paper • 2106.04426 • Published • 2
Direct Neural Machine Translation with Task-level Mixture of Experts models
Paper • 2310.12236 • Published • 2
Adaptive Gating in Mixture-of-Experts based Language Models
Paper • 2310.07188 • Published • 2
Merge, Then Compress: Demystify Efficient SMoE with Hints from Its Routing Policy
Paper • 2310.01334 • Published • 3
Mobile V-MoEs: Scaling Down Vision Transformers via Sparse Mixture-of-Experts
Paper • 2309.04354 • Published • 13
Towards More Effective and Economic Sparsely-Activated Model
Paper • 2110.07431 • Published • 2
Taming Sparsely Activated Transformer with Stochastic Experts
Paper • 2110.04260 • Published • 2
Beyond Distillation: Task-level Mixture-of-Experts for Efficient Inference
Paper • 2110.03742 • Published • 3
FedJETs: Efficient Just-In-Time Personalization with Federated Mixture of Experts
Paper • 2306.08586 • Published • 1
Mixture-of-Supernets: Improving Weight-Sharing Supernet Training with Architecture-Routed Mixture-of-Experts
Paper • 2306.04845 • Published • 4
Balanced Mixture of SuperNets for Learning the CNN Pooling Architecture
Paper • 2306.11982 • Published • 2
MoAI: Mixture of All Intelligence for Large Language and Vision Models
Paper • 2403.07508 • Published • 75
Unified Scaling Laws for Routed Language Models
Paper • 2202.01169 • Published • 2
QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models
Paper • 2310.16795 • Published • 26
Jamba: A Hybrid Transformer-Mamba Language Model
Paper • 2403.19887 • Published • 104
Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints
Paper • 2212.05055 • Published • 5
JetMoE: Reaching Llama2 Performance with 0.1M Dollars
Paper • 2404.07413 • Published • 36
Fast Feedforward Networks
Paper • 2308.14711 • Published • 2
MoDE: CLIP Data Experts via Clustering
Paper • 2404.16030 • Published • 12
Paper • 2407.10671 • Published • 155
Mixture of A Million Experts
Paper • 2407.04153 • Published • 4
Jamba-1.5: Hybrid Transformer-Mamba Models at Scale
Paper • 2408.12570 • Published • 30
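
Nearly every entry in this collection builds on the sparsely-gated Mixture-of-Experts layer (1701.06538): a learned router scores each token against a set of expert feed-forward blocks, only the top-k experts run, and their outputs are combined with the renormalized router weights. The sketch below illustrates that top-k gating pattern in PyTorch; the class name TopKGatedMoE, the layer sizes, and the softmax-over-selected-logits gate are illustrative assumptions, not the exact formulation of any single paper listed above.

# Minimal top-k sparsely-gated MoE sketch (illustrative, not from any one paper).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKGatedMoE(nn.Module):
    """Each token is routed to its top-k experts and their outputs are mixed."""

    def __init__(self, d_model: int, d_hidden: int, num_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        # Each expert is an independent two-layer feed-forward block.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )
        # The router produces one logit per expert for every token.
        self.router = nn.Linear(d_model, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        logits = self.router(x)                            # (tokens, experts)
        topk_vals, topk_idx = logits.topk(self.k, dim=-1)  # keep only k experts per token
        gates = F.softmax(topk_vals, dim=-1)               # renormalize over the chosen k
        out = torch.zeros_like(x)
        # Plain loop over experts for clarity; real systems batch-dispatch tokens per expert.
        for slot in range(self.k):
            expert_ids = topk_idx[:, slot]
            for e, expert in enumerate(self.experts):
                mask = expert_ids == e
                if mask.any():
                    out[mask] += gates[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

if __name__ == "__main__":
    layer = TopKGatedMoE(d_model=64, d_hidden=256, num_experts=8, k=2)
    tokens = torch.randn(10, 64)
    print(layer(tokens).shape)  # torch.Size([10, 64])

With k much smaller than the number of experts, each token pays the cost of only k feed-forward blocks, which is the conditional-computation idea that the routing, serving, and scaling papers in this list refine in different directions.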