stereoplegic's Collections

QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models (arXiv:2310.16795)
Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference (arXiv:2308.12066)
Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference (arXiv:2303.06182)
EvoMoE: An Evolutional Mixture-of-Experts Training Framework via Dense-To-Sparse Gate (arXiv:2112.14397)
From Sparse to Soft Mixtures of Experts (arXiv:2308.00951)
Experts Weights Averaging: A New General Training Scheme for Vision Transformers (arXiv:2308.06093)
ShiftAddViT: Mixture of Multiplication Primitives Towards Efficient Vision Transformer (arXiv:2306.06446)
Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints (arXiv:2212.05055)
SMILE: Scaling Mixture-of-Experts with Efficient Bi-level Routing (arXiv:2212.05191)
Patch-level Routing in Mixture-of-Experts is Provably Sample-efficient for Convolutional Neural Networks (arXiv:2306.04073)
Multi-Head Adapter Routing for Cross-Task Generalization (arXiv:2211.03831)
Improving Visual Prompt Tuning for Self-supervised Vision Transformers (arXiv:2306.05067)
A Unified View of Long-Sequence Models towards Modeling Million-Scale Dependencies (arXiv:2302.06218)
Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception (arXiv:2305.06324)
Sparse Backpropagation for MoE Training (arXiv:2310.00811)
Zorro: the masked multimodal transformer (arXiv:2301.09595)
DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale (arXiv:2201.05596)
Approximating Two-Layer Feedforward Networks for Efficient Transformers (arXiv:2310.10837)
Sparsifiner: Learning Sparse Instance-Dependent Attention for Efficient Vision Transformers (arXiv:2303.13755)
Mixture of Tokens: Efficient LLMs through Cross-Example Aggregation (arXiv:2310.15961)
LoRA ensembles for large language model fine-tuning (arXiv:2310.00035)
Build a Robust QA System with Transformer-based Mixture of Experts (arXiv:2204.09598)
PaCE: Unified Multi-modal Dialogue Pre-training with Progressive and Compositional Experts (arXiv:2305.14839)
A Mixture-of-Expert Approach to RL-based Dialogue Management (arXiv:2206.00059)
Spatial Mixture-of-Experts (arXiv:2211.13491)
FastMoE: A Fast Mixture-of-Expert Training System (arXiv:2103.13262)
SE-MoE: A Scalable and Efficient Mixture-of-Experts Distributed Training and Inference System (arXiv:2205.10034)
Eliciting and Understanding Cross-Task Skills with Task-Level Mixture-of-Experts (arXiv:2205.12701)
FEAMOE: Fair, Explainable and Adaptive Mixture of Experts (arXiv:2210.04995)
On the Adversarial Robustness of Mixture of Experts (arXiv:2210.10253)
HMOE: Hypernetwork-based Mixture of Experts for Domain Generalization (arXiv:2211.08253)
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity (arXiv:2101.03961)
A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training (arXiv:2303.06318)
Mixture-of-Supernets: Improving Weight-Sharing Supernet Training with Architecture-Routed Mixture-of-Experts (arXiv:2306.04845)
AutoMoE: Heterogeneous Mixture-of-Experts with Adaptive Computation for Efficient Neural Machine Translation (arXiv:2210.07535)
Optimizing Mixture of Experts using Dynamic Recompilations (arXiv:2205.01848)
Towards Understanding Mixture of Experts in Deep Learning (arXiv:2208.02813)
Learning Factored Representations in a Deep Mixture of Experts (arXiv:1312.4314)
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer (arXiv:1701.06538)
GLaM: Efficient Scaling of Language Models with Mixture-of-Experts (arXiv:2112.06905)
Contextual Mixture of Experts: Integrating Knowledge into Predictive Modeling (arXiv:2211.00558)
Taming Sparsely Activated Transformer with Stochastic Experts (arXiv:2110.04260)
Heterogeneous Multi-task Learning with Expert Diversity (arXiv:2106.10595)
SteloCoder: a Decoder-Only LLM for Multi-Language to Python Code Translation (arXiv:2310.15539)
LoraHub: Efficient Cross-Task Generalization via Dynamic LoRA Composition (arXiv:2307.13269)
SkillNet-NLG: General-Purpose Natural Language Generation with a Sparsely Activated Approach (arXiv:2204.12184)
SkillNet-NLU: A Sparsely Activated Model for General-Purpose Natural Language Understanding (arXiv:2203.03312)
Residual Mixture of Experts (arXiv:2204.09636)
Parameter-Efficient Mixture-of-Experts Architecture for Pre-trained Language Models (arXiv:2203.01104)
Emergent Mixture-of-Experts: Can Dense Pre-trained Transformers Benefit from Emergent Modular Structures? (arXiv:2310.10908)
One Student Knows All Experts Know: From Sparse to Dense (arXiv:2201.10890)
HetuMoE: An Efficient Trillion-scale Mixture-of-Expert Distributed Training System (arXiv:2203.14685)
Edge-MoE: Memory-Efficient Multi-Task Vision Transformer Architecture with Task-level Sparsity via Mixture-of-Experts (arXiv:2305.18691)
An Efficient General-Purpose Modular Vision Model via Multi-Task Heterogeneous Training (arXiv:2306.17165)
MegaBlocks: Efficient Sparse Training with Mixture-of-Experts (arXiv:2211.15841)
SpeechMoE: Scaling to Large Acoustic Models with Dynamic Routing Mixture of Experts (arXiv:2105.03036)
Language-Routing Mixture of Experts for Multilingual and Code-Switching Speech Recognition (arXiv:2307.05956)
M6-T: Exploring Sparse Expert Models and Beyond (arXiv:2105.15082)
Cross-token Modeling with Conditional Computation (arXiv:2109.02008)
Sparsely-gated Mixture-of-Expert Layers for CNN Interpretability (arXiv:2204.10598)
Efficient Language Modeling with Sparse all-MLP (arXiv:2203.06850)
Efficient Large Scale Language Modeling with Mixtures of Experts (arXiv:2112.10684)
TAME: Task Agnostic Continual Learning using Multiple Experts (arXiv:2210.03869)
Learning an evolved mixture model for task-free continual learning (arXiv:2207.05080)
Model Spider: Learning to Rank Pre-Trained Models Efficiently (arXiv:2306.03900)
Task-Specific Expert Pruning for Sparse Mixture-of-Experts (arXiv:2206.00277)
SiRA: Sparse Mixture of Low Rank Adaptation (arXiv:2311.09179)
Pushing Mixture of Experts to the Limit: Extremely Parameter Efficient MoE for Instruction Tuning (arXiv:2309.05444)
MoEC: Mixture of Expert Clusters (arXiv:2207.09094)
Image Super-resolution Via Latent Diffusion: A Sampling-space Mixture Of Experts And Frequency-augmented Decoder Approach (arXiv:2310.12004)
A General Theory for Softmax Gating Multinomial Logistic Mixture of Experts (arXiv:2310.14188)
Extending Mixture of Experts Model to Investigate Heterogeneity of Trajectories: When, Where and How to Add Which Covariates (arXiv:2007.02432)
Mixture of experts models for multilevel data: modelling framework and approximation theory (arXiv:2209.15207)
ComPEFT: Compression for Communicating Parameter Efficient Updates via Sparsification and Quantization (arXiv:2311.13171)
The Information Pathways Hypothesis: Transformers are Dynamic Self-Ensembles (arXiv:2306.01705)
Exponentially Faster Language Modelling (arXiv:2311.10770)
Scaling Expert Language Models with Unsupervised Domain Discovery (arXiv:2303.14177)
Hash Layers For Large Sparse Models (arXiv:2106.04426)
Memory-efficient NLLB-200: Language-specific Expert Pruning of a Massively Multilingual Machine Translation Model (arXiv:2212.09811)
Exploiting Transformer Activation Sparsity with Dynamic Inference (arXiv:2310.04361)
Mixture of Quantized Experts (MoQE): Complementary Effect of Low-bit Quantization and Robustness (arXiv:2310.02410)
Punica: Multi-Tenant LoRA Serving (arXiv:2310.18547)
Merging Experts into One: Improving Computational Efficiency of Mixture of Experts (arXiv:2310.09832)
Adaptive Gating in Mixture-of-Experts based Language Models (arXiv:2310.07188)
Making Small Language Models Better Multi-task Learners with Mixture-of-Task-Adapters (arXiv:2309.11042)
SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention (arXiv:2312.07987)
Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts (arXiv:2312.00968)
Memory Augmented Language Models through Mixture of Word Experts (arXiv:2311.10768)
Routing to the Expert: Efficient Reward-guided Ensemble of Large Language Models (arXiv:2311.08692)
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models (arXiv:2401.06066)
Mixture of Attention Heads: Selecting Attention Heads Per Token (arXiv:2210.05144)
Direct Neural Machine Translation with Task-level Mixture of Experts models (arXiv:2310.12236)
Mixture-of-Linguistic-Experts Adapters for Improving and Interpreting Pre-trained Language Models (arXiv:2310.16240)
Parameter-Efficient Sparsity Crafting from Dense to Mixture-of-Experts for Instruction Tuning on General Tasks (arXiv:2401.02731)
Parameter-Efficient Conformers via Sharing Sparsely-Gated Experts for End-to-End Speech Recognition (arXiv:2209.08326)
Mixture-of-experts VAEs can disregard variation in surjective multimodal data (arXiv:2204.05229)
One Model, Multiple Modalities: A Sparsely Activated Approach for Text, Sound, Image, Video and Code (arXiv:2205.06126)
Specialized Language Models with Cheap Inference from Limited Domain Data (arXiv:2402.01093)
BlackMamba: Mixture of Experts for State-Space Models (arXiv:2402.01771)
OpenMoE: An Early Effort on Open Mixture-of-Experts Language Models (arXiv:2402.01739)
Fast Inference of Mixture-of-Experts Language Models with Offloading (arXiv:2312.17238)
Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference (arXiv:2401.08383)
A Review of Sparse Expert Models in Deep Learning (arXiv:2209.01667)
Robust Mixture-of-Expert Training for Convolutional Neural Networks (arXiv:2308.10110)
On the Representation Collapse of Sparse Mixture of Experts (arXiv:2204.09179)
StableMoE: Stable Routing Strategy for Mixture of Experts (arXiv:2204.08396)
DSelect-k: Differentiable Selection in the Mixture of Experts with Applications to Multi-Task Learning (arXiv:2106.03760)
CPM-2: Large-scale Cost-effective Pre-trained Language Models (arXiv:2106.10715)
Demystifying Softmax Gating Function in Gaussian Mixture of Experts (arXiv:2305.03288)
Statistical Perspective of Top-K Sparse Softmax Gating Mixture of Experts (arXiv:2309.13850)
Sparse Mixture-of-Experts are Domain Generalizable Learners (arXiv:2206.04046)
Unified Scaling Laws for Routed Language Models (arXiv:2202.01169)
Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts (arXiv:2206.02770)
Merge, Then Compress: Demystify Efficient SMoE with Hints from Its Routing Policy (arXiv:2310.01334)
ST-MoE: Designing Stable and Transferable Sparse Expert Models (arXiv:2202.08906)
Mobile V-MoEs: Scaling Down Vision Transformers via Sparse Mixture-of-Experts (arXiv:2309.04354)
A non-asymptotic approach for model selection via penalization in high-dimensional mixture of experts models (arXiv:2104.02640)
Non-asymptotic oracle inequalities for the Lasso in high-dimensional mixture of experts (arXiv:2009.10622)
Fast Feedforward Networks (arXiv:2308.14711)
Mixture-of-Experts with Expert Choice Routing (arXiv:2202.09368)
Go Wider Instead of Deeper (arXiv:2107.11817)
Sparse Mixers: Combining MoE and Mixing to build a more efficient BERT (arXiv:2205.12399)
AdaMix: Mixture-of-Adaptations for Parameter-efficient Model Tuning (arXiv:2205.12410)
Mixtures of Experts Unlock Parameter Scaling for Deep RL (arXiv:2402.08609)
Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models (arXiv:2402.07033)
MoAI: Mixture of All Intelligence for Large Language and Vision Models (arXiv:2403.07508)
Scattered Mixture-of-Experts Implementation (arXiv:2403.08245)
Sparse Universal Transformer (arXiv:2310.07096)
Multi-Head Mixture-of-Experts (arXiv:2404.15045)
LoRAMoE: Revolutionizing Mixture of Experts for Maintaining World Knowledge in Language Model Alignment (arXiv:2312.09979)
JetMoE: Reaching Llama2 Performance with 0.1M Dollars (arXiv:2404.07413)
Learning to Route Among Specialized Experts for Zero-Shot Generalization (arXiv:2402.05859)
MoELoRA: Contrastive Learning Guided Mixture of Experts on Parameter-Efficient Fine-Tuning for Large Language Models (arXiv:2402.12851)
MoEUT: Mixture-of-Experts Universal Transformers (arXiv:2405.16039)
Yuan 2.0-M32: Mixture of Experts with Attention Router (arXiv:2405.17976)
Enhancing Fast Feed Forward Networks with Load Balancing and a Master Leaf Node (arXiv:2405.16836)
Mixture of A Million Experts (arXiv:2407.04153)
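
Most entries in this collection build on the same primitive: a learned router scores experts per token, only the top-k experts run, and their outputs are mixed with the renormalized router weights. The sketch below is a minimal, generic PyTorch illustration of that idea under assumed names (TopKMoE, num_experts, k, and the expert MLP shape are all illustrative choices), not the implementation from any specific paper listed above.

```python
# Minimal sketch of sparsely gated top-k Mixture-of-Experts routing.
# Assumed/illustrative: class and parameter names, expert MLP shape.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMoE(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, num_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts)  # token -> expert logits
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model) -> flatten to individual tokens for routing
        tokens = x.reshape(-1, x.shape[-1])
        logits = self.router(tokens)                        # (num_tokens, num_experts)
        weights, indices = logits.topk(self.k, dim=-1)      # keep k experts per token
        weights = F.softmax(weights, dim=-1)                # renormalize over the chosen k

        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            # which tokens selected expert e, and in which of their k slots
            token_idx, slot = (indices == e).nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue  # expert receives no tokens this step
            out[token_idx] += weights[token_idx, slot, None] * expert(tokens[token_idx])
        return out.reshape_as(x)


if __name__ == "__main__":
    layer = TopKMoE(d_model=64, d_hidden=256)
    y = layer(torch.randn(2, 10, 64))
    print(y.shape)  # torch.Size([2, 10, 64])
```

This loop-over-experts form is only for readability; the systems papers in the list (e.g. MegaBlocks, FastMoE, DeepSpeed-MoE, Scattered Mixture-of-Experts Implementation) are largely about replacing it with batched or block-sparse kernels and with load-balanced, distributed dispatch.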