Multi-Layer Transformers Gradient Can be Approximated in Almost Linear Time Paper • 2408.13233 • Published Aug 23, 2024 • 21
Heterogeneous Multi-task Learning with Expert Diversity Paper • 2106.10595 • Published Jun 20, 2021 • 1
Language-Routing Mixture of Experts for Multilingual and Code-Switching Speech Recognition Paper • 2307.05956 • Published Jul 12, 2023 • 1
Beyond Attentive Tokens: Incorporating Token Importance and Diversity for Efficient Vision Transformers Paper • 2211.11315 • Published Nov 21, 2022 • 1