arxiv:2201.10890

One Student Knows All Experts Know: From Sparse to Dense

Published on Jan 26, 2022
Abstract

The human education system trains one student with multiple experts. Mixture-of-Experts (MoE) is a powerful sparse architecture that likewise contains multiple experts. However, a sparse MoE model is prone to overfitting, hard to deploy, and not hardware-friendly for practitioners. In this work, inspired by the human education model, we propose a novel task, knowledge integration, to obtain a dense student model (OneS) that is as knowledgeable as a sparse MoE. We investigate this task with a general training framework consisting of knowledge gathering and knowledge distillation. Specifically, to gather key knowledge from different pre-trained experts, we first investigate four possible knowledge gathering methods, i.e., summation, averaging, Top-K Knowledge Gathering (Top-KG), and Singular Value Decomposition Knowledge Gathering (SVD-KG) proposed in this paper. We then refine the dense student model by knowledge distillation to offset the noise introduced by gathering. On ImageNet, OneS preserves 61.7% of the MoE benefits and achieves 78.4% top-1 accuracy with only 15M parameters. On four natural language processing datasets, OneS obtains 88.2% of the MoE benefits and outperforms the best baseline by 51.7% using the same architecture and training data. In addition, compared with its MoE counterpart, OneS achieves a 3.7x inference speedup thanks to less computation and a hardware-friendly architecture.
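One way to read the knowledge gathering step concretely: take each pre-trained expert's weight matrix, compress it to its most informative directions, and merge the results into a single dense layer that distillation then refines. The PyTorch sketch below illustrates an SVD-style gathering step under stated assumptions; the function name `svd_knowledge_gathering`, the fixed rank, and the averaging of the low-rank reconstructions are illustrative choices, not the paper's exact SVD-KG procedure.

```python
import torch

def svd_knowledge_gathering(expert_weights, rank):
    """Hypothetical sketch: keep the top-`rank` singular directions of each
    expert's weight matrix and average the low-rank reconstructions into a
    single dense student weight (an assumption, not the paper's exact rule)."""
    student = torch.zeros_like(expert_weights[0])
    for w in expert_weights:
        # Truncated SVD: retain only the largest singular values/vectors.
        u, s, vh = torch.linalg.svd(w, full_matrices=False)
        student += u[:, :rank] @ torch.diag(s[:rank]) @ vh[:rank, :]
    return student / len(expert_weights)

# Toy usage: merge four 512x512 expert FFN weight matrices into one dense layer.
experts = [torch.randn(512, 512) for _ in range(4)]
dense_w = svd_knowledge_gathering(experts, rank=64)
print(dense_w.shape)  # torch.Size([512, 512])
```

In the framework described by the abstract, a dense student initialized this way would still carry compression noise, which the subsequent knowledge distillation stage is meant to offset.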

