arxiv:2406.12034

Self-MoE: Towards Compositional Large Language Models with Self-Specialized Experts

Published on Jun 17, 2024 · Submitted by jm-kang on Jun 20, 2024 · #3 Paper of the day

Abstract

We present Self-MoE, an approach that transforms a monolithic LLM into a compositional, modular system of self-specialized experts, named MiXSE (MiXture of Self-specialized Experts). Our approach leverages self-specialization, which constructs expert modules using self-generated synthetic data, each equipped with a shared base LLM and incorporating self-optimized routing. This allows dynamic, capability-specific handling of various target tasks, enhancing overall capabilities without extensive human-labeled data or added parameters. Our empirical results reveal that specializing LLMs can incur trade-offs in performance on non-specialized tasks. In contrast, our Self-MoE demonstrates substantial improvements over the base LLM across diverse benchmarks such as knowledge, reasoning, math, and coding. It also consistently outperforms other methods, including instance merging and weight merging, while offering better flexibility and interpretability by design, with semantic experts and routing. Our findings highlight the critical role of modularity and the potential of self-improvement in achieving efficient, scalable, and adaptable systems.
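The abstract describes a shared frozen base LLM augmented with lightweight self-specialized experts and a self-optimized router. The following is a minimal PyTorch sketch of that composition, not the authors' code: the class names, the LoRA-style adapter form, the per-token softmax routing, and all hyperparameters are illustrative assumptions.

```python
# Hedged sketch of the MiXSE idea: a frozen shared base layer plus lightweight
# self-specialized experts, combined by a learned router. Names are illustrative.
import torch
import torch.nn as nn

class LoRAExpert(nn.Module):
    """Low-rank adapter standing in for one self-specialized expert."""
    def __init__(self, d_model: int, rank: int = 8):
        super().__init__()
        self.down = nn.Linear(d_model, rank, bias=False)
        self.up = nn.Linear(rank, d_model, bias=False)
        nn.init.zeros_(self.up.weight)  # start as a no-op around the base layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x))

class MiXSELayer(nn.Module):
    """Shared frozen base layer + routed self-specialized experts."""
    def __init__(self, base_layer: nn.Module, d_model: int, num_experts: int = 4):
        super().__init__()
        self.base = base_layer
        for p in self.base.parameters():
            p.requires_grad = False  # the base LLM's weights stay intact
        self.experts = nn.ModuleList([LoRAExpert(d_model) for _ in range(num_experts)])
        self.router = nn.Linear(d_model, num_experts)  # self-optimized routing

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.base(x)                                             # [B, S, d]
        weights = torch.softmax(self.router(x), dim=-1)              # [B, S, E]
        expert_out = torch.stack([e(x) for e in self.experts], -1)   # [B, S, d, E]
        # Add the routed mixture of expert adaptations to the base output.
        return h + torch.einsum("bse,bsde->bsd", weights, expert_out)
```

Because only the adapters and the router carry trainable weights, the trainable-parameter count stays small relative to the base model, which matches the paper's claim of adding capability without extensive added parameters.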

Community

Paper author and submitter (jm-kang):

Exploring the following question: "How can we build compositional LLMs that enjoy versatile expertise, while using minimal resources?"

We introduce Self-MoE, an approach that transforms a monolithic model into a compositional system, called MiXSE (MiXture of Self-specialized Experts).

[Figure: overview of Self-MoE transforming a monolithic LLM into MiXSE]

Self-MoE constructs individual lightweight expert modules from scratch using synthetic data, inspired by the concept of self-specialization. Each module is integrated with the base LLM, and the entire system is enhanced by a self-optimized routing mechanism. In contrast to monolithic models, which often suffer from forgetting issues when adapted or merged under fixed, static parameters, our modular design preserves the integrity and semantics of each expert. This allows for dynamic, precise handling of various target domain tasks, boosting the model’s overall capability and adaptability.
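For concreteness, here is a hedged sketch of that self-specialization loop: the base model drafts new domain-specific instructions from a handful of seeds and then answers them, yielding a synthetic dataset for one expert. `llm_generate`, the prompt templates, and the dataset size are placeholders, not the paper's exact recipe.

```python
# Hedged sketch of self-specialization: the base LLM generates its own
# synthetic training data for each target capability; a lightweight expert
# adapter is then tuned on that data (see the MiXSE sketch above).
from dataclasses import dataclass

def llm_generate(prompt: str) -> str:
    """Placeholder for a call to the base LLM (e.g., via transformers' generate)."""
    raise NotImplementedError

@dataclass
class Example:
    instruction: str
    response: str

def self_specialize(domain: str, seeds: list[str], n: int = 1000) -> list[Example]:
    """Build a synthetic dataset for one expert using only the base LLM."""
    data: list[Example] = []
    while len(data) < n:
        # 1. Ask the base model to propose a new domain-specific instruction.
        instruction = llm_generate(
            f"Write one new {domain} problem in the style of:\n" + "\n".join(seeds[:5])
        )
        # 2. Ask the same model to answer its own instruction.
        response = llm_generate(f"Instruction: {instruction}\nResponse:")
        data.append(Example(instruction, response))
    return data
```

Each domain's dataset trains one adapter while the shared base weights stay frozen, which is why the modular design avoids the forgetting issues described above.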


Nice paper 🔥 Are you planning to open-source this work on the Hub?
