S3Eval

Still relying on human intuition to mix corpora from different sources for pre-training 🧠? Everyone says that data mixture has a big impact on model performance, but how, and why 🕵️? Did you know that web corpora are actually highly impactful for downstream tasks 🏆?

Check out our preprint "RegMix: Data Mixture as Regression for Language Model Pre-training" 📄

🔬 In this paper, we propose RegMix, an automatic data mixture method that achieves a 6.3% improvement over human selection on the widely used HellaSwag benchmark, and it needs only 2% extra training FLOPs! 📈

📄 Paper: RegMix: Data Mixture as Regression for Language Model Pre-training (2407.01492)
💻 Code: https://github.com/sail-sg/regmix
📊 Collection: sail/regmix-data-mixture-as-regression-6682b6caab37b9442877f0ce
🎮 Demo: https://huggingface.co/spaces/sail/RegMix

SivilTaram

authored 11 papers 6 months ago

RegMix: Data Mixture as Regression for Language Model Pre-training

Paper • 2407.01492 • Published Jul 1 • 35

Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs

Paper • 2406.09136 • Published Jun 13

BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

Paper • 2406.15877 • Published Jun 22 • 45

Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast

Paper • 2402.08567 • Published Feb 13 • 2

Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses

Paper • 2406.01288 • Published Jun 3 • 1

Bootstrapping Language Models with DPO Implicit Rewards

Paper • 2406.09760 • Published Jun 14 • 38

SivilTaram

posted an update 7 months ago

Introducing Sailor-14B Model and Sailor2 Project 🚢

We're thrilled to announce the release of the Sailor-14B models, including the Base and the Chat versions!

✅Built upon the Qwen1.5-14B model, the Base version follows a procedure similar to that of our Sailor-7B model.
✅The Chat version is optimized using DPO on our in-house human preference dataset, yielding a better experience than our previous Chat models.

🏠Home: https://sailorllm.github.io
🤗Model: sail/Sailor-14B-Chat
💻Demo: sail/Sailor-14B-Chat

We're also excited to introduce the Sailor2 project, ✨ an open collaboration opportunity for the entire community! ✨

🌐 The Sailor2 project aims to build an LLM with ~30B parameters, optimized for multiple South-East Asian languages, including Cebuano, Indonesian, Khmer, Lao, Minangkabau, Malay, Burmese, Sundanese, Javanese, Thai, and Vietnamese.

🎯The model will undergo continual pre-training from a base model proficient in both Chinese and English using nearly 800B SEA tokens, with expected performance comparable to the most advanced commercial models for the above SEA languages.

🤝 Contribute your data, expertise, and ideas to shape the future of open-source LLMs for the SEA region.

🌍 Everyone passionate about the SEA region is welcome aboard! Join the party and get involved by scanning the QR code! 🔍

Let's sail together and enjoy the journey!⚓
