anakin87 
posted an update 11 days ago
Tulu 3 SFT Mixture by AllenAI is a large, high-quality, multilingual dataset for fine-tuning language models.

Unfortunately, it was missing the "language" column.

I added it using the good old fastText.
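
For anyone curious, here is a minimal sketch of how a "language" column could be added with fastText. It is not necessarily the exact script used for this dataset. It assumes each row has a "messages" list of {"role", "content"} dicts and that detecting the language of the first user message is good enough; the helper names (first_user_text, make_fasttext_detector, add_language) are hypothetical, and the model path is whatever language identification model you have downloaded locally.

```python
from typing import Callable, Dict, List

LABEL_PREFIX = "__label__"  # fastText prefixes every predicted label with this


def first_user_text(messages: List[Dict[str, str]]) -> str:
    """Return the content of the first user message, or "" if there is none."""
    for message in messages:
        if message.get("role") == "user":
            return message.get("content", "")
    return ""


def make_fasttext_detector(model_path: str) -> Callable[[str], str]:
    """Build a text -> language code function backed by a fastText model."""
    import fasttext  # imported lazily so the other helpers work without it

    model = fasttext.load_model(model_path)

    def detect(text: str) -> str:
        # fastText's predict() rejects newlines, so flatten the text first.
        labels, _probs = model.predict(text.replace("\n", " "), k=1)
        return labels[0][len(LABEL_PREFIX):] if labels else "unknown"

    return detect


def add_language(row: Dict, detect: Callable[[str], str]) -> Dict:
    """Return a copy of the row with an extra "language" field."""
    text = first_user_text(row.get("messages", []))
    language = detect(text) if text.strip() else "unknown"
    return {**row, "language": language}
```

With the Hugging Face datasets library, something like ds.map(lambda row: add_language(row, detect)) would apply this to every row, where detect comes from make_fasttext_detector("lid.176.bin") (fastText's published language identification model, assumed to be downloaded already). Because the detector is passed in as a plain function, swapping fastText for another detector only means changing that one argument.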

Check out the dataset here 👉 anakin87/tulu-3-sft-mixture-with-language

@Mollel created another dataset using Glot for language detection instead of fastText.

https://huggingface.co/datasets/sartifyllc/tulu-3-sft-mixture-language-glot

Good work!
