Locutusque
posted an update Mar 27
Exciting news! 🎉 I've created the OpenCerebrum datasets, open-source alternatives to Aether Research's proprietary Cerebrum dataset.

The first, OpenCerebrum SFT, is a text-generation and question-answering dataset with ~1.2M examples, curated from sources like Open-Orca, glaiveai, camel-ai, and more! 📚

The second, OpenCerebrum DPO, is a smaller dataset with ~21k examples, intended for Direct Preference Optimization (DPO). It's curated from sources like jondurbin, argilla, grimulkan, and others. 📊

Both datasets are licensed under Apache-2.0 and are available in English. They're ready for use in your projects, and I welcome any feedback for future improvements! 🚀

Locutusque/OpenCerebrum-dpo
Locutusque/OpenCerebrum-SFT
Locutusque/OpenCerebrum-1.0-7b-SFT
Locutusque/OpenCerebrum-1.0-7b-DPO

feels a bit disingenuous to try and claim that it's an "Open Cerebrum" to me? the entire point of cerebrum's work, from my perspective, is their dataset in the first place w/ its relatively small size, targeted concepts, and (presumably) human-written-ness (or at least that's what they imply). a collection of synthetic data from random datasets, even with care taken to filter things, doesn't reaaaally feel very close to me?

regardless, nice work! even if it's not an exact replication in my book it could always be useful for something

You're right. I did mention in the dataset card that it does not match the size of the Cerebrum dataset. That is something I'm going to try to achieve in the future, and this release is a way to test how I would go about structuring such a dataset. For now I'm trying to achieve the same performance, then I'll work towards structuring it similarly to the Cerebrum dataset. Thank you for holding me accountable about this.

This is very cool! @dvilasuero check this out!

Super cool release, thank you for sharing these datasets with the community! I'm not familiar with Aether Research or their Cerebrum dataset - is this something that has been used to train other open LLMs?

https://huggingface.co/AetherResearch/Cerebrum-1.0-7b. As I mentioned earlier, although OpenCerebrum is a bit different from the proprietary dataset created by Aether Research, I'm using it as a foundation to hopefully match it in the future.