Loubna Ben Allal

loubnabnl

AI & ML interests

LLMs, ML for code, Synthetic data

Recent Activity

updated a dataset about 19 hours ago
HuggingFaceTB/smoltalk
updated a collection about 19 hours ago
SmolLM2
updated a collection about 19 hours ago
SmolLM2

Articles

Organizations

loubnabnl's activity

reacted to ArthurZ's post with 🔥 3 days ago
reacted to Xenova's post with 🔥 3 months ago
reacted to dvilasuero's post with 🚀🔥 5 months ago
Today is a huge day in Argilla’s history. We couldn’t be more excited to share this with the community: we’re joining Hugging Face!

We’re embracing a larger mission, becoming part of a brilliant and kind team and a shared vision about the future of AI.

Over the past year, we’ve been collaborating with Hugging Face on countless projects: becoming a launch partner of Docker Spaces, empowering the community to clean Alpaca translations into Spanish and other languages, launching argilla/notus-7b-v1 building on Zephyr’s learnings, running the Data is Better Together initiative with hundreds of community contributors, and releasing argilla/OpenHermesPreferences, one of the largest open preference tuning datasets.

After more than 2,000 Slack messages and over 60 people collaborating for over a year, it already felt like we were part of the same team, pushing in the same direction. After a week of the smoothest transition you can imagine, we’re now the same team.

To those of you who’ve been following us, this won’t be a huge surprise, but it will be a big deal in the coming months. This acquisition means we’ll double down on empowering the community to build and collaborate on high quality datasets, we’ll bring full support for multimodal datasets, and we’ll be in a better place to collaborate with the Open Source AI community. For enterprises, this means that the Enterprise Hub will unlock highly requested features like single sign-on and integration with Inference Endpoints.

As a founder, I am proud of the Argilla team. We're now part of something bigger and a larger team but with the same values, culture, and goals. Grateful to have shared this journey with my beloved co-founders Paco and Amélie.

Finally, huge thanks to the Chief Llama Officer @osanseviero for sparking this and being such a great partner during the acquisition process.

Would love to answer any questions you have so feel free to add them below!
posted an update 6 months ago
🍷 FineWeb technical report is out and so is 📚 FineWeb-Edu, a 1.3 trillion token dataset that outperforms all other open web datasets, with remarkable improvements on educational benchmarks such as MMLU, ARC, and OpenBookQA.

Technical report: HuggingFaceFW/blogpost-fineweb-v1
Dataset: HuggingFaceFW/fineweb-edu

We used Llama 3 generations to train an educational quality classifier, filtering the 15 trillion tokens of FineWeb to select only those with high educational value (an approach also used in Llama 3 and Phi-3 training datasets). We're releasing both FineWeb-Edu and the classifier, along with a larger, less heavily filtered version containing 5.4 trillion tokens.
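As a rough illustration of that filtering step, here is a minimal sketch of scoring a page with an educational-quality classifier; the model id, scoring scale, and threshold below are assumptions, so check the FineWeb-Edu dataset card for the released classifier:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumed model id for the released classifier; verify on the Hub.
model_id = "HuggingFaceFW/fineweb-edu-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

text = "Photosynthesis is the process by which plants convert light into chemical energy..."
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    score = model(**inputs).logits.squeeze().item()  # higher = more educational (assumed 0-5 scale)
keep_page = score >= 3  # example threshold for keeping a page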

You can find more details about the dataset and the experiments we ran in the FineWeb technical report. It's a 45-minute read, but it contains all the secret sauce for building high-quality web datasets.

Enjoy!
reacted to thomwolf's post with 🚀🔥 6 months ago
[New crazy blog post alert] We are releasing an extensive blog post on the science of creating high-quality web-scale datasets, detailing all the steps and learnings behind our recent 15-trillion-token 🍷FineWeb release

Inspired by the distill.pub interactive graphics papers, we set out to write the most extensive, enjoyable and in-depth tech report we could, so prepare for a 45-min read with interactive graphics and all.

And that's not all: in this article we also introduce 📚FineWeb-Edu, a filtered subset of Common Crawl with 1.3T tokens containing only web pages with very high educational content. To our knowledge, FineWeb-Edu outperforms all openly released web-scale datasets by a significant margin on knowledge- and reasoning-intensive benchmarks like MMLU, ARC, and OpenBookQA.

We also make a number of surprising observations about the "quality" of the internet itself, which may challenge some general assumptions about web data (not saying more, I'll let you draw your own conclusions ;)

HuggingFaceFW/blogpost-fineweb-v1
reacted to clefourrier's post with 🔥 7 months ago
Contamination free code evaluations with LiveCodeBench! 🖥️

LiveCodeBench is a new leaderboard, which contains:
- complete code evaluations (on code generation, self repair, code execution, tests)
- my favorite feature: problem selection by publication date 📅

This feature means you can get model scores averaged only over problems released after a model's training data was collected, i.e. problems it cannot have seen. This means... contamination-free code evals! 🚀
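To make the idea concrete, here is a toy sketch of that date-based selection (all data, field names, and the cutoff below are hypothetical; the leaderboard handles this for you):

from datetime import date

# Toy per-problem results standing in for a real evaluation run.
problems = [
    {"id": "two-sum-variant", "release_date": date(2023, 5, 1), "passed": True},
    {"id": "graph-repair", "release_date": date(2024, 1, 15), "passed": False},
    {"id": "interval-dp", "release_date": date(2024, 3, 2), "passed": True},
]

cutoff = date(2023, 9, 1)  # pretend training-data cutoff of the evaluated model
fresh = [p for p in problems if p["release_date"] > cutoff]
pass_rate = sum(p["passed"] for p in fresh) / len(fresh)  # score on post-cutoff problems only
print(f"contamination-free pass rate: {pass_rate:.2f}")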

Check it out!

Blog: https://huggingface.co/blog/leaderboard-livecodebench
Leaderboard: livecodebench/leaderboard

Congrats to @StringChaos @minimario @xu3kev @kingh0730 and @FanjiaYan for the super cool leaderboard!
reacted to Molbap's post with 🤗🚀🔥 8 months ago
🚀🚀 Exciting times for the document AI community!

We're thrilled to announce the release of some of the largest OCR datasets available to the public.
🔥 With over 26 million pages, 18 billion text tokens, and 6TB of data, these resources are a significant leap forward for document AI research.

Here's how to access these datasets quickly:

from datasets import load_dataset

pdfa_dataset = load_dataset('pixparse/pdfa-eng-wds', streaming=True)
IDL_dataset = load_dataset('pixparse/idl-wds', streaming=True)
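# Quick sanity check (a sketch, not from the original post): pull one streamed sample
# without downloading the full dataset.
sample = next(iter(pdfa_dataset['train']))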

This enables you to stream them directly, integrating seamlessly with your projects using the Hugging Face datasets library. On the hub, you can find them here:

pixparse/pdfa-eng-wds
pixparse/idl-wds

For lean data loading, the new [chug](https://github.com/huggingface/chug) library offers a solution with PDF decoding:


import chug

task_cfg = chug.DataTaskDocReadCfg(
    page_sampling='all',
)
data_cfg = chug.DataCfg(
    source='pixparse/pdfa-eng-wds',
    split='train',
    batch_size=None,
    format='hfids',
    num_workers=0,
)
data_loader = chug.create_loader(
    data_cfg,
    task_cfg,
)
sample = next(iter(data_loader))



We owe a huge thank you to Peter Wyatt, Kate Tasker, Rachel Taketa, Ali Furkan Biten, Ruben Tito, and their colleagues for their contributions. Their work putting these datasets together has been invaluable. 🤗

Looking Ahead:

We're on a mission to enhance document AI capabilities, and these datasets are just the beginning. With your engagement and innovation, we're confident in the community's ability to develop robust OCR solutions. We encourage you to explore these datasets, experiment with the code, and contribute to the collective progress in document AI.

For detailed information on usage and licensing, please refer to the dataset cards on the Hugging Face hub.
reacted to osanseviero's post with ❤️🔥 8 months ago
Diaries of Open Source. Part 11 🚀

🚀Databricks releases DBRX, potentially the best open-access model! A 132B Mixture of Experts with 36B active params, trained on 12 trillion tokens
Blog: https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm
Base and instruct models: databricks/dbrx-6601c0852a0cdd3c59f71962
Demo: databricks/dbrx-instruct

🤏1-bit and 2-bit quantization exploration using HQQ+
Blog post: https://mobiusml.github.io/1bit_blog/
Models: https://hf.co/collections/mobiuslabsgmbh/llama2-7b-hqq-6604257a96fc8b9c4e13e0fe
GitHub: https://github.com/mobiusml/hqq

📚Cosmopedia: a large-scale synthetic dataset for pre-training - it includes 25 billion tokens and 30 million files
Dataset: HuggingFaceTB/cosmopedia
Blog: https://hf.co/blog/cosmopedia

⭐Mini-Gemini: multi-modal VLMs, from 2B to 34B
Models: https://hf.co/collections/YanweiLi/mini-gemini-6603c50b9b43d044171d0854
Paper: Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models (2403.18814)
GitHub: https://github.com/dvlab-research/MiniGemini

🔥VILA - On Pre-training for VLMs
Models: Efficient-Large-Model/vila-on-pre-training-for-visual-language-models-65d8022a3a52cd9bcd62698e
Paper: VILA: On Pre-training for Visual Language Models (2312.07533)

Misc
👀 FeatUp: a framework for image features at any resolution: mhamilton723/FeatUp FeatUp: A Model-Agnostic Framework for Features at Any Resolution (2403.10516)
🍞ColBERTus Maximus, a ColBERT-style embedding model mixedbread-ai/mxbai-colbert-large-v1
🖌️Semantic Palette, a new drawing paradigm ironjr/SemanticPalette
🧑‍⚕️HistoGPT, a vision model that generates accurate pathology reports marr-peng-lab/histogpt https://www.medrxiv.org/content/10.1101/2024.03.15.24304211v1
posted an update 8 months ago
We've just published a detailed blog post on the creation of Cosmopedia dataset. We hope this will provide insights about generating synthetic data at scale for pre-training.
https://huggingface.co/blog/cosmopedia

Here are some key takeaways:
🎯 Prompt curation is crucial: we want to cover many topics with few duplicates.
📚 You can leverage various resources for diversity: different seed data, generation formats, and target audiences (see the sketch after this list).
⚙️ A good technical stack matters: scalable generation with tools like llm-swarm, plus fast model training and evaluation.
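As a tiny illustration of the diversity point, here is a hypothetical sketch of combining seed extracts with different audiences and formats (all strings below are made up; see the blog post for the actual prompt design):

import itertools
import random

seeds = [
    "Plate tectonics slowly reshapes the Earth's surface...",
    "Compound interest makes savings grow faster over time...",
]
audiences = ["middle school students", "college students", "professionals"]
formats = ["textbook chapter", "blog post", "short story"]

prompts = [
    f"Write a {fmt} aimed at {aud}, inspired by this extract:\n{seed}"
    for seed, aud, fmt in itertools.product(seeds, audiences, formats)
]
print(random.choice(prompts))  # 2 seeds x 3 audiences x 3 formats = 18 distinct prompts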

Have a good read!
reacted to m-ric's post with 🚀🔥❤️ 8 months ago
𝗨𝘀𝗶𝗻𝗴 𝗟𝗟𝗠-𝗮𝘀-𝗮-𝗷𝘂𝗱𝗴𝗲 🧑‍⚖️ 𝗳𝗼𝗿 𝗮𝗻 𝗮𝘂𝘁𝗼𝗺𝗮𝘁𝗲𝗱 𝗮𝗻𝗱 𝘃𝗲𝗿𝘀𝗮𝘁𝗶𝗹𝗲 𝗲𝘃𝗮𝗹𝘂𝗮𝘁𝗶𝗼𝗻

Evaluating LLM outputs is often hard, since many tasks require open-ended answers for which no deterministic metrics work: for instance, when asking a model to summarize a text, there could be hundreds of correct ways to do it. The most versatile way to grade these outputs is then human evaluation, but it is very time-consuming, thus costly.

🤔 Then 𝘄𝗵𝘆 𝗻𝗼𝘁 𝗮𝘀𝗸 𝗮𝗻𝗼𝘁𝗵𝗲𝗿 𝗟𝗟𝗠 𝘁𝗼 𝗱𝗼 𝘁𝗵𝗲 𝗲𝘃𝗮𝗹𝘂𝗮𝘁𝗶𝗼𝗻, by providing it relevant rating criteria? 👉 This is the idea behind LLM-as-a-judge.
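For a flavor of the pattern (not the notebook's exact recipe), here is a minimal hypothetical sketch; the model id, prompt wording, and rating scale are all assumptions:

from huggingface_hub import InferenceClient

client = InferenceClient("meta-llama/Meta-Llama-3-8B-Instruct")  # any capable instruct model

judge_template = (
    "You are a strict evaluator. Rate the answer to the question below from 1 (bad) "
    "to 4 (excellent) for correctness and relevance.\n"
    "Question: {question}\nAnswer: {answer}\n"
    "Reply with 'Rating: <number>' followed by a one-sentence justification."
)

verdict = client.text_generation(
    judge_template.format(question="What is the capital of France?", answer="Paris."),
    max_new_tokens=80,
)
print(verdict)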

⚙️ To implement an LLM judge correctly, you need a few tricks.
✅ So 𝗜'𝘃𝗲 𝗷𝘂𝘀𝘁 𝗽𝘂𝗯𝗹𝗶𝘀𝗵𝗲𝗱 𝗮 𝗻𝗲𝘄 𝗻𝗼𝘁𝗲𝗯𝗼𝗼𝗸 𝘀𝗵𝗼𝘄𝗶𝗻𝗴 𝗵𝗼𝘄 𝘁𝗼 𝗶𝗺𝗽𝗹𝗲𝗺𝗲𝗻𝘁 𝗶𝘁 𝗽𝗿𝗼𝗽𝗲𝗿𝗹𝘆 𝗶𝗻 𝗼𝘂𝗿 𝗛𝘂𝗴𝗴𝗶𝗻𝗴 𝗙𝗮𝗰𝗲 𝗖𝗼𝗼𝗸𝗯𝗼𝗼𝗸! (you can run it instantly in Google Colab)
➡️ 𝗟𝗟𝗠-𝗮𝘀-𝗮-𝗷𝘂𝗱𝗴𝗲 𝗰𝗼𝗼𝗸𝗯𝗼𝗼𝗸: https://huggingface.co/learn/cookbook/llm_judge

The Cookbook is a great collection of notebooks demonstrating recipes (thus the "cookbook") for common LLM use cases. I recommend you go take a look!
➡️ 𝗔𝗹𝗹 𝗰𝗼𝗼𝗸𝗯𝗼𝗼𝗸𝘀: https://huggingface.co/learn/cookbook/index

Thank you @MariaK for your support!
reacted to vishesh-t27's post with 🔥 8 months ago
Komodo-7B is here! Today we are releasing the base version of Komodo-7B along with the technical report.

Komodo-7B is a family of LLMs that consists of Komodo-7B-Base and Komodo-7B-Instruct.

Komodo-7B performs really well across multiple Indonesian languages, including Indonesian, Acehnese, Balinese, Banjarese, Buginese, Dayak Ngaju, Javanese, Lampungnese, Madurese, Minangkabau, Sundanese, and Toba Batak.

Our model outperforms various existing large language models including some multilingual models.

Technical Report: https://arxiv.org/abs/2403.09362

Base Model HuggingFace: Yellow-AI-NLP/komodo-7b-base
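A minimal loading sketch, assuming the model works with the standard transformers causal-LM API (the prompt and generation settings are illustrative; check the model card for specifics):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Yellow-AI-NLP/komodo-7b-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Candi Borobudur terletak di", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))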

Kudos to the team @louisowen6, @akanyaani & @biddwan. Paper: Komodo: A Linguistic Expedition into Indonesia's Regional Languages (2403.09362)
reacted to m-ric's post with 🔥 8 months ago
Interesting paper: 𝐆𝐚𝐋𝐨𝐫𝐞: 𝐭𝐫𝐚𝐢𝐧 𝟕𝐁 𝐦𝐨𝐝𝐞𝐥𝐬 𝐨𝐧 𝐜𝐨𝐧𝐬𝐮𝐦𝐞𝐫-𝐠𝐫𝐚𝐝𝐞 𝐆𝐏𝐔𝐬 💪
It's now possible to 𝙛𝙪𝙡𝙡𝙮 𝙥𝙧𝙚-𝙩𝙧𝙖𝙞𝙣 a 7B model on a consumer-grade GPU of 24Gb RAM, without any performance loss!

The memory usage of training models has always been an acute issue. For instance, full pre-training of a 7B model used to eat ~50 GB of RAM!

The common workarounds to reduce memory load are:
- split the model across multiple GPUs ("sharding")
- quantize models: encode weights on fewer bits

Another technique is to 𝙥𝙧𝙤𝙟𝙚𝙘𝙩 𝙩𝙝𝙚 𝙬𝙚𝙞𝙜𝙝𝙩 𝙢𝙖𝙩𝙧𝙞𝙭 𝙩𝙤 𝙡𝙤𝙬𝙚𝙧-𝙧𝙖𝙣𝙠 𝙨𝙥𝙖𝙘𝙚𝙨, (since sometimes the weights do not really vary on all dimensions): this can save a lot of space!
This low-rank projection can be done on adapters to preserve the original weights (go check out LoRA), but it still generally hurts the performance too much for pre-training.

➡️ Enter the authors of 𝘎𝘢𝘓𝘰𝘳𝘦: 𝘔𝘦𝘮𝘰𝘳𝘺-𝘌𝘧𝘧𝘪𝘤𝘪𝘦𝘯𝘵 𝘓𝘓𝘔 𝘛𝘳𝘢𝘪𝘯𝘪𝘯𝘨 𝘣𝘺 𝘎𝘳𝘢𝘥𝘪𝘦𝘯𝘵 𝘓𝘰𝘸-𝘙𝘢𝘯𝘬 𝘗𝘳𝘰𝘫𝘦𝘤𝘵𝘪𝘰𝘯. They gather (and prove) interesting insights:
⛔ The weight matrix does not reliably converge to lower ranks during training.
✅ But the gradient matrix does!

Based on these insights, 𝘁𝗵𝗲𝘆 𝗯𝘂𝗶𝗹𝗱 𝗚𝗮𝗟𝗼𝗿𝗲, which projects the gradients to lower ranks.
🗺️ 𝗚𝗿𝗲𝗮𝘁 𝗶𝗱𝗲𝗮: to leave the optimization free to explore more space, they periodically re-build the low-rank projection throughout the training (a nice illustration is in the paper).

🤝 This method can even be combined with previous ones such as 8-bit Adam (quantizing the optimizer states to 8-bit).
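Here is a toy sketch of the core mechanics (my own illustrative code, not the authors' implementation; rank, refresh interval, and learning rate are arbitrary):

import torch

def galore_update(weight, grad, proj, step, rank=8, refresh_every=200, lr=1e-3):
    # Periodically rebuild the low-rank projection from the gradient's SVD.
    if proj is None or step % refresh_every == 0:
        U, _, _ = torch.linalg.svd(grad, full_matrices=False)
        proj = U[:, :rank]                       # (m, rank)
    small_grad = proj.T @ grad                   # gradient in the low-rank space, (rank, n)
    # A real optimizer would keep its state (e.g. Adam moments) at this small size.
    weight -= lr * (proj @ small_grad)           # project the update back and apply it
    return weight, proj

# Toy usage on a random "layer".
W, P = torch.randn(512, 256), None
for step in range(3):
    G = torch.randn_like(W)                      # stand-in for a real gradient
    W, P = galore_update(W, G, P, step)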

➡️ 𝐑𝐞𝐬𝐮𝐥𝐭𝐬:
📉 Of course, huge reduction in memory footprint allowing the training on consumer-grade GPU (cf figure).
💪 No reduction in performance: this scales well up to 7B parameters (and has been independently confirmed since) ⇒ this is essential, as it confirms that the method is viable!

Read the full paper here: GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection (2403.03507)
reacted to chiphuyen's post with 🤗 8 months ago