10 Free Comprehensive Datasets for Supervised Fine-Tuning
The quality, size, and relevance of a dataset directly impact the effectiveness of fine-tuning and how well the resulting model performs in real-world applications. With so many datasets available for different tasks, it can be challenging to choose the one that best suits your purposes.
So today, we invite you to explore the top 10 free datasets for natural language processing and math (short loading sketches follow each list below):
1. fka/awesome-chatgpt-prompts offers a large variety of prompts that can be used with ChatGPT. Over 700 models have been trained on this dataset.
2. HuggingFaceFW/fineweb from Hugging Face includes 15T tokens of cleaned and deduplicated English web data. It's suitable for LLM training, benchmarking, and model validation.
3. HuggingFaceFW/fineweb-2 is another version of FineWeb that provides high-quality pretraining data in over 1,000 languages.
4. O1-OPEN/OpenO1-SFT contains Chinese and English data and can be used for Chain-of-Thought activation.
5. yahma/alpaca-cleaned is a curated version of the original Alpaca Dataset released by Stanford.
6. lmsys/lmsys-chat-1m contains 1 million real-world conversations with 25 state-of-the-art LLMs and supports diverse use cases, such as content moderation, safety benchmarking, and training instruction-following models.
7. allenai/dolma from Allen AI includes 3T tokens from a diverse mix of web content, academic publications, code, books, and encyclopedic materials.
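To preview any of these corpora without downloading them in full, the Hugging Face datasets library supports streaming. Here is a minimal sketch using FineWeb; the "sample-10BT" config name and the "text" field are assumptions based on the dataset card, so adjust them if the card lists something different.

```python
from datasets import load_dataset

# Stream FineWeb instead of downloading the full 15T-token corpus to disk.
# "sample-10BT" is assumed to be a small sample config listed on the dataset card;
# drop the `name` argument to fall back to the default config.
fineweb = load_dataset(
    "HuggingFaceFW/fineweb",
    name="sample-10BT",
    split="train",
    streaming=True,
)

# Inspect a few records to check the text quality and the available fields.
for i, example in enumerate(fineweb):
    print(example["text"][:200])
    if i >= 2:
        break
```

The same pattern works for the other datasets in the list; only the repository ID, config name, and field names change.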
Math datasets:
1. HuggingFaceTB/finemath consists of educational math content and comes in two versions, with 34B and 54B tokens (see the loading sketch below).
From the dataset's release announcement:
Introducing FineMath: the best public math pre-training dataset with 50B+ tokens! HuggingFaceTB/finemath
Math remains challenging for LLMs, and training on FineMath yields considerable gains over other math datasets, especially on GSM8K and MATH.
We built the dataset by carefully extracting math data from Common Crawl, then iteratively filtering and recalling high-quality math pages with a classifier trained on synthetic annotations to identify mathematical reasoning and deduction.
We conducted a series of ablations comparing the performance of Llama-3.2-3B-Base after continued pre-training on FineMath and observed notable gains over the baseline model and over continued pre-training on other public math datasets.
We hope this helps advance the performance of LLMs on math and reasoning! We're also releasing all the ablation models as well as the evaluation code.
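As a rough sketch of how FineMath can be pulled in for continued pre-training experiments, the dataset can also be streamed with the datasets library. The config name "finemath-3plus" for the 34B-token version is an assumption based on the dataset card, so adjust it if the card lists different subsets.

```python
from datasets import load_dataset

# Stream the FineMath corpus; "finemath-3plus" is assumed to be the config name
# of the 34B-token subset -- check the dataset card if the name differs.
finemath = load_dataset(
    "HuggingFaceTB/finemath",
    name="finemath-3plus",
    split="train",
    streaming=True,
)

# Peek at one document to see the kind of math content it contains.
sample = next(iter(finemath))
print(sample["text"][:300])
```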