smol-explorers


Recent Activity

smol-explorers's activity

anton-l
posted an update 1 day ago
Introducing 📐 FineMath: the best public math pre-training dataset with 50B+ tokens!
HuggingFaceTB/finemath

Math remains challenging for LLMs, and by training on FineMath we see considerable gains over other math datasets, especially on GSM8K and MATH.

We build the dataset by:
🛠️ carefully extracting math data from Common Crawl;
🔎 iteratively filtering and recalling high-quality math pages using a classifier trained on synthetic annotations to identify math reasoning and deduction.
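
As a rough illustration of the classifier-based filtering step, here is a minimal sketch. It assumes each page receives a numeric quality score and is kept only when the score meets a threshold. The scoring function `toy_math_score`, the marker list, and the threshold value are hypothetical stand-ins, not the classifier or settings used to build FineMath.

```python
from typing import Callable, Iterable


def filter_pages(
    pages: Iterable[str],
    score_fn: Callable[[str], float],
    threshold: float,
) -> list[str]:
    """Keep only the pages whose score meets the threshold."""
    return [page for page in pages if score_fn(page) >= threshold]


def toy_math_score(page: str) -> float:
    """Hypothetical stand-in for a trained classifier.

    Returns the fraction of whitespace-separated tokens that contain
    a math-like marker. A real pipeline would call a learned model here.
    """
    markers = ("=", "+", "solve", "proof", "theorem")
    tokens = page.lower().split()
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if any(m in token for m in markers))
    return hits / len(tokens)


pages = [
    "Solve 2x + 3 = 7 so x = 2",
    "Top ten celebrity haircuts this summer",
]

# The threshold of 0.3 is an arbitrary value chosen for this toy example.
kept = filter_pages(pages, toy_math_score, threshold=0.3)
print(kept)
```

In an iterative setup, the kept pages could be used to refine the scorer before the next filtering pass. That loop is left out here to keep the sketch short.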

We conducted a series of ablations comparing the performance of Llama-3.2-3B-Base after continued pre-training on FineMath and observed notable gains compared to the baseline model and other public math datasets.

We hope this helps advance the performance of LLMs on math and reasoning! 🚀
We're also releasing all the ablation models as well as the evaluation code.

HuggingFaceTB/finemath-6763fb8f71b6439b653482c2
thomwolf
posted an update 11 days ago
We are proud to announce HuggingFaceFW/fineweb-2: A sparkling update to HuggingFaceFW/fineweb with 1000s of 🗣️ languages.

We applied the same data-driven approach that led to SOTA English performance in 🍷 FineWeb to thousands of languages.

🥂 FineWeb2 has 8TB of compressed text data and outperforms other multilingual datasets in our experiments.

The dataset is released under the permissive 📜 ODC-By 1.0 license, and the 💻 code to reproduce it and our evaluations is public.

We will very soon announce a big community project, and are working on a 📝 blogpost walking you through the entire dataset creation process. Stay tuned!

In the meantime, come ask us questions in our chat space: HuggingFaceFW/discussion

H/t @guipenedo @hynky @lvwerra as well as @vsabolcec Bettina Messmer @negar-foroutan and @mjaggi
thomwolf
posted an update 14 days ago
thomwolf
posted an update 16 days ago
andito
posted an update 22 days ago
SmolVLM speeding locally on a laptop thanks to mlx-vlm and @Gradio! Try it with two lines:
pip install git+https://github.com/andimarafioti/mlx-vlm.git@stream-generate-fix
python -m mlx_vlm.chat_ui --model mlx-community/SmolVLM-Instruct-8bit

Gotta love the MLX community! Big thanks to @pcuenq and @prince_canuma!
andito
posted an update 23 days ago
Let's go! We are releasing SmolVLM, a smol 2B VLM built for on-device inference that outperforms all models at similar GPU RAM usage and token throughput.

- SmolVLM generates tokens 7.5 to 16 times faster than Qwen2-VL! 🤯
- Other models at this size crash a laptop, but SmolVLM comfortably generates 17 tokens/sec on a MacBook! 🚀
- SmolVLM can be fine-tuned on a Google Colab! Or process millions of documents with a consumer GPU!
- SmolVLM even outperforms larger models in video benchmarks, despite not even being trained on videos!
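
For readers who want to check throughput figures like these on their own hardware, here is a minimal sketch of how tokens per second and a relative speedup can be computed. The helper names `tokens_per_second` and `speedup` are hypothetical, and the token stream below is a stand-in for a real model's streaming output. This is not the benchmark code behind the numbers above.

```python
import time
from typing import Callable, Iterable


def tokens_per_second(
    generate: Callable[[], Iterable[str]],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Time one generation call and return tokens produced per second."""
    start = clock()
    n_tokens = sum(1 for _ in generate())  # consume the stream, counting tokens
    elapsed = clock() - start
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return n_tokens / elapsed


def speedup(fast_tps: float, slow_tps: float) -> float:
    """Ratio of two throughput measurements."""
    return fast_tps / slow_tps


# Deterministic demo: a fake clock reports that 34 tokens took 2 seconds.
fake_ticks = iter([0.0, 2.0])
rate = tokens_per_second(
    lambda: iter(["tok"] * 34),
    clock=lambda: next(fake_ticks),
)
print(rate)  # 17.0
```

With a real model, `generate` would wrap the model's streaming call and the default `time.perf_counter` clock would be used. Averaging over several runs gives a steadier number.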

Check out more!
Demo: HuggingFaceTB/SmolVLM
Blog: https://huggingface.co/blog/smolvlm
Model: HuggingFaceTB/SmolVLM-Instruct
Fine-tuning script: https://github.com/huggingface/smollm/blob/main/finetuning/Smol_VLM_FT.ipynb
loubnabnl
posted an update 26 days ago
Making SmolLM2 reproducible: open-sourcing our training & evaluation toolkit 🛠️ https://github.com/huggingface/smollm/

- Pre-training code with nanotron
- Evaluation suite with lighteval
- Synthetic data generation using distilabel (powers our new SFT dataset HuggingFaceTB/smoltalk)
- Post-training scripts with TRL & the alignment handbook
- On-device tools with llama.cpp for summarization, rewriting & agents

Apache 2.0 licensed. V2 pre-training data mix coming soon!

Which other tools should we add next?
thomwolf
posted an update 26 days ago
thomwolf
posted an update about 1 month ago
thomwolf
posted an update about 2 months ago
Parents in the 1990s: Teach the kids to code
Parents now: Teach the kids to fix the code when it starts walking around 🤖✨
plaguss
posted an update 3 months ago
andito
posted an update 3 months ago
Hugging Face presents FineVideo 🎥! Unlocking the next generation of video understanding 🚀

🤯 3400 hours of annotated Creative Commons videos with rich character descriptions, scene splits, mood, and content descriptions per scene as well as QA pairs.
🔥 @mfarre processed over 2M YouTube-CC videos to make this incredibly powerful selection.

Very psyched to fine-tune idefics on this dataset. ⚡️
Explore the videos: HuggingFaceFV/FineVideo-Explorer
andito
posted an update 4 months ago
🚀 Introducing Hugging Face's Multilingual Speech-to-Speech! 🎤
💬 Our modular, cross-platform pipeline to run GPT4o-like experiences on device can now seamlessly switch languages mid-conversation with an imperceptible 100ms delay.

🌟 Building on an amazing early reception with 2600 stars on GitHub 🌟
🚀 We are expanding the library to support multiple languages
🔥 Try it out with a flag: --language fr
🤯 Or don't set the flag and let the system detect the language
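
The flag-or-detect behaviour can be pictured with a small sketch. Only the `--language` flag name comes from the post. The `"auto"` default, the `build_parser` and `resolve_language` helpers, and the way a detected language is passed in are assumptions made for illustration, not the library's actual command-line interface.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a hypothetical launcher parser with an optional language flag."""
    parser = argparse.ArgumentParser(
        description="Hypothetical speech-to-speech launcher"
    )
    parser.add_argument(
        "--language",
        default="auto",  # assumed sentinel meaning "detect the language"
        help="Language code such as 'fr'; omit to let the system detect it.",
    )
    return parser


def resolve_language(flag_value: str, detected: str) -> str:
    """Use the explicit flag when given, otherwise the detected language."""
    return detected if flag_value == "auto" else flag_value


# Explicit flag: the user's choice wins over detection.
explicit = build_parser().parse_args(["--language", "fr"])
print(resolve_language(explicit.language, detected="en"))  # fr

# No flag: fall back to whatever the (hypothetical) detector reported.
implicit = build_parser().parse_args([])
print(resolve_language(implicit.language, detected="en"))  # en
```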

💡 What feature should we add next?