Elie Bakouch

eliebak

AI & ML interests

Training LLMs @ 🤗

Recent Activity

reacted to Kseniase's post with 🔥 about 10 hours ago
10 Free Comprehensive Datasets for Supervised Fine-Tuning

The quality, size, and relevance of a dataset directly affect the effectiveness of fine-tuning and the model's real-world performance. Among the numerous datasets for different tasks, it can be challenging to choose the one that best suits your purposes. So today, we invite you to explore the top 10 free datasets for natural language processing and math:

1. https://huggingface.co/datasets/fka/awesome-chatgpt-prompts offers a huge variety of prompts for use with ChatGPT. Over 700 models have been trained on this dataset.
2. https://huggingface.co/datasets/HuggingFaceFW/fineweb from Hugging Face includes 15T tokens of cleaned and deduplicated English web data. It is suitable for LLM training, benchmarking, and model validation.
3. https://huggingface.co/datasets/HuggingFaceFW/fineweb-2 is another version of FineWeb, extending high-quality pretraining data to over 1,000 languages.
4. https://huggingface.co/datasets/O1-OPEN/OpenO1-SFT, with Chinese and English data, can be used for chain-of-thought activation.
5. https://huggingface.co/datasets/yahma/alpaca-cleaned is a curated version of the original Alpaca dataset released by Stanford.
6. https://huggingface.co/datasets/lmsys/lmsys-chat-1m, with 1 million real-world conversations across 25 state-of-the-art LLMs, covers diverse use cases such as content moderation, safety benchmarks, and training instruction-following models.
7. https://huggingface.co/datasets/allenai/dolma from Allen AI includes 3T tokens from a diverse mix of web content, academic publications, code, books, and encyclopedic materials.

Math datasets:

1. https://huggingface.co/datasets/HuggingFaceTB/finemath consists of educational math content and comes in two versions: 34B tokens and 54B tokens.
2. https://huggingface.co/datasets/amphora/QwQ-LongCoT-130K for training O1-like LLMs.
3. https://huggingface.co/datasets/openai/gsm8k for training multi-step reasoning.
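As a quick illustration, here is a minimal sketch of loading one of the datasets above with the 🤗 datasets library (assuming `datasets` is installed; the column name used below is the one documented for yahma/alpaca-cleaned):

```python
# Minimal sketch: load one of the SFT datasets listed above from the Hub.
# Assumes `pip install datasets`; column names follow yahma/alpaca-cleaned.
from datasets import load_dataset

ds = load_dataset("yahma/alpaca-cleaned", split="train")
print(ds)                     # row count and column names
print(ds[0]["instruction"])   # inspect the first training example
```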
liked a model 4 days ago
deepseek-ai/DeepSeek-V3-Base

Articles

Organizations

Hugging Face, HuggingFaceBR4, Hugging Face H4, Blog-explorers, Hugging Face TB Research, huggingPartyParis, Nanotron Research, Hugging Face SMOL, MLX Community, HuggingFaceFW, LLHF, llmc, SLLHF, Argilla Warehouse, nltpt, smol-explorers, Open Science, Hugging Face Science, open/ acc

Posts 1

Wow, impressive 340B model by NVIDIA with a nice permissive license! 🚀 The technical report is full of insights and seems to use a learning rate schedule other than cosine, probably a variant of WSD (warmup-stable-decay). Hope to get more info on that! 👀 (A rough sketch of WSD follows the model link below.)

nvidia/nemotron-4-340b-666b7ebaf1b3867caf2f1911
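For context on the WSD mention above: a minimal sketch of a warmup-stable-decay schedule, assuming linear warmup and linear decay phases (the phase lengths and peak LR here are illustrative, not taken from the Nemotron report):

```python
# Hypothetical sketch of a WSD (warmup-stable-decay) schedule:
# linear warmup -> constant plateau -> linear decay.
# Phase boundaries and peak LR are illustrative, not from the report.
def wsd_lr(step: int, peak_lr: float = 3e-4,
           warmup: int = 1000, stable: int = 8000, decay: int = 1000) -> float:
    if step < warmup:                    # linear warmup from 0 to peak
        return peak_lr * step / warmup
    if step < warmup + stable:           # constant plateau at peak LR
        return peak_lr
    if step < warmup + stable + decay:   # linear decay back to 0
        done = step - warmup - stable
        return peak_lr * (1 - done / decay)
    return 0.0                           # after the schedule ends
```

One commonly cited appeal of WSD over cosine is that the constant plateau does not commit you to a total step count up front: decay can be started from any plateau checkpoint.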