
Byerose


AI & ML interests

None yet

Recent Activity

reacted to Kseniase's post with ❤️ 3 days ago

10 Free Comprehensive Datasets for Supervised Fine-Tuning

The quality, size, and relevance of a dataset directly affect how well fine-tuning works and how the resulting model performs in real-world applications. With so many datasets available for different tasks, it can be hard to choose the most comprehensive one for your purposes. So today, we invite you to explore the top 10 free datasets for natural language processing and maths:

1. https://huggingface.co/datasets/fka/awesome-chatgpt-prompts offers a huge variety of prompts that can be used with ChatGPT. Over 700 models were trained on this dataset.
2. https://huggingface.co/datasets/HuggingFaceFW/fineweb from Hugging Face includes 15T tokens of cleaned and deduplicated English web data. It is suitable for LLM training, benchmarking, and model validation.
3. https://huggingface.co/datasets/HuggingFaceFW/fineweb-2 is another version of FineWeb, with high-quality pretraining data in over 1000 languages.
4. https://huggingface.co/datasets/O1-OPEN/OpenO1-SFT, with Chinese and English data, can be used for Chain-of-Thought activation.
5. https://huggingface.co/datasets/yahma/alpaca-cleaned is a curated version of the original Alpaca Dataset released by Stanford.
6. https://huggingface.co/datasets/lmsys/lmsys-chat-1m, with 1 million real-world conversations with 25 state-of-the-art LLMs, supports diverse use cases such as content moderation, safety benchmarks, and training instruction-following models.
7. https://huggingface.co/datasets/allenai/dolma from Allen AI includes 3T tokens from a diverse mix of web content, academic publications, code, books, and encyclopedic materials.

Math datasets:

1. https://huggingface.co/datasets/HuggingFaceTB/finemath consists of educational math content and has two versions: 34B tokens and 54B tokens.
2. https://huggingface.co/datasets/amphora/QwQ-LongCoT-130K is intended for training O1-like LLMs.
3. https://huggingface.co/datasets/openai/gsm8k is intended for training multi-step reasoning.

reacted to Kseniase's post with 👍 3 days ago


liked a model about 2 months ago

allenai/wildguard


Organizations

None yet

Byerose's activity

liked a model about 2 months ago

allenai/wildguard

Text Generation • Updated Jul 3, 2024 • 14.8k • 16

liked a model 2 months ago

cognitivecomputations/WizardLM-13B-Uncensored

Text Generation • Updated May 12, 2023 • 631 • 560

liked a Space 2 months ago

ChatReviewer

liked a model 2 months ago

mistralai/Mistral-7B-Instruct-v0.3

Text Generation • Updated Aug 21, 2024 • 4.13M • 1.22k

liked 4 models 5 months ago

mistralai/Codestral-22B-v0.1

Text Generation • Updated Jul 31, 2024 • 3.46M • 1.17k

deepseek-ai/DeepSeek-Coder-V2-Instruct

Text Generation • Updated Aug 21, 2024 • 153k • 520

THUDM/cogvlm2-llama3-chat-19B

Text Generation • Updated Sep 3, 2024 • 4.7k • 203

THUDM/cogvlm-chat-hf

Text Generation • Updated Dec 19, 2023 • 58.4k • 193

liked a dataset 8 months ago

bigcode/the-stack

Viewer • Updated Apr 13, 2023 • 546M • 5.49k • 755

liked 2 models 12 months ago

lmsys/vicuna-7b-v1.5

Text Generation • Updated Mar 13, 2024 • 388k • 319

martin-ha/toxic-comment-model

Text Classification • Updated May 6, 2022 • 543k • 61

liked 2 datasets 12 months ago

allenai/real-toxicity-prompts

Viewer • Updated Sep 30, 2022 • 99.4k • 894 • 63

Anthropic/hh-rlhf

Viewer • Updated May 26, 2023 • 169k • 8.85k • 1.24k

liked a Space almost 2 years ago

ChatGPT

liked 3 datasets about 2 years ago

ufldl-stanford/svhn

Viewer • Updated Aug 8, 2024 • 879k • 3.19k • 14

uoft-cs/cifar10

Viewer • Updated Jan 4, 2024 • 60k • 33.1k • 65

ILSVRC/imagenet-1k

Updated Jul 16, 2024 • 25.9k • 436