Byerose
1 follower · 0 following
byerose
AI & ML interests
None yet
Recent Activity
reacted to Kseniase's post with ❤️ 3 days ago
10 Free Comprehensive Datasets for Supervised Fine-Tuning

The quality, size, and relevance of a dataset directly affect the effectiveness of fine-tuning and the model's real-world applications. Among the numerous datasets for different tasks, it can be challenging to choose the most comprehensive one for your purposes. So today, we invite you to explore the top 10 free datasets for natural language processing and maths:

1. https://huggingface.co/datasets/fka/awesome-chatgpt-prompts offers a wide variety of prompts that can be used with ChatGPT. Over 700 models were trained on this dataset.
2. https://huggingface.co/datasets/HuggingFaceFW/fineweb from Hugging Face includes 15T tokens of cleaned and deduplicated English web data. It's suitable for LLM training, benchmarking, and model validation.
3. https://huggingface.co/datasets/HuggingFaceFW/fineweb-2 is another version of FineWeb, with high-quality pretraining data for over 1000 languages.
4. https://huggingface.co/datasets/O1-OPEN/OpenO1-SFT contains Chinese and English data and can be used for Chain-of-Thought activation.
5. https://huggingface.co/datasets/yahma/alpaca-cleaned is a curated version of the original Alpaca Dataset released by Stanford.
6. https://huggingface.co/datasets/lmsys/lmsys-chat-1m contains 1 million real-world conversations with 25 state-of-the-art LLMs and supports diverse use cases, such as content moderation, safety benchmarks, and training instruction-following models.
7. https://huggingface.co/datasets/allenai/dolma from Allen AI includes 3T tokens from a diverse mix of web content, academic publications, code, books, and encyclopedic materials.

Math datasets:

1. https://huggingface.co/datasets/HuggingFaceTB/finemath consists of educational math content and has two versions: 34B tokens and 54B tokens.
2. https://huggingface.co/datasets/amphora/QwQ-LongCoT-130K is intended for training O1-like LLMs.
3. https://huggingface.co/datasets/openai/gsm8k is intended for training multi-step reasoning.
reacted to Kseniase's post with 👍 3 days ago
liked a model about 2 months ago: allenai/wildguard
Organizations
None yet
Byerose's activity
liked a model about 2 months ago
allenai/wildguard • Text Generation • Updated Jul 3, 2024 • 14.8k • 16
liked a model 2 months ago
cognitivecomputations/WizardLM-13B-Uncensored • Text Generation • Updated May 12, 2023 • 631 • 560
liked a Space 2 months ago
💩 ChatReviewer • Running • 108
liked a model 2 months ago
mistralai/Mistral-7B-Instruct-v0.3 • Text Generation • Updated Aug 21, 2024 • 4.13M • 1.22k
liked 4 models 5 months ago
mistralai/Codestral-22B-v0.1 • Text Generation • Updated Jul 31, 2024 • 3.46M • 1.17k
deepseek-ai/DeepSeek-Coder-V2-Instruct • Text Generation • Updated Aug 21, 2024 • 153k • 520
THUDM/cogvlm2-llama3-chat-19B • Text Generation • Updated Sep 3, 2024 • 4.7k • 203
THUDM/cogvlm-chat-hf • Text Generation • Updated Dec 19, 2023 • 58.4k • 193
liked a dataset 8 months ago
bigcode/the-stack • Viewer • Updated Apr 13, 2023 • 546M • 5.49k • 755
liked 2 models 12 months ago
lmsys/vicuna-7b-v1.5 • Text Generation • Updated Mar 13, 2024 • 388k • 319
martin-ha/toxic-comment-model • Text Classification • Updated May 6, 2022 • 543k • 61
liked 2 datasets 12 months ago
allenai/real-toxicity-prompts • Viewer • Updated Sep 30, 2022 • 99.4k • 894 • 63
Anthropic/hh-rlhf • Viewer • Updated May 26, 2023 • 169k • 8.85k • 1.24k
liked a Space almost 2 years ago
📊 ChatGPT • Runtime error • 139
liked 3 datasets about 2 years ago
ufldl-stanford/svhn • Viewer • Updated Aug 8, 2024 • 879k • 3.19k • 14
uoft-cs/cifar10 • Viewer • Updated Jan 4, 2024 • 60k • 33.1k • 65
ILSVRC/imagenet-1k • Updated Jul 16, 2024 • 25.9k • 436