Ashish08's Collections
Pre-Training-Data-for-LLMs
Updated Aug 16
Open-source datasets that have been used to pre-train large language models
tiiuae/falcon-refinedweb • Updated Jun 20, 2023 • 968M • 24.8k • 814
togethercomputer/RedPajama-Data-1T • Updated Jun 17 • 1.73M • 1.56k • 1.06k
mikex86/stackoverflow-posts • Updated Aug 1, 2023 • 58.3M • 2.48k • 45