Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models Mar 20 • 67
Introducing IDEFICS: An Open Reproduction of State-of-the-art Visual Language Model Aug 22, 2023 • 27
Huggy Lingo: Using Machine Learning to Improve Language Metadata on the Hugging Face Hub Aug 2, 2023 • 1
Article Let's make a generation of amazing image generation models By burtenshaw • 1 day ago • 28
Article Model2Vec: Distill a Small Fast Model from any Sentence Transformer By Pringled • Oct 14 • 56
UnifiedCrawl: Aggregated Common Crawl for Affordable Adaptation of LLMs on Low-Resource Languages Paper • 2411.14343 • Published 6 days ago • 7
Multimodal Autoregressive Pre-training of Large Vision Encoders Paper • 2411.14402 • Published 6 days ago • 36
Tulu 3 Datasets Collection All datasets released with Tulu 3 -- state of the art open post-training recipes. • 32 items • Updated 6 days ago • 43
Tulu 3 Models Collection All models released with Tulu 3 -- state of the art open post-training recipes. • 7 items • Updated 4 days ago • 24
Interactive Medical Image Segmentation: A Benchmark Dataset and Baseline Paper • 2411.12814 • Published 8 days ago • 20
Article Introducing Observers: AI Observability with Hugging Face datasets through a lightweight SDK By davidberenstein1957 • 6 days ago • 30
OpenScholar_V1 Collection The set of models, index, data associated with the paper "OpenScholar: Synthesizing Scientific Literature with Retrieval-Augmented LMs". • 8 items • Updated 6 days ago • 26
RedPajama: an Open Dataset for Training Large Language Models Paper • 2411.12372 • Published 8 days ago • 47
LLaVA-o1: Let Vision Language Models Reason Step-by-Step Paper • 2411.10440 • Published 12 days ago • 102
Multilingual Pretraining Using a Large Corpus Machine-Translated from a Single Source Language Paper • 2410.23956 • Published 27 days ago • 1
SWEb: A Large Web Dataset for the Scandinavian Languages Paper • 2410.04456 • Published Oct 6 • 1
AstroMLab 3: Achieving GPT-4o Level Performance in Astronomy with a Specialized 8B-Parameter Large Language Model Paper • 2411.09012 • Published 14 days ago • 1
Are Large Language Model-based Evaluators the Solution to Scaling Up Multilingual Evaluation? Paper • 2309.07462 • Published Sep 14, 2023 • 4
Article Releasing the largest multilingual open pretraining dataset By Pclanglais • 14 days ago • 95
Can Models Help Us Create Better Models? Evaluating LLMs as Data Scientists Paper • 2410.23331 • Published 28 days ago • 7