Daniel van Strien PRO

davanstrien

https://danielvanstrien.xyz/

AI & ML interests

Machine Learning Librarian

Recent Activity

Reacted to andito's post with 🔥 about 3 hours ago

Let's go! We are releasing SmolVLM, a smol 2B VLM built for on-device inference that outperforms all models at similar GPU RAM usage and token throughput.
- SmolVLM generates tokens 7.5 to 16 times faster than Qwen2-VL! 🤯
- Other models at this size crash a laptop, but SmolVLM comfortably generates 17 tokens/sec on a MacBook! 🚀
- SmolVLM can be fine-tuned on a Google Colab! Or process millions of documents with a consumer GPU!
- SmolVLM even outperforms larger models in video benchmarks, despite not even being trained on videos!
Check out more!
Demo: https://huggingface.co/spaces/HuggingFaceTB/SmolVLM
Blog: https://huggingface.co/blog/smolvlm
Model: https://huggingface.co/HuggingFaceTB/SmolVLM-Instruct
Fine-tuning script: https://github.com/huggingface/smollm/blob/main/finetuning/Smol_VLM_FT.ipynb

liked a dataset about 3 hours ago

IGNF/PureForest

Reacted to nataliaElv's post with 👀 about 6 hours ago

Would you like to get a high-quality dataset to pre-train LLMs in your language? 🌏 At Hugging Face we're preparing a collaborative annotation effort to build an open-source multilingual dataset as part of the Data is Better Together initiative. Follow the link below, check if your language is listed and sign up to be a Language Lead! https://forms.gle/s9nGajBh6Pb9G72J6

Articles

Let’s make a generation of amazing image generation models

Share your open ML datasets on Hugging Face Hub!

Scaling AI-based Data Processing with Hugging Face + Dask

Introducing Synthetic Data Workshop: Your Gateway to Easy Synthetic Dataset Creation

Data Is Better Together: A Look Back and Forward

Synthetic dataset generation techniques: generating custom sentence similarity data

Synthetic dataset generation techniques: Self-Instruct

Can we create pedagogically valuable multi-turn synthetic datasets from Cosmopedia?

Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models

Data is better together

Extracting Insights from Model Cards Using Open Large Language Models

Introducing IDEFICS: An Open Reproduction of State-of-the-art Visual Language Model

Huggy Lingo: Using Machine Learning to Improve Language Metadata on the Hugging Face Hub

The Hugging Face Hub for Galleries, Libraries, Archives and Museums

Introducing BERTopic Integration with Hugging Face Hub

Jupyter X Hugging Face

Image search with 🤗 datasets

davanstrien's activity

upvoted an article 1 day ago

Let’s make a generation of amazing image generation models

Article • 1 day ago • 28

upvoted an article 2 days ago

Model2Vec: Distill a Small Fast Model from any Sentence Transformer

Article • Oct 14 • 56

upvoted a collection 5 days ago

Models for dataset curation

8 items • Updated 3 days ago • 17

upvoted 2 papers 5 days ago

UnifiedCrawl: Aggregated Common Crawl for Affordable Adaptation of LLMs on Low-Resource Languages

Paper • 2411.14343 • Published 6 days ago • 7

Multimodal Autoregressive Pre-training of Large Vision Encoders

Paper • 2411.14402 • Published 6 days ago • 36

upvoted 2 collections 6 days ago

Tulu 3 Datasets

All datasets released with Tulu 3 -- state of the art open post-training recipes. • 32 items • Updated 6 days ago • 43

Tulu 3 Models

All models released with Tulu 3 -- state of the art open post-training recipes. • 7 items • Updated 4 days ago • 24

upvoted a paper 6 days ago

Interactive Medical Image Segmentation: A Benchmark Dataset and Baseline

Paper • 2411.12814 • Published 8 days ago • 20

upvoted an article 6 days ago

Introducing Observers: AI Observability with Hugging Face datasets through a lightweight SDK

Article • 6 days ago • 30

upvoted a collection 7 days ago

OpenScholar_V1

The set of models, index, data associated with the paper "OpenScholar: Synthesizing Scientific Literature with Retrieval-Augmented LMs". • 8 items • Updated 6 days ago • 26

upvoted a paper 7 days ago

RedPajama: an Open Dataset for Training Large Language Models

Paper • 2411.12372 • Published 8 days ago • 47

upvoted a paper 9 days ago

LLaVA-o1: Let Vision Language Models Reason Step-by-Step

Paper • 2411.10440 • Published 12 days ago • 102

upvoted 4 papers 10 days ago

Multilingual Pretraining Using a Large Corpus Machine-Translated from a Single Source Language

Paper • 2410.23956 • Published 27 days ago • 1

SWEb: A Large Web Dataset for the Scandinavian Languages

Paper • 2410.04456 • Published Oct 6 • 1

AstroMLab 3: Achieving GPT-4o Level Performance in Astronomy with a Specialized 8B-Parameter Large Language Model

Paper • 2411.09012 • Published 14 days ago • 1

Are Large Language Model-based Evaluators the Solution to Scaling Up Multilingual Evaluation?

Paper • 2309.07462 • Published Sep 14, 2023 • 4

upvoted an article 14 days ago

Releasing the largest multilingual open pretraining dataset

Article • 14 days ago • 95

upvoted a collection 18 days ago

Dataset Exploration

3 items • Updated 17 days ago • 4

upvoted an article 18 days ago

Inference Endpoints Changelog 🚀

Article • Oct 11 • 18

upvoted a paper 26 days ago

Can Models Help Us Create Better Models? Evaluating LLMs as Data Scientists

Paper • 2410.23331 • Published 28 days ago • 7