Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models Mar 20 • 66
Introducing IDEFICS: An Open Reproduction of State-of-the-art Visual Language Model Aug 22, 2023 • 27
huggingface.co/DIBT is dead! Long live https://huggingface.co/data-is-better-together! We're working on some very cool projects, so we're doing a bit of tidying of the Data is Better Together Hub org 🤓
Excited to see my weird davanstrien/ufo-ColPali dataset featured in a video by @sabrinaesaquino! The video covers using ColPali with Binary Quantization in Qdrant to accelerate retrieval: a 2x speed-up with no performance drop in results 🛸
Video: https://youtu.be/_A90A-grwIc?si=oB3JAhJG8VQUZGLz
Blog post: https://danielvanstrien.xyz/posts/post-with-code/colpali-qdrant/2024-10-02_using_colpali_with_qdrant.html
synthetic-data-generation-demos
A collection of demos for various approaches to synthetic data generation
Runtime error • 8 • 👀 Genstruct 7B
Running on Zero • 84 • 🐠 Instruction Synthesizer
Running on Zero • 69 • 🐦⬛ Magpie
Running on Zero • 7 • 💬 Bonito
sentence-transformers-from-synthetic-data
Example of using distilabel to generate synthetic triplets data for fine-tuning a Sentence Transformer model
bigcode/self-oss-instruct-sc2-exec-filter-50k Viewer • Updated 19 days ago • 50.7k • 330 • 88
davanstrien/similarity-dataset-sc2-8b Viewer • Updated May 30 • 2.32k • 88 • 6
davanstrien/code-prompt-similarity-model Sentence Similarity • Updated May 29 • 25 • 6
davanstrien/abstract-wiki Viewer • Updated Jun 11 • 5k • 50 • 1
davanstrien/fineweb-edu-llama3-annotations-sample-5-ratings-100-raw Viewer • Updated 6 days ago • 100 • 25
davanstrien/fineweb-edu-llama3-annotations-pairs-data-sample-ranked-raw Viewer • Updated 9 days ago • 248 • 19