Article BM25 for Python: Achieving high performance while simplifying dependencies with *BM25S*⚡ By xhluca • about 15 hours ago • 21
Article Open-source embeddings and LLMs outperform Gemini and OpenAI for Web Navigation while being faster and cheaper By dhuynh95 • 5 days ago • 4
Article Training and Finetuning Embedding Models with Sentence Transformers v3 30 days ago • 103
Graph-enhanced RAG Collection using knowledge graphs in RAG for grounding LLM results • 18 items • Updated 5 days ago • 3
Arabic Matryoshka Embedding Models Collection A collection of advanced Arabic Matryoshka Embedding Models designed for efficient and high-performance Arabic NLP, available publicly on Hugging Face • 6 items • Updated about 22 hours ago • 2
GPL BEIR Datasets Collection Generative Pseudo Labeling training datasets for all domains in BEIR. • 15 items • Updated Apr 28 • 1
🦢SWIM-IR Dataset Collection 29 million Synthetic Wikipedia-based Multilingual Retrieval Training Pairs. • 4 items • Updated Apr 28 • 7
Leveraging LLMs for Synthesizing Training Data Across Many Languages in Multilingual Dense Retrieval Paper • 2311.05800 • Published Nov 10, 2023 • 3
miniMiracle dense retrievers Collection Low-footprint multilingual retrievers, all-minilm-* equivalent. • 3 items • Updated 18 days ago • 5
Article Introducing the Hugging Face Embedding Container for Amazon SageMaker 20 days ago • 9
Article Introducing NPC-Playground, a 3D playground to interact with LLM-powered NPCs 22 days ago • 12
Nomic Embed Vision Collection Vision Encoders aligned to Nomic Embed Text making Nomic Embed multimodal! • 2 items • Updated 21 days ago • 4
Hugging Face community’s Wikimedia datasets Collection Wikimedia datasets created by the Hugging Face community, not Wikimedia. Sorted by Wikimedia project. • 17 items • Updated 19 days ago • 6
Arabic NoRobots DPO Datasets Collection Our synthetic DPO datasets for Arabic NoRobots. • 4 items • Updated 28 days ago • 4
Article How to Fine-Tune Custom Embedding Models Using AutoTrain By abhishek • 27 days ago • 10
Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations Paper • 2405.18392 • Published 29 days ago • 12
sentence-transformers-from-synthetic-data Collection Example of using distilabel to generate synthetic triplet data for fine-tuning a Sentence Transformer model • 4 items • Updated 5 days ago • 20
C4AI Aya 23 Collection Aya 23 is an open weights research release of an instruction fine-tuned model with highly advanced multilingual capabilities. • 3 items • Updated May 23 • 37
INDUS: Effective and Efficient Language Models for Scientific Applications Paper • 2405.10725 • Published May 17 • 23
ParaNames 1.0: Creating an Entity Name Corpus for 400+ Languages using Wikidata Paper • 2405.09496 • Published May 15 • 3
MS MARCO Mined Triplets Collection These datasets contain MS MARCO Triplets gathered by mining hard negatives using various models. Each dataset has various subsets. • 14 items • Updated May 21 • 6
Parallel Sentences Datasets Collection These datasets all have "english" and "non_english" columns. They can be used to make embedding models multilingual. • 13 items • Updated 8 days ago • 4
Embedding Model Datasets Collection A curated subset of the datasets that work out of the box with Sentence Transformers: https://huggingface.co/datasets?other=sentence-transformers • 66 items • Updated 5 days ago • 37
Piccolo2: General Text Embedding with Multi-task Hybrid Loss Training Paper • 2405.06932 • Published May 11 • 15
NuNerZero - Zero Shot NER Collection The best compact Zero-Shot NER models with MIT license • 4 items • Updated 1 day ago • 14
Article Train Custom Models on Hugging Face Spaces with AutoTrain SpaceRunner By abhishek • May 9 • 7
Article ⚗️ 🧑🏼🌾 Let's grow some Domain Specific Datasets together By burtenshaw • Apr 29 • 27
Article 🧑⚖️ "Replacing Judges with Juries" using distilabel By alvarobartt • May 3 • 15
Article Jack of All Trades, Master of Some, a Multi-Purpose Transformer Agent Apr 22 • 75
🇫🇷 Cross-encoder rerankers Collection A collection of cross-encoder reranking models in French. • 31 items • Updated May 6 • 3
🇫🇷 Single-vector dense bi-encoders Collection A collection of single-vector dense representation models in French. • 16 items • Updated 25 days ago • 2
Llama3-ChatQA-1.5 Collection Llama3-ChatQA-1.5 models excel at conversational question answering (QA) and retrieval-augmented generation (RAG). • 6 items • Updated 12 days ago • 37
Article StarCoder2-Instruct: Fully Transparent and Permissive Self-Alignment for Code Generation Apr 29 • 70
Article 🦙⚗️ Using Llama3 and distilabel to build fine-tuning datasets By dvilasuero • 22 days ago • 63
Arctic Collection A collection of pre-trained dense-MoE Hybrid transformer models • 2 items • Updated Apr 24 • 20
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone Paper • 2404.14219 • Published Apr 22 • 239
Phi-3 Collection Phi-3 family of small language and multi-modal models. Language models are available in short- and long-context lengths. • 22 items • Updated 27 days ago • 345
LongEmbed: Extending Embedding Models for Long Context Retrieval Paper • 2404.12096 • Published Apr 18 • 2
Arctic-embed Collection A collection of text embedding models optimized for retrieval accuracy and efficiency • 5 items • Updated Apr 17 • 11
Article Binary and Scalar Embedding Quantization for Significantly Faster & Cheaper Retrieval Mar 22 • 43
Vector-io compatible Datasets Collection These datasets can be loaded into your vector database with a single-line bash command • 14 items • Updated Mar 29 • 3
GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer Paper • 2311.08526 • Published Nov 14, 2023 • 8