Tom Aarsen

tomaarsen

https://linkedin.com/in/tomaarsen

AI & ML interests

NLP: text embeddings, information retrieval, named entity recognition, few-shot text classification

Recent Activity

liked a model about 7 hours ago

mrm8488/ModernBERT-base-ft-fineweb-edu-annotations

new activity about 15 hours ago

answerdotai/ModernBERT-base: Error: RuntimeError: Failed to import transformers.models.modernbert.modeling_modernbert because of the following error (look up to see its traceback): Windows not yet supported for torch.compile

liked a model about 17 hours ago

sujet-ai/Fin-ModernBERT-RAG-embed-base

Articles

Binary and Scalar Embedding Quantization for Significantly Faster & Cheaper Retrieval

Mar 22, 2024 • 69

🪆 Introduction to Matryoshka Embedding Models

Feb 23, 2024 • 63

SetFitABSA: Few-Shot Aspect Based Sentiment Analysis using SetFit

Dec 6, 2023 • 6

🕳️ Attention Sinks in LLMs for endless fluency

Oct 9, 2023 • 7

tomaarsen's activity

upvoted an article 3 days ago

Fine-tune ModernBERT for text classification using synthetic data

Article • 4 days ago • 16

upvoted a collection 6 days ago

Granite 3.1 Language Models

Collection

A series of language models with 128K context length, trained by IBM and licensed under the Apache 2.0 license. • 8 items • Updated 16 days ago • 43

upvoted a paper 13 days ago

Spectrum: Targeted Training on Signal to Noise Ratio

Paper • 2406.06623 • Published Jun 7, 2024 • 12

upvoted a paper 14 days ago

Qwen2.5 Technical Report

Paper • 2412.15115 • Published 14 days ago • 334

upvoted an article 14 days ago

Use Models from the Hugging Face Hub in LM Studio

Article • Nov 28, 2024 • 127

upvoted a paper 14 days ago

Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference

Paper • 2412.13663 • Published 16 days ago • 116

upvoted a collection 14 days ago

ModernBERT

Collection

Bringing BERT into modernity via both architecture changes and scaling • 3 items • Updated 14 days ago • 111

upvoted an article 28 days ago

Building a Local Vector Database Index with Annoy and Sentence Transformers

Article • 28 days ago • 3

upvoted an article 29 days ago

🐺🐦‍⬛ LLM Comparison/Test: 25 SOTA LLMs (including QwQ) through 59 MMLU-Pro CS benchmark runs

Article • 29 days ago • 74

upvoted an article 30 days ago

Accelerating Embedding & Reranking Models on AMD Using Infinity

Article • about 1 month ago • 4

upvoted an article about 1 month ago

EuroLLM-9B

Article • Dec 2, 2024 • 105

upvoted a paper about 1 month ago

A Flexible Large Language Models Guardrail Development Methodology Applied to Off-Topic Prompt Detection

Paper • 2411.12946 • Published Nov 20, 2024 • 20

upvoted a collection about 1 month ago

Models for dataset curation

Collection

9 items • Updated 28 days ago • 17

upvoted an article about 1 month ago

Introducing Observers: AI Observability with Hugging Face datasets through a lightweight SDK

Article • Nov 21, 2024 • 35

upvoted a paper about 1 month ago

Drowning in Documents: Consequences of Scaling Reranker Inference

Paper • 2411.11767 • Published Nov 18, 2024 • 17

upvoted an article about 1 month ago

Halo: Open Source Health Tracking with Wearables

Article • Nov 19, 2024 • 99

upvoted an article about 2 months ago

Releasing the largest multilingual open pretraining dataset

Article • Nov 13, 2024 • 98

upvoted a collection about 2 months ago

Training with Prompts

Collection

See the Training with Prompts documentation for more details: https://sbert.net/examples/training/prompts/README.html • 5 items • Updated Nov 7, 2024 • 3

upvoted an article about 2 months ago

Releasing Common Corpus: the largest public domain dataset for training LLMs

Article • Mar 20, 2024 • 18

upvoted a collection 2 months ago

Model2Vec base models

Collection

These are the Minishlab Model2Vec base models. Load them and use them with model2vec (https://github.com/MinishLab/model2vec) or sentence-transformers • 7 items • Updated 19 days ago • 8

Articles

Finally, a Replacement for BERT: Introducing ModernBERT

Welcome Gemma 2 - Google's new open LLM

Training and Finetuning Embedding Models with Sentence Transformers v3

Blazing Fast SetFit Inference with 🤗 Optimum Intel on Xeon
