Nicolas-BZRD

AI & ML interests

PhD Student | NLP – LLMs – Adapting models to real-world problems – Optimization

Recent Activity

Organizations

CroissantLLM · UTTER – Unified Transcription and Translation for Extended Reality · Diabolocom · EuroBert · EuroBERT

Nicolas-BZRD's activity

We are working on the next model, which covers all European languages. Training the previous model with a restricted number of languages helped us better understand the impact of their distribution during training and the curse of multilinguality while maximizing population coverage.

We also released the code base and look forward to seeing the community add more languages 🤗

ModernBERT is English-only. We achieve similar performance in English with our small model (which is slightly larger than ModernBERT) and better performance with our medium and large models. For multilingual tasks, we obtain superior results. However, since comparing against ModernBERT on multilingual data is less meaningful, we chose not to report those results. For math and code, the comparison is more relevant, so we included it. That said, you are right: we will add the results in the appendix.

New activity in EuroBERT/EuroBERT-610m about 17 hours ago
New activity in EuroBERT/EuroBERT-2.1B about 17 hours ago
reacted to tomaarsen's post with ❤️ 1 day ago
An assembly of 18 European companies, labs, and universities has banded together to launch 🇪🇺 EuroBERT! It's a state-of-the-art multilingual encoder covering 15 languages, designed to be finetuned for retrieval, classification, etc.

🇪🇺 15 Languages: English, French, German, Spanish, Chinese, Italian, Russian, Polish, Portuguese, Japanese, Vietnamese, Dutch, Arabic, Turkish, Hindi
3️⃣ 3 model sizes: 210M, 610M, and 2.1B parameters - very very useful sizes in my opinion
➡️ Sequence length of 8192 tokens! Nice to see these higher sequence lengths for encoders becoming more common.
⚙️ Architecture based on Llama, but with bi-directional (non-causal) attention to turn it into an encoder. Flash Attention 2 is supported.
🔥 A new Pareto frontier (stronger *and* smaller) for multilingual encoder models
📊 Evaluated against mDeBERTa, mGTE, XLM-RoBERTa for Retrieval, Classification, and Regression (after finetuning for each task separately): EuroBERT punches way above its weight.
📝 Detailed paper, incl. training data: FineWeb for English and CulturaX for multilingual data, The Stack v2 and Proof-Pile-2 for code.

Check out the release blogpost here: https://huggingface.co/blog/EuroBERT/release
* EuroBERT/EuroBERT-210m
* EuroBERT/EuroBERT-610m
* EuroBERT/EuroBERT-2.1B

The next step is for researchers to build upon the 3 EuroBERT base models and publish strong retrieval, zero-shot classification, etc. models for all to use. I'm very much looking forward to it!
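The key architectural change the post describes, swapping Llama's causal attention for bi-directional attention, is easy to see in isolation. Below is a minimal NumPy sketch (toy dimensions, my own illustration, not EuroBERT's actual implementation) contrasting the two mask patterns in scaled dot-product attention:

```python
import numpy as np

def attention_weights(q, k, mask):
    """Scaled dot-product attention weights under a boolean mask.

    mask[i, j] == True means query position i may attend to key position j.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)  # masked positions get zero weight
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)  # softmax over key positions

seq_len, d = 4, 8
rng = np.random.default_rng(0)
q = rng.normal(size=(seq_len, d))
k = rng.normal(size=(seq_len, d))

# Decoder-style (Llama): causal mask, token i sees only positions <= i
causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))
# Encoder-style (EuroBERT): full mask, every token sees every token
bidirectional = np.ones((seq_len, seq_len), dtype=bool)

w_causal = attention_weights(q, k, causal)
w_bidir = attention_weights(q, k, bidirectional)

# Under the causal mask the first token can only attend to itself,
# so its entire attention weight lands on position 0.
print(w_causal[0])  # → [1. 0. 0. 0.]
print(w_bidir[0])   # all four positions get nonzero weight
```

With the full mask, every token's representation is conditioned on the whole sequence, which is what makes the model usable as an encoder for retrieval and classification rather than next-token prediction.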
upvoted an article 2 days ago

Introducing EuroBERT: A High-Performance Multilingual Encoder Model

By EuroBERT and 3 others
published an article 2 days ago
updated a Space 5 days ago
published a Space 5 days ago