Nicolas-BZRD

AI & ML interests

PhD Student | NLP – LLMs – Adapting models to real-world problems – Optimization

Recent Activity

Organizations

CroissantLLM · UTTER – Unified Transcription and Translation for Extended Reality · Diabolocom · EuroBert · EuroBERT

Nicolas-BZRD's activity

We are working on the next model, which covers all European languages. Training the previous model with a restricted number of languages helped us better understand the impact of their distribution during training and the curse of multilinguality while maximizing population coverage.

We also released the code base and look forward to seeing the community add more languages 🤗

ModernBERT is English-only. We achieve similar performance in English with our small model (which is slightly larger than ModernBERT) and better performance with our medium and large models. For multilingual tasks, we obtain superior results. However, since comparing against ModernBERT on multilingual data is less meaningful, we chose not to report those results. For math and code, the comparison is more relevant, so we included it. That said, you are right: we will add the results in the appendix.

New activity in EuroBERT/EuroBERT-610m about 17 hours ago
New activity in EuroBERT/EuroBERT-2.1B about 17 hours ago
reacted to tomaarsen's post with ❤️ 1 day ago
An assembly of 18 European companies, labs, and universities has banded together to launch 🇪🇺 EuroBERT! It's a state-of-the-art multilingual encoder covering 15 languages, designed to be finetuned for retrieval, classification, etc.

🇪🇺 15 Languages: English, French, German, Spanish, Chinese, Italian, Russian, Polish, Portuguese, Japanese, Vietnamese, Dutch, Arabic, Turkish, Hindi
3️⃣ 3 model sizes: 210M, 610M, and 2.1B parameters - very very useful sizes in my opinion
➡️ Sequence length of 8192 tokens! Nice to see these higher sequence lengths for encoders becoming more common.
⚙️ Architecture based on Llama, but with bi-directional (non-causal) attention to turn it into an encoder. Flash Attention 2 is supported.
🔥 A new Pareto frontier (stronger *and* smaller) for multilingual encoder models
📊 Evaluated against mDeBERTa, mGTE, XLM-RoBERTa for Retrieval, Classification, and Regression (after finetuning for each task separately): EuroBERT punches way above its weight.
📝 Detailed paper, incl. training data: FineWeb for English and CulturaX for multilingual data, The Stack v2 and Proof-Pile-2 for code.

Check out the release blogpost here: https://huggingface.co/blog/EuroBERT/release
* EuroBERT/EuroBERT-210m
* EuroBERT/EuroBERT-610m
* EuroBERT/EuroBERT-2.1B

The next step is for researchers to build upon the 3 EuroBERT base models and publish strong retrieval, zero-shot classification, etc. models for all to use. I'm very much looking forward to it!
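The key architectural change the post describes, swapping Llama's causal attention for bi-directional attention, is easy to see in isolation. Below is a minimal NumPy sketch (toy dimensions, my own illustration, not EuroBERT's actual implementation) contrasting the two mask patterns in scaled dot-product attention:

```python
import numpy as np

def attention_weights(q, k, mask):
    """Scaled dot-product attention weights under a boolean mask.

    mask[i, j] == True means query position i may attend to key position j.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)  # masked positions get zero weight
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)  # softmax over key positions

seq_len, d = 4, 8
rng = np.random.default_rng(0)
q = rng.normal(size=(seq_len, d))
k = rng.normal(size=(seq_len, d))

# Decoder-style (Llama): causal mask, token i sees only positions <= i
causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))
# Encoder-style (EuroBERT): full mask, every token sees every token
bidirectional = np.ones((seq_len, seq_len), dtype=bool)

w_causal = attention_weights(q, k, causal)
w_bidir = attention_weights(q, k, bidirectional)

# Under the causal mask the first token can only attend to itself,
# so its entire attention weight lands on position 0.
print(w_causal[0])  # → [1. 0. 0. 0.]
print(w_bidir[0])   # all four positions get nonzero weight
```

With the full mask, every token's representation is conditioned on the whole sequence, which is what makes the model usable as an encoder for retrieval and classification rather than next-token prediction.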
upvoted an article 2 days ago

Introducing EuroBERT: A High-Performance Multilingual Encoder Model

By EuroBERT and 3 others
published an article 2 days ago
updated a Space 5 days ago
published a Space 5 days ago