CamemBERT(a)-v2: A Smarter French Language Model Aged to Perfection

CamemBERTv2 is a French language model pretrained on a corpus of 275B tokens of French text. It is the second version of the CamemBERT model, which is based on the RoBERTa architecture. CamemBERTv2 is trained with the Masked Language Modeling (MLM) objective at a 40% mask rate for 3 epochs on 32 H100 GPUs. The training dataset combines French OSCAR dumps from the CulturaX Project, French scientific documents from HALvest, and French Wikipedia.
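To make the 40% mask rate concrete, here is a minimal sketch of the position-selection step of MLM in plain Python. It is an illustration only, not the training code: the `[MASK]` string is a placeholder, the function names are hypothetical, and every selected token is simply replaced by the mask token, since the text does not say which replacement strategy was used.

```python
import random

MASK_RATE = 0.40  # mask rate stated in the text


def choose_mask_positions(num_tokens, mask_rate=MASK_RATE, seed=None):
    """Pick round(num_tokens * mask_rate) distinct positions to mask."""
    rng = random.Random(seed)
    num_to_mask = round(num_tokens * mask_rate)
    return sorted(rng.sample(range(num_tokens), num_to_mask))


def apply_mask(tokens, positions, mask_token="[MASK]"):
    """Replace the chosen positions with the mask token.

    Returns (masked_tokens, labels); labels hold the original token at
    masked positions and None elsewhere, so the loss would only be
    computed on masked positions.
    """
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i in positions:
        labels[i] = tokens[i]
        masked[i] = mask_token  # simplification: always use the mask token
    return masked, labels


tokens = "le fromage est affiné pendant trois semaines en cave humide".split()
positions = choose_mask_positions(len(tokens), seed=0)
masked, labels = apply_mask(tokens, positions)
print(masked)
```

With 10 input tokens, 4 of them are masked, which is a noticeably higher share than the 15% commonly used in BERT-style pretraining.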

The model is a drop-in replacement for the original CamemBERT model. Note that the new tokenizer differs from the original CamemBERT tokenizer, so the model must be used with Fast Tokenizers. It works with CamembertTokenizerFast from the transformers library, even though the original CamembertTokenizer was sentencepiece-based.
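Since the model depends on a fast tokenizer, a small guard can catch an accidental fallback to a slow one. The sketch below assumes the Hub id `almanach/camembertv2-base` (not stated in the text) and uses hypothetical helper names; `AutoTokenizer.from_pretrained` and the `is_fast` attribute are part of the transformers library.

```python
MODEL_ID = "almanach/camembertv2-base"  # assumed Hub id, not given in the text


def require_fast(tokenizer):
    """Return the tokenizer unchanged, or raise if it is not a fast tokenizer."""
    if not getattr(tokenizer, "is_fast", False):
        raise TypeError(
            "CamemBERTv2 needs a fast tokenizer; got "
            f"{type(tokenizer).__name__}"
        )
    return tokenizer


def load_fast_tokenizer(model_id=MODEL_ID):
    """Load the tokenizer from the Hub and verify that it is a fast one."""
    # Imported lazily so the guard above can be used without transformers installed.
    from transformers import AutoTokenizer

    return require_fast(AutoTokenizer.from_pretrained(model_id, use_fast=True))
```

Calling `load_fast_tokenizer()` needs network access to the Hub. The guard is useful because `use_fast=True` is a preference rather than a guarantee.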

Model Checkpoints

This repository contains all intermediate model checkpoints, along with the corresponding checkpoints in TensorFlow (TF) and PyTorch (PT) formats, structured as follows:

├── checkpoints/
│   ├── iter_ckpt_rank_XX/ # Contains all iterator checkpoints from a specific rank
│   ├── summaries/ # Tensorboard logs
│   ├── ckpt-YYYYY.data-00000-of-00001
│   ├── ckpt-YYYYY.index
├── post/
│   ├── ckpt-YYYYY/
│   │   ├── pt/
│   │   │   ├── config.json
│   │   │   ├── pytorch_model.bin
│   │   │   ├── special_tokens_map.json
│   │   │   ├── tokenizer.json
│   │   │   ├── tokenizer_config.json
│   │   ├── tf/
│   │   │   ├── ...
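
Given this layout, one intermediate PyTorch checkpoint could be loaded through the `subfolder` argument of `from_pretrained`. This is a sketch under assumptions: the repository id `almanach/camembertv2-base-ckpts` is taken from the model page, the helper names are hypothetical, and the step is passed as a string because the exact form of `YYYYY` is not specified.

```python
REPO_ID = "almanach/camembertv2-base-ckpts"  # repository id as shown on the model page


def pt_subfolder(step):
    """Build the path of the PyTorch export for one checkpoint step.

    `step` is the YYYYY part of `ckpt-YYYYY`, passed as a string so that
    no assumption is made about padding or numbering.
    """
    return f"post/ckpt-{step}/pt"


def load_checkpoint(step, repo_id=REPO_ID):
    """Load the tokenizer and MLM model stored under post/ckpt-<step>/pt."""
    # Imported lazily so pt_subfolder can be used without transformers installed.
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    subfolder = pt_subfolder(step)
    tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=subfolder)
    model = AutoModelForMaskedLM.from_pretrained(repo_id, subfolder=subfolder)
    return tokenizer, model
```

`load_checkpoint` downloads files from the Hub, so it needs network access and a step value that actually exists in the repository.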

Citation

@misc{antoun2024camembert20smarterfrench,
      title={CamemBERT 2.0: A Smarter French Language Model Aged to Perfection},
      author={Wissam Antoun and Francis Kulumba and Rian Touchent and Éric de la Clergerie and Benoît Sagot and Djamé Seddah},
      year={2024},
      eprint={2411.08868},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2411.08868},
}