---
license: mit
base_model: camembert/camembert-large
metrics:
- precision
- recall
- f1
- accuracy
model-index:
- name: NERmembert-large-4entities
  results: []
datasets:
- CATIE-AQ/frenchNER_4entities
language:
- fr
widget:
- text: "Assurés de disputer l'Euro 2024 en Allemagne l'été prochain (du 14 juin au 14 juillet) depuis leur victoire aux Pays-Bas, les Bleus ont fait le nécessaire pour avoir des certitudes. Avec six victoires en six matchs officiels et un seul but encaissé, Didier Deschamps a consolidé les acquis de la dernière Coupe du monde. Les joueurs clés sont connus : Kylian Mbappé, Aurélien Tchouameni, Antoine Griezmann, Ibrahima Konaté ou encore Mike Maignan."
library_name: transformers
pipeline_tag: token-classification
co2_eq_emissions: 80
---

# NERmembert-large-4entities

## Model Description

We present **NERmembert-large-4entities**, a [CamemBERT large](https://huggingface.co/camembert/camembert-large) model fine-tuned for the Named Entity Recognition task in French, on four French NER datasets covering 4 entity types (LOC, PER, ORG, MISC). All these datasets were concatenated and cleaned into a single dataset that we call [frenchNER_4entities](https://huggingface.co/datasets/CATIE-AQ/frenchNER_4entities). It contains a total of **384,773** rows, of which **328,757** are for training, **24,131** for validation and **31,885** for testing.

Our methodology is described in a blog post available in [English](https://blog.vaniila.ai/en/NER_en/) or [French](https://blog.vaniila.ai/NER/).

## Dataset

The dataset used is [frenchNER_4entities](https://huggingface.co/datasets/CATIE-AQ/frenchNER_4entities), which represents ~385k sentences labeled in 4 categories:

| Label | Examples |
|:------|:-----------------------------------------------------------|
| PER   | "La Bruyère", "Gaspard de Coligny", "Wittgenstein"          |
| ORG   | "UTBM", "American Airlines", "id Software"                  |
| LOC   | "République du Cap-Vert", "Créteil", "Bordeaux"             |
| MISC  | "Wolfenstein 3D", "Révolution française", "Coupe du monde"  |

The distribution of the entities is as follows:

| Splits     | O         | PER     | LOC     | ORG     | MISC    |
|:-----------|:----------|:--------|:--------|:--------|:--------|
| train      | 7,539,692 | 307,144 | 286,746 | 127,089 | 799,494 |
| validation | 544,580   | 24,034  | 21,585  | 5,927   | 18,221  |
| test       | 720,623   | 32,870  | 29,683  | 7,911   | 21,760  |
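
If you want to inspect these counts yourself, the dataset can be loaded with the `datasets` library. The sketch below assumes the usual `tokens`/`ner_tags` column layout of token-classification datasets on the Hub:

```python
# A minimal sketch for loading frenchNER_4entities and counting tags.
# Assumes the dataset exposes `tokens`/`ner_tags` columns, as is usual
# for token-classification datasets on the Hub.
from collections import Counter
from datasets import load_dataset

dataset = load_dataset("CATIE-AQ/frenchNER_4entities")
print(dataset)  # splits and row counts

# Tag occurrences in the training split; the id-to-label mapping
# is stored in the dataset's features.
counts = Counter(tag for row in dataset["train"]["ner_tags"] for tag in row)
print(counts)
```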
## Evaluation results

The evaluation was carried out using the [**evaluate**](https://pypi.org/project/evaluate/) Python package.

### frenchNER_4entities

For space reasons, we show only the F1 scores of the different models. You can see the full results below the table.

| Model | PER | LOC | ORG | MISC |
|:------|:------|:------|:------|:------|
| Jean-Baptiste/camembert-ner | 0.971 | 0.947 | 0.902 | 0.663 |
| cmarkea/distilcamembert-base-ner | 0.974 | 0.948 | 0.892 | 0.658 |
| NERmembert-base-3entities | 0.978 | 0.957 | 0.904 | 0 |
| NERmembert-base-4entities | 0.978 | 0.958 | 0.903 | 0.814 |
| NERmembert-large-4entities (this model) | 0.982 | 0.964 | 0.919 | 0.834 |
**Full results**

| Model | Metrics | PER | LOC | ORG | MISC | O | Overall |
|:------|:--------|:------|:------|:------|:------|:------|:--------|
| Jean-Baptiste/camembert-ner | Precision | 0.952 | 0.924 | 0.870 | 0.845 | 0.986 | 0.976 |
| | Recall | 0.990 | 0.972 | 0.938 | 0.546 | 0.992 | 0.976 |
| | F1 | 0.971 | 0.947 | 0.902 | 0.663 | 0.989 | 0.976 |
| cmarkea/distilcamembert-base-ner | Precision | 0.962 | 0.933 | 0.857 | 0.830 | 0.985 | 0.976 |
| | Recall | 0.987 | 0.963 | 0.930 | 0.545 | 0.993 | 0.976 |
| | F1 | 0.974 | 0.948 | 0.892 | 0.658 | 0.989 | 0.976 |
| NERmembert-base-3entities | Precision | 0.973 | 0.955 | 0.886 | 0 | X | X |
| | Recall | 0.983 | 0.960 | 0.923 | 0 | X | X |
| | F1 | 0.978 | 0.957 | 0.904 | 0 | X | X |
| NERmembert-base-4entities | Precision | 0.973 | 0.951 | 0.888 | 0.850 | 0.993 | 0.984 |
| | Recall | 0.983 | 0.964 | 0.918 | 0.781 | 0.993 | 0.984 |
| | F1 | 0.978 | 0.958 | 0.903 | 0.814 | 0.993 | 0.984 |
| NERmembert-large-4entities (this model) | Precision | 0.977 | 0.961 | 0.896 | 0.872 | 0.993 | 0.986 |
| | Recall | 0.987 | 0.966 | 0.943 | 0.798 | 0.995 | 0.986 |
| | F1 | 0.982 | 0.964 | 0.919 | 0.834 | 0.994 | 0.986 |
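
As an indication, here is a minimal sketch of how such entity-level metrics can be computed with the `evaluate` package and its `seqeval` metric; the tag sequences below are toy examples, not our test data:

```python
# Illustrative use of the `evaluate` package's seqeval metric;
# the predictions/references are toy examples, not our test set.
import evaluate

seqeval = evaluate.load("seqeval")
predictions = [["O", "B-PER", "I-PER", "O"]]
references = [["O", "B-PER", "I-PER", "B-LOC"]]
results = seqeval.compute(predictions=predictions, references=references)
print(results["overall_f1"])
print(results)  # also contains per-entity precision/recall/F1
```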
In detail:

### multiconer

For space reasons, we show only the F1 scores of the different models. You can see the full results below the table.

| Model | PER | LOC | ORG | MISC |
|:------|:------|:------|:------|:------|
| Jean-Baptiste/camembert-ner | 0.940 | 0.761 | 0.723 | 0.560 |
| cmarkea/distilcamembert-base-ner | 0.921 | 0.748 | 0.694 | 0.530 |
| NERmembert-base-3entities | 0.960 | 0.887 | 0.877 | 0 |
| NERmembert-base-4entities | 0.960 | 0.890 | 0.867 | 0.852 |
| NERmembert-large-4entities (this model) | 0.969 | 0.919 | 0.904 | 0.864 |
**Full results**

| Model | Metrics | PER | LOC | ORG | MISC | O | Overall |
|:------|:--------|:------|:------|:------|:------|:------|:--------|
| Jean-Baptiste/camembert-ner | Precision | 0.908 | 0.717 | 0.753 | 0.620 | 0.936 | 0.889 |
| | Recall | 0.975 | 0.811 | 0.696 | 0.511 | 0.938 | 0.889 |
| | F1 | 0.940 | 0.761 | 0.723 | 0.560 | 0.937 | 0.889 |
| cmarkea/distilcamembert-base-ner | Precision | 0.885 | 0.738 | 0.737 | 0.589 | 0.928 | 0.881 |
| | Recall | 0.960 | 0.759 | 0.655 | 0.482 | 0.939 | 0.881 |
| | F1 | 0.921 | 0.748 | 0.694 | 0.530 | 0.934 | 0.881 |
| NERmembert-base-3entities | Precision | 0.957 | 0.894 | 0.876 | 0 | X | X |
| | Recall | 0.962 | 0.880 | 0.878 | 0 | X | X |
| | F1 | 0.960 | 0.887 | 0.877 | 0 | X | X |
| NERmembert-base-4entities | Precision | 0.954 | 0.893 | 0.851 | 0.849 | 0.979 | 0.954 |
| | Recall | 0.967 | 0.887 | 0.883 | 0.855 | 0.974 | 0.954 |
| | F1 | 0.960 | 0.890 | 0.867 | 0.852 | 0.977 | 0.954 |
| NERmembert-large-4entities (this model) | Precision | 0.964 | 0.922 | 0.904 | 0.856 | 0.981 | 0.961 |
| | Recall | 0.975 | 0.917 | 0.904 | 0.872 | 0.976 | 0.961 |
| | F1 | 0.969 | 0.919 | 0.904 | 0.864 | 0.978 | 0.961 |
### multinerd

For space reasons, we show only the F1 scores of the different models. You can see the full results below the table.

| Model | PER | LOC | ORG | MISC |
|:------|:------|:------|:------|:------|
| Jean-Baptiste/camembert-ner | 0.962 | 0.934 | 0.888 | 0.419 |
| cmarkea/distilcamembert-base-ner | 0.972 | 0.938 | 0.884 | 0.430 |
| NERmembert-base-3entities | 0.985 | 0.973 | 0.938 | 0 |
| NERmembert-base-4entities | 0.985 | 0.973 | 0.938 | 0.770 |
| NERmembert-large-4entities (this model) | 0.987 | 0.976 | 0.948 | 0.790 |
**Full results**

| Model | Metrics | PER | LOC | ORG | MISC | O | Overall |
|:------|:--------|:------|:------|:------|:------|:------|:--------|
| Jean-Baptiste/camembert-ner | Precision | 0.931 | 0.893 | 0.827 | 0.725 | 0.979 | 0.966 |
| | Recall | 0.994 | 0.980 | 0.959 | 0.295 | 0.990 | 0.966 |
| | F1 | 0.962 | 0.934 | 0.888 | 0.419 | 0.984 | 0.966 |
| cmarkea/distilcamembert-base-ner | Precision | 0.954 | 0.908 | 0.817 | 0.705 | 0.977 | 0.967 |
| | Recall | 0.991 | 0.969 | 0.963 | 0.310 | 0.990 | 0.967 |
| | F1 | 0.972 | 0.938 | 0.884 | 0.430 | 0.984 | 0.967 |
| NERmembert-base-3entities | Precision | 0.974 | 0.965 | 0.910 | 0 | X | X |
| | Recall | 0.995 | 0.981 | 0.968 | 0 | X | X |
| | F1 | 0.985 | 0.973 | 0.938 | 0 | X | X |
| NERmembert-base-4entities | Precision | 0.976 | 0.961 | 0.910 | 0.829 | 0.991 | 0.983 |
| | Recall | 0.994 | 0.985 | 0.967 | 0.719 | 0.993 | 0.983 |
| | F1 | 0.985 | 0.973 | 0.938 | 0.770 | 0.992 | 0.983 |
| NERmembert-large-4entities (this model) | Precision | 0.979 | 0.967 | 0.922 | 0.852 | 0.991 | 0.985 |
| | Recall | 0.996 | 0.986 | 0.974 | 0.736 | 0.994 | 0.985 |
| | F1 | 0.987 | 0.976 | 0.948 | 0.790 | 0.993 | 0.985 |
### wikiner

For space reasons, we show only the F1 scores of the different models. You can see the full results below the table.

| Model | PER | LOC | ORG | MISC |
|:------|:------|:------|:------|:------|
| Jean-Baptiste/camembert-ner | 0.986 | 0.966 | 0.938 | 0.938 |
| cmarkea/distilcamembert-base-ner | 0.983 | 0.964 | 0.925 | 0.926 |
| NERmembert-base-3entities | 0.970 | 0.945 | 0.878 | 0 |
| NERmembert-base-4entities | 0.970 | 0.945 | 0.876 | 0.872 |
| NERmembert-large-4entities (this model) | 0.975 | 0.953 | 0.896 | 0.893 |
**Full results**

| Model | Metrics | PER | LOC | ORG | MISC | O | Overall |
|:------|:--------|:------|:------|:------|:------|:------|:--------|
| Jean-Baptiste/camembert-ner | Precision | 0.986 | 0.962 | 0.925 | 0.943 | 0.998 | 0.992 |
| | Recall | 0.987 | 0.969 | 0.951 | 0.933 | 0.997 | 0.992 |
| | F1 | 0.986 | 0.966 | 0.938 | 0.938 | 0.998 | 0.992 |
| cmarkea/distilcamembert-base-ner | Precision | 0.982 | 0.964 | 0.910 | 0.942 | 0.997 | 0.991 |
| | Recall | 0.985 | 0.963 | 0.940 | 0.910 | 0.998 | 0.991 |
| | F1 | 0.983 | 0.964 | 0.925 | 0.926 | 0.997 | 0.991 |
| NERmembert-base-3entities | Precision | 0.971 | 0.947 | 0.866 | 0 | X | X |
| | Recall | 0.969 | 0.943 | 0.891 | 0 | X | X |
| | F1 | 0.970 | 0.945 | 0.878 | 0 | X | X |
| NERmembert-base-4entities | Precision | 0.970 | 0.944 | 0.872 | 0.878 | 0.996 | 0.986 |
| | Recall | 0.969 | 0.947 | 0.880 | 0.866 | 0.996 | 0.986 |
| | F1 | 0.970 | 0.945 | 0.876 | 0.872 | 0.996 | 0.986 |
| NERmembert-large-4entities (this model) | Precision | 0.975 | 0.957 | 0.872 | 0.901 | 0.997 | 0.989 |
| | Recall | 0.975 | 0.949 | 0.922 | 0.884 | 0.997 | 0.989 |
| | F1 | 0.975 | 0.953 | 0.896 | 0.893 | 0.997 | 0.989 |
## Usage

### Code

```python
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="CATIE-AQ/NERmembert-large-4entities",
    tokenizer="CATIE-AQ/NERmembert-large-4entities",
    aggregation_strategy="simple",
)

results = ner(
    "Assurés de disputer l'Euro 2024 en Allemagne l'été prochain (du 14 juin au 14 juillet) depuis leur victoire aux Pays-Bas, les Bleus ont fait le nécessaire pour avoir des certitudes. Avec six victoires en six matchs officiels et un seul but encaissé, Didier Deschamps a consolidé les acquis de la dernière Coupe du monde. Les joueurs clés sont connus : Kylian Mbappé, Aurélien Tchouameni, Antoine Griezmann, Ibrahima Konaté ou encore Mike Maignan."
)

print(results)
```

The pipeline returns a list of detected entities, each as a dictionary with the entity group, confidence score, surface form, and character offsets.

### Try it through Space

A Space has been created to test the model. It is available [here](https://huggingface.co/spaces/CATIE-AQ/NERmembert).

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 3

### Training results

| Training Loss | Epoch | Step   | Validation Loss | Precision | Recall | F1     | Accuracy |
|:-------------:|:-----:|:------:|:---------------:|:---------:|:------:|:------:|:--------:|
| 0.0347        | 1.0   | 41095  | 0.0537          | 0.9832    | 0.9832 | 0.9832 | 0.9832   |
| 0.0237        | 2.0   | 82190  | 0.0448          | 0.9858    | 0.9858 | 0.9858 | 0.9858   |
| 0.0119        | 3.0   | 123285 | 0.0532          | 0.9860    | 0.9860 | 0.9860 | 0.9860   |

### Framework versions

- Transformers 4.36.2
- Pytorch 2.1.2
- Datasets 2.16.1
- Tokenizers 0.15.0

## Environmental Impact

*Carbon emissions were estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700). The hardware, runtime, cloud provider, and compute region were used to estimate the carbon impact.*

- **Hardware Type:** A100 PCIe 40/80GB
- **Hours used:** 4h17min
- **Cloud Provider:** Private Infrastructure
- **Carbon Efficiency (kg/kWh):** 0.078 (estimated from [electricitymaps](https://app.electricitymaps.com/zone/FR) for the day of January 10, 2024)
- **Carbon Emitted** *(Power consumption x Time x Carbon produced based on location of power grid)*: 0.08 kg eq. CO2
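
As a sanity check, the figure above can be roughly reproduced from the numbers listed; note that the average GPU power draw used below is an assumption (roughly the 250 W rating of an A100 PCIe), not a measured value:

```python
# Rough reproduction of the CO2 estimate above.
# The ~250 W average GPU draw is an assumption, not a measured figure.
power_kw = 0.250            # assumed average A100 PCIe power draw, in kW
hours = 4 + 17 / 60         # 4h17min of training
carbon_kg_per_kwh = 0.078   # grid carbon intensity listed above
print(round(power_kw * hours * carbon_kg_per_kwh, 2))  # -> 0.08 (kg CO2 eq.)
```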
## Citations

### NERmembert-large-4entities

```
TODO
```

### multiconer

```
@inproceedings{multiconer2-report,
  title={{SemEval-2023 Task 2: Fine-grained Multilingual Named Entity Recognition (MultiCoNER 2)}},
  author={Fetahu, Besnik and Kar, Sudipta and Chen, Zhiyu and Rokhlenko, Oleg and Malmasi, Shervin},
  booktitle={Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)},
  year={2023},
  publisher={Association for Computational Linguistics}}

@article{multiconer2-data,
  title={{MultiCoNER v2: a Large Multilingual dataset for Fine-grained and Noisy Named Entity Recognition}},
  author={Fetahu, Besnik and Chen, Zhiyu and Kar, Sudipta and Rokhlenko, Oleg and Malmasi, Shervin},
  year={2023}}
```

### multinerd

```
@inproceedings{tedeschi-navigli-2022-multinerd,
  title = "{M}ulti{NERD}: A Multilingual, Multi-Genre and Fine-Grained Dataset for Named Entity Recognition (and Disambiguation)",
  author = "Tedeschi, Simone and Navigli, Roberto",
  booktitle = "Findings of the Association for Computational Linguistics: NAACL 2022",
  month = jul,
  year = "2022",
  address = "Seattle, United States",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2022.findings-naacl.60",
  doi = "10.18653/v1/2022.findings-naacl.60",
  pages = "801--812"}
```

### pii-masking-200k

```
@misc{ai4privacy_2023,
  author = {{ai4Privacy}},
  title = {pii-masking-200k (Revision 1d4c0a1)},
  year = 2023,
  url = {https://huggingface.co/datasets/ai4privacy/pii-masking-200k},
  doi = {10.57967/hf/1532},
  publisher = {Hugging Face}}
```

### wikiner

```
@article{NOTHMAN2013151,
  title = {Learning multilingual named entity recognition from Wikipedia},
  journal = {Artificial Intelligence},
  volume = {194},
  pages = {151-175},
  year = {2013},
  note = {Artificial Intelligence, Wikipedia and Semi-Structured Resources},
  issn = {0004-3702},
  doi = {https://doi.org/10.1016/j.artint.2012.03.006},
  url = {https://www.sciencedirect.com/science/article/pii/S0004370212000276},
  author = {Joel Nothman and Nicky Ringland and Will Radford and Tara Murphy and James R. Curran}}
```

### frenchNER_4entities

```
TODO
```

### CamemBERT

```
@inproceedings{martin2020camembert,
  title={CamemBERT: a Tasty French Language Model},
  author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
  booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
  year={2020}}
```

## License

[cc-by-4.0](https://creativecommons.org/licenses/by/4.0/deed.en)