---
license: mit
base_model: camembert/camembert-large
metrics:
- precision
- recall
- f1
- accuracy
model-index:
- name: NERmembert-large-4entities
  results: []
datasets:
- CATIE-AQ/frenchNER_4entities
language:
- fr
widget:
- text: "Assurés de disputer l'Euro 2024 en Allemagne l'été prochain (du 14 juin au 14 juillet) depuis leur victoire aux Pays-Bas, les Bleus ont fait le nécessaire pour avoir des certitudes. Avec six victoires en six matchs officiels et un seul but encaissé, Didier Deschamps a consolidé les acquis de la dernière Coupe du monde. Les joueurs clés sont connus : Kylian Mbappé, Aurélien Tchouameni, Antoine Griezmann, Ibrahima Konaté ou encore Mike Maignan."
library_name: transformers
pipeline_tag: token-classification
co2_eq_emissions: 80
---

# NERmembert-large-4entities

## Model Description

We present **NERmembert-large-4entities**, a [CamemBERT large](https://huggingface.co/camembert/camembert-large) model fine-tuned for the Named Entity Recognition task in French, on four French NER datasets covering 4 entity types (LOC, PER, ORG, MISC). All these datasets were concatenated and cleaned into a single dataset that we call [frenchNER_4entities](https://huggingface.co/datasets/CATIE-AQ/frenchNER_4entities). It contains a total of **384,773** rows, of which **328,757** are for training, **24,131** for validation and **31,885** for testing.

Our methodology is described in a blog post available in [English](https://blog.vaniila.ai/en/NER_en/) or [French](https://blog.vaniila.ai/NER/).

## Dataset

The dataset used is [frenchNER_4entities](https://huggingface.co/datasets/CATIE-AQ/frenchNER_4entities), which represents ~385k sentences labeled in 4 categories:

| Label | Examples |
|:------|:-----------------------------------------------------------|
| PER   | "La Bruyère", "Gaspard de Coligny", "Wittgenstein"          |
| ORG   | "UTBM", "American Airlines", "id Software"                  |
| LOC   | "République du Cap-Vert", "Créteil", "Bordeaux"             |
| MISC  | "Wolfenstein 3D", "Révolution française", "Coupe du monde"  |

The distribution of the entities is as follows:

| Splits     | O         | PER     | LOC     | ORG     | MISC    |
|:-----------|:----------|:--------|:--------|:--------|:--------|
| train      | 7,539,692 | 307,144 | 286,746 | 127,089 | 799,494 |
| validation | 544,580   | 24,034  | 21,585  | 5,927   | 18,221  |
| test       | 720,623   | 32,870  | 29,683  | 7,911   | 21,760  |
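
If you want to inspect these counts yourself, the dataset can be loaded with the `datasets` library. The sketch below assumes the usual `tokens`/`ner_tags` column layout of token-classification datasets on the Hub:

```python
# A minimal sketch for loading frenchNER_4entities and counting tags.
# Assumes the dataset exposes `tokens`/`ner_tags` columns, as is usual
# for token-classification datasets on the Hub.
from collections import Counter
from datasets import load_dataset

dataset = load_dataset("CATIE-AQ/frenchNER_4entities")
print(dataset)  # splits and row counts

# Tag occurrences in the training split; the id-to-label mapping
# is stored in the dataset's features.
counts = Counter(tag for row in dataset["train"]["ner_tags"] for tag in row)
print(counts)
```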
## Evaluation results

The evaluation was carried out using the [**evaluate**](https://pypi.org/project/evaluate/) Python package.

### frenchNER_4entities

For space reasons, we show only the F1 scores of the different models. You can see the full results below the table.

| Model | PER | LOC | ORG | MISC |
|:------|:------|:------|:------|:------|
| Jean-Baptiste/camembert-ner | 0.971 | 0.947 | 0.902 | 0.663 |
| cmarkea/distilcamembert-base-ner | 0.974 | 0.948 | 0.892 | 0.658 |
| NERmembert-base-3entities | 0.978 | 0.957 | 0.904 | 0 |
| NERmembert-base-4entities | 0.978 | 0.958 | 0.903 | 0.814 |
| NERmembert-large-4entities (this model) | 0.982 | 0.964 | 0.919 | 0.834 |
**Full results**

| Model | Metrics | PER | LOC | ORG | MISC | O | Overall |
|:------|:--------|:------|:------|:------|:------|:------|:--------|
| Jean-Baptiste/camembert-ner | Precision | 0.952 | 0.924 | 0.870 | 0.845 | 0.986 | 0.976 |
| | Recall | 0.990 | 0.972 | 0.938 | 0.546 | 0.992 | 0.976 |
| | F1 | 0.971 | 0.947 | 0.902 | 0.663 | 0.989 | 0.976 |
| cmarkea/distilcamembert-base-ner | Precision | 0.962 | 0.933 | 0.857 | 0.830 | 0.985 | 0.976 |
| | Recall | 0.987 | 0.963 | 0.930 | 0.545 | 0.993 | 0.976 |
| | F1 | 0.974 | 0.948 | 0.892 | 0.658 | 0.989 | 0.976 |
| NERmembert-base-3entities | Precision | 0.973 | 0.955 | 0.886 | 0 | X | X |
| | Recall | 0.983 | 0.960 | 0.923 | 0 | X | X |
| | F1 | 0.978 | 0.957 | 0.904 | 0 | X | X |
| NERmembert-base-4entities | Precision | 0.973 | 0.951 | 0.888 | 0.850 | 0.993 | 0.984 |
| | Recall | 0.983 | 0.964 | 0.918 | 0.781 | 0.993 | 0.984 |
| | F1 | 0.978 | 0.958 | 0.903 | 0.814 | 0.993 | 0.984 |
| NERmembert-large-4entities (this model) | Precision | 0.977 | 0.961 | 0.896 | 0.872 | 0.993 | 0.986 |
| | Recall | 0.987 | 0.966 | 0.943 | 0.798 | 0.995 | 0.986 |
| | F1 | 0.982 | 0.964 | 0.919 | 0.834 | 0.994 | 0.986 |
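
As an indication, here is a minimal sketch of how such entity-level metrics can be computed with the `evaluate` package and its `seqeval` metric; the tag sequences below are toy examples, not our test data:

```python
# Illustrative use of the `evaluate` package's seqeval metric;
# the predictions/references are toy examples, not our test set.
import evaluate

seqeval = evaluate.load("seqeval")
predictions = [["O", "B-PER", "I-PER", "O"]]
references = [["O", "B-PER", "I-PER", "B-LOC"]]
results = seqeval.compute(predictions=predictions, references=references)
print(results["overall_f1"])
print(results)  # also contains per-entity precision/recall/F1
```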
In detail:

### multiconer

For space reasons, we show only the F1 scores of the different models. You can see the full results below the table.

| Model | PER | LOC | ORG | MISC |
|:------|:------|:------|:------|:------|
| Jean-Baptiste/camembert-ner | 0.940 | 0.761 | 0.723 | 0.560 |
| cmarkea/distilcamembert-base-ner | 0.921 | 0.748 | 0.694 | 0.530 |
| NERmembert-base-3entities | 0.960 | 0.887 | 0.877 | 0 |
| NERmembert-base-4entities | 0.960 | 0.890 | 0.867 | 0.852 |
| NERmembert-large-4entities (this model) | 0.969 | 0.919 | 0.904 | 0.864 |
**Full results**

| Model | Metrics | PER | LOC | ORG | MISC | O | Overall |
|:------|:--------|:------|:------|:------|:------|:------|:--------|
| Jean-Baptiste/camembert-ner | Precision | 0.908 | 0.717 | 0.753 | 0.620 | 0.936 | 0.889 |
| | Recall | 0.975 | 0.811 | 0.696 | 0.511 | 0.938 | 0.889 |
| | F1 | 0.940 | 0.761 | 0.723 | 0.560 | 0.937 | 0.889 |
| cmarkea/distilcamembert-base-ner | Precision | 0.885 | 0.738 | 0.737 | 0.589 | 0.928 | 0.881 |
| | Recall | 0.960 | 0.759 | 0.655 | 0.482 | 0.939 | 0.881 |
| | F1 | 0.921 | 0.748 | 0.694 | 0.530 | 0.934 | 0.881 |
| NERmembert-base-3entities | Precision | 0.957 | 0.894 | 0.876 | 0 | X | X |
| | Recall | 0.962 | 0.880 | 0.878 | 0 | X | X |
| | F1 | 0.960 | 0.887 | 0.877 | 0 | X | X |
| NERmembert-base-4entities | Precision | 0.954 | 0.893 | 0.851 | 0.849 | 0.979 | 0.954 |
| | Recall | 0.967 | 0.887 | 0.883 | 0.855 | 0.974 | 0.954 |
| | F1 | 0.960 | 0.890 | 0.867 | 0.852 | 0.977 | 0.954 |
| NERmembert-large-4entities (this model) | Precision | 0.964 | 0.922 | 0.904 | 0.856 | 0.981 | 0.961 |
| | Recall | 0.975 | 0.917 | 0.904 | 0.872 | 0.976 | 0.961 |
| | F1 | 0.969 | 0.919 | 0.904 | 0.864 | 0.978 | 0.961 |
### multinerd

For space reasons, we show only the F1 scores of the different models. You can see the full results below the table.

| Model | PER | LOC | ORG | MISC |
|:------|:------|:------|:------|:------|
| Jean-Baptiste/camembert-ner | 0.962 | 0.934 | 0.888 | 0.419 |
| cmarkea/distilcamembert-base-ner | 0.972 | 0.938 | 0.884 | 0.430 |
| NERmembert-base-3entities | 0.985 | 0.973 | 0.938 | 0 |
| NERmembert-base-4entities | 0.985 | 0.973 | 0.938 | 0.770 |
| NERmembert-large-4entities (this model) | 0.987 | 0.976 | 0.948 | 0.790 |
**Full results**

| Model | Metrics | PER | LOC | ORG | MISC | O | Overall |
|:------|:--------|:------|:------|:------|:------|:------|:--------|
| Jean-Baptiste/camembert-ner | Precision | 0.931 | 0.893 | 0.827 | 0.725 | 0.979 | 0.966 |
| | Recall | 0.994 | 0.980 | 0.959 | 0.295 | 0.990 | 0.966 |
| | F1 | 0.962 | 0.934 | 0.888 | 0.419 | 0.984 | 0.966 |
| cmarkea/distilcamembert-base-ner | Precision | 0.954 | 0.908 | 0.817 | 0.705 | 0.977 | 0.967 |
| | Recall | 0.991 | 0.969 | 0.963 | 0.310 | 0.990 | 0.967 |
| | F1 | 0.972 | 0.938 | 0.884 | 0.430 | 0.984 | 0.967 |
| NERmembert-base-3entities | Precision | 0.974 | 0.965 | 0.910 | 0 | X | X |
| | Recall | 0.995 | 0.981 | 0.968 | 0 | X | X |
| | F1 | 0.985 | 0.973 | 0.938 | 0 | X | X |
| NERmembert-base-4entities | Precision | 0.976 | 0.961 | 0.910 | 0.829 | 0.991 | 0.983 |
| | Recall | 0.994 | 0.985 | 0.967 | 0.719 | 0.993 | 0.983 |
| | F1 | 0.985 | 0.973 | 0.938 | 0.770 | 0.992 | 0.983 |
| NERmembert-large-4entities (this model) | Precision | 0.979 | 0.967 | 0.922 | 0.852 | 0.991 | 0.985 |
| | Recall | 0.996 | 0.986 | 0.974 | 0.736 | 0.994 | 0.985 |
| | F1 | 0.987 | 0.976 | 0.948 | 0.790 | 0.993 | 0.985 |
### wikiner

For space reasons, we show only the F1 scores of the different models. You can see the full results below the table.

| Model | PER | LOC | ORG | MISC |
|:------|:------|:------|:------|:------|
| Jean-Baptiste/camembert-ner | 0.986 | 0.966 | 0.938 | 0.938 |
| cmarkea/distilcamembert-base-ner | 0.983 | 0.964 | 0.925 | 0.926 |
| NERmembert-base-3entities | 0.970 | 0.945 | 0.878 | 0 |
| NERmembert-base-4entities | 0.970 | 0.945 | 0.876 | 0.872 |
| NERmembert-large-4entities (this model) | 0.975 | 0.953 | 0.896 | 0.893 |
**Full results**

| Model | Metrics | PER | LOC | ORG | MISC | O | Overall |
|:------|:--------|:------|:------|:------|:------|:------|:--------|
| Jean-Baptiste/camembert-ner | Precision | 0.986 | 0.962 | 0.925 | 0.943 | 0.998 | 0.992 |
| | Recall | 0.987 | 0.969 | 0.951 | 0.933 | 0.997 | 0.992 |
| | F1 | 0.986 | 0.966 | 0.938 | 0.938 | 0.998 | 0.992 |
| cmarkea/distilcamembert-base-ner | Precision | 0.982 | 0.964 | 0.910 | 0.942 | 0.997 | 0.991 |
| | Recall | 0.985 | 0.963 | 0.940 | 0.910 | 0.998 | 0.991 |
| | F1 | 0.983 | 0.964 | 0.925 | 0.926 | 0.997 | 0.991 |
| NERmembert-base-3entities | Precision | 0.971 | 0.947 | 0.866 | 0 | X | X |
| | Recall | 0.969 | 0.943 | 0.891 | 0 | X | X |
| | F1 | 0.970 | 0.945 | 0.878 | 0 | X | X |
| NERmembert-base-4entities | Precision | 0.970 | 0.944 | 0.872 | 0.878 | 0.996 | 0.986 |
| | Recall | 0.969 | 0.947 | 0.880 | 0.866 | 0.996 | 0.986 |
| | F1 | 0.970 | 0.945 | 0.876 | 0.872 | 0.996 | 0.986 |
| NERmembert-large-4entities (this model) | Precision | 0.975 | 0.957 | 0.872 | 0.901 | 0.997 | 0.989 |
| | Recall | 0.975 | 0.949 | 0.922 | 0.884 | 0.997 | 0.989 |
| | F1 | 0.975 | 0.953 | 0.896 | 0.893 | 0.997 | 0.989 |
## Usage

### Code

```python
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="CATIE-AQ/NERmembert-large-4entities",
    tokenizer="CATIE-AQ/NERmembert-large-4entities",
    aggregation_strategy="simple",
)

results = ner(
    "Assurés de disputer l'Euro 2024 en Allemagne l'été prochain (du 14 juin au 14 juillet) depuis leur victoire aux Pays-Bas, les Bleus ont fait le nécessaire pour avoir des certitudes. Avec six victoires en six matchs officiels et un seul but encaissé, Didier Deschamps a consolidé les acquis de la dernière Coupe du monde. Les joueurs clés sont connus : Kylian Mbappé, Aurélien Tchouameni, Antoine Griezmann, Ibrahima Konaté ou encore Mike Maignan."
)

print(results)
```

The pipeline returns a list of detected entities, each as a dictionary with the entity group, confidence score, surface form, and character offsets.

### Try it through Space

A Space has been created to test the model. It is available [here](https://huggingface.co/spaces/CATIE-AQ/NERmembert).

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 3

### Training results

| Training Loss | Epoch | Step   | Validation Loss | Precision | Recall | F1     | Accuracy |
|:-------------:|:-----:|:------:|:---------------:|:---------:|:------:|:------:|:--------:|
| 0.0347        | 1.0   | 41095  | 0.0537          | 0.9832    | 0.9832 | 0.9832 | 0.9832   |
| 0.0237        | 2.0   | 82190  | 0.0448          | 0.9858    | 0.9858 | 0.9858 | 0.9858   |
| 0.0119        | 3.0   | 123285 | 0.0532          | 0.9860    | 0.9860 | 0.9860 | 0.9860   |

### Framework versions

- Transformers 4.36.2
- Pytorch 2.1.2
- Datasets 2.16.1
- Tokenizers 0.15.0

## Environmental Impact

*Carbon emissions were estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700). The hardware, runtime, cloud provider, and compute region were used to estimate the carbon impact.*

- **Hardware Type:** A100 PCIe 40/80GB
- **Hours used:** 4h17min
- **Cloud Provider:** Private Infrastructure
- **Carbon Efficiency (kg/kWh):** 0.078 (estimated from [electricitymaps](https://app.electricitymaps.com/zone/FR) for the day of January 10, 2024)
- **Carbon Emitted** *(Power consumption x Time x Carbon produced based on location of power grid)*: 0.08 kg eq. CO2
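
As a sanity check, the figure above can be roughly reproduced from the numbers listed; note that the average GPU power draw used below is an assumption (roughly the 250 W rating of an A100 PCIe), not a measured value:

```python
# Rough reproduction of the CO2 estimate above.
# The ~250 W average GPU draw is an assumption, not a measured figure.
power_kw = 0.250            # assumed average A100 PCIe power draw, in kW
hours = 4 + 17 / 60         # 4h17min of training
carbon_kg_per_kwh = 0.078   # grid carbon intensity listed above
print(round(power_kw * hours * carbon_kg_per_kwh, 2))  # -> 0.08 (kg CO2 eq.)
```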
## Citations

### NERmembert-large-4entities

```
TODO
```

### multiconer

```
@inproceedings{multiconer2-report,
  title={{SemEval-2023 Task 2: Fine-grained Multilingual Named Entity Recognition (MultiCoNER 2)}},
  author={Fetahu, Besnik and Kar, Sudipta and Chen, Zhiyu and Rokhlenko, Oleg and Malmasi, Shervin},
  booktitle={Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)},
  year={2023},
  publisher={Association for Computational Linguistics}}

@article{multiconer2-data,
  title={{MultiCoNER v2: a Large Multilingual dataset for Fine-grained and Noisy Named Entity Recognition}},
  author={Fetahu, Besnik and Chen, Zhiyu and Kar, Sudipta and Rokhlenko, Oleg and Malmasi, Shervin},
  year={2023}}
```

### multinerd

```
@inproceedings{tedeschi-navigli-2022-multinerd,
  title = "{M}ulti{NERD}: A Multilingual, Multi-Genre and Fine-Grained Dataset for Named Entity Recognition (and Disambiguation)",
  author = "Tedeschi, Simone and Navigli, Roberto",
  booktitle = "Findings of the Association for Computational Linguistics: NAACL 2022",
  month = jul,
  year = "2022",
  address = "Seattle, United States",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2022.findings-naacl.60",
  doi = "10.18653/v1/2022.findings-naacl.60",
  pages = "801--812"}
```

### pii-masking-200k

```
@misc{ai4privacy_2023,
  author = {{ai4Privacy}},
  title = {pii-masking-200k (Revision 1d4c0a1)},
  year = 2023,
  url = {https://huggingface.co/datasets/ai4privacy/pii-masking-200k},
  doi = {10.57967/hf/1532},
  publisher = {Hugging Face}}
```

### wikiner

```
@article{NOTHMAN2013151,
  title = {Learning multilingual named entity recognition from Wikipedia},
  journal = {Artificial Intelligence},
  volume = {194},
  pages = {151-175},
  year = {2013},
  note = {Artificial Intelligence, Wikipedia and Semi-Structured Resources},
  issn = {0004-3702},
  doi = {https://doi.org/10.1016/j.artint.2012.03.006},
  url = {https://www.sciencedirect.com/science/article/pii/S0004370212000276},
  author = {Joel Nothman and Nicky Ringland and Will Radford and Tara Murphy and James R. Curran}}
```

### frenchNER_4entities

```
TODO
```

### CamemBERT

```
@inproceedings{martin2020camembert,
  title={CamemBERT: a Tasty French Language Model},
  author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
  booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
  year={2020}}
```

## License

[cc-by-4.0](https://creativecommons.org/licenses/by/4.0/deed.en)