---
license: mit
base_model: camembert/camembert-large
metrics:
- precision
- recall
- f1
- accuracy
model-index:
- name: NERmembert-large-4entities
  results: []
datasets:
- CATIE-AQ/frenchNER_4entities
language:
- fr
widget:
- text: "Assurés de disputer l'Euro 2024 en Allemagne l'été prochain (du 14 juin au 14 juillet) depuis leur victoire aux Pays-Bas, les Bleus ont fait le nécessaire pour avoir des certitudes. Avec six victoires en six matchs officiels et un seul but encaissé, Didier Deschamps a consolidé les acquis de la dernière Coupe du monde. Les joueurs clés sont connus : Kylian Mbappé, Aurélien Tchouameni, Antoine Griezmann, Ibrahima Konaté ou encore Mike Maignan."
library_name: transformers
pipeline_tag: token-classification
co2_eq_emissions: 80
---

# NERmembert-large-4entities

## Model Description

We present **NERmembert-large-4entities**, a [CamemBERT large](https://huggingface.co/camembert/camembert-large) model fine-tuned for French Named Entity Recognition on four French NER datasets covering 4 entity types (LOC, PER, ORG, MISC).
These datasets were concatenated and cleaned into a single dataset that we called [frenchNER_4entities](https://huggingface.co/datasets/CATIE-AQ/frenchNER_4entities).
It contains **384,773** rows in total: **328,757** for training, **24,131** for validation and **31,885** for testing.
Our methodology is described in a blog post available in [English](https://blog.vaniila.ai/en/NER_en/) or [French](https://blog.vaniila.ai/NER/).
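
As a usage sketch, a model of this kind can be loaded through the `transformers` token-classification pipeline. The repository id `CATIE-AQ/NERmembert-large-4entities` is an assumption inferred from the model name and the dataset's organization, and `load_ner` and `group_by_label` are hypothetical helpers added for illustration; none of them come from this card.

```python
from collections import defaultdict

# Assumed Hub repository id (inferred from the model and dataset names,
# not stated in this card).
MODEL_ID = "CATIE-AQ/NERmembert-large-4entities"


def load_ner(model_id: str = MODEL_ID):
    """Build a token-classification pipeline that merges sub-word tokens into spans."""
    # Imported lazily so the helper below can be used without transformers installed.
    from transformers import pipeline

    return pipeline(
        "token-classification",
        model=model_id,
        tokenizer=model_id,
        aggregation_strategy="simple",
    )


def group_by_label(entities):
    """Group aggregated pipeline output (dicts with 'entity_group' and 'word') by label."""
    grouped = defaultdict(list)
    for entity in entities:
        grouped[entity["entity_group"]].append(entity["word"])
    return dict(grouped)


# Usage (downloads the model weights on first call):
# ner = load_ner()
# text = "Didier Deschamps a consolidé les acquis de la dernière Coupe du monde."
# print(group_by_label(ner(text)))
```

With `aggregation_strategy="simple"`, the pipeline returns one dictionary per entity span rather than one per sub-word token, which is why the helper can read `entity_group` and `word` directly.
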

## Dataset

The dataset used is [frenchNER_4entities](https://huggingface.co/datasets/CATIE-AQ/frenchNER_4entities), which represents ~385k sentences labeled in 4 categories:

| Label | Examples                                                   |
|:------|:-----------------------------------------------------------|
| PER   | "La Bruyère", "Gaspard de Coligny", "Wittgenstein"         |
| ORG   | "UTBM", "American Airlines", "id Software"                 |
| LOC   | "République du Cap-Vert", "Créteil", "Bordeaux"            |
| MISC  | "Wolfenstein 3D", "Révolution française", "Coupe du monde" |

The distribution of the entities is as follows:

| Splits | O | PER | LOC | ORG | MISC |
|---|---|---|---|---|---|
| train | 7,539,692 | 307,144 | 286,746 | 127,089 | 799,494 |
| validation | 544,580 | 24,034 | 21,585 | 5,927 | 18,221 |
| test | 720,623 | 32,870 | 29,683 | 7,911 | 21,760 |
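
To make the class imbalance in the distribution above concrete, the following sketch computes each label's share of the counts in a split. The counts are copied from the table; the function name `label_shares` is illustrative only.

```python
# Label counts per split, copied from the distribution table above.
COUNTS = {
    "train": {"O": 7_539_692, "PER": 307_144, "LOC": 286_746, "ORG": 127_089, "MISC": 799_494},
    "validation": {"O": 544_580, "PER": 24_034, "LOC": 21_585, "ORG": 5_927, "MISC": 18_221},
    "test": {"O": 720_623, "PER": 32_870, "LOC": 29_683, "ORG": 7_911, "MISC": 21_760},
}


def label_shares(split: str) -> dict[str, float]:
    """Return each label's fraction of all counted labels in the given split."""
    counts = COUNTS[split]
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


for split in COUNTS:
    shares = label_shares(split)
    print(split, {label: f"{share:.1%}" for label, share in shares.items()})
```

The `O` label dominates every split, which is one reason to compare models on the per-entity scores rather than on the overall figure alone.
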

F1 score per entity type:

| Model | PER | LOC | ORG | MISC |
|---|---|---|---|---|
| Jean-Baptiste/camembert-ner | 0.971 | 0.947 | 0.902 | 0.663 |
| cmarkea/distilcamembert-base-ner | 0.974 | 0.948 | 0.892 | 0.658 |
| NERmembert-base-3entities | 0.978 | 0.957 | 0.904 | 0 |
| NERmembert-base-4entities | 0.978 | 0.958 | 0.903 | 0.814 |
| NERmembert-large-4entities (this model) | 0.982 | 0.964 | 0.919 | 0.834 |

| Model | Metrics | PER | LOC | ORG | MISC | O | Overall |
|---|---|---|---|---|---|---|---|
| Jean-Baptiste/camembert-ner | Precision | 0.952 | 0.924 | 0.870 | 0.845 | 0.986 | 0.976 |
| | Recall | 0.990 | 0.972 | 0.938 | 0.546 | 0.992 | 0.976 |
| | F1 | 0.971 | 0.947 | 0.902 | 0.663 | 0.989 | 0.976 |
| cmarkea/distilcamembert-base-ner | Precision | 0.962 | 0.933 | 0.857 | 0.830 | 0.985 | 0.976 |
| | Recall | 0.987 | 0.963 | 0.930 | 0.545 | 0.993 | 0.976 |
| | F1 | 0.974 | 0.948 | 0.892 | 0.658 | 0.989 | 0.976 |
| NERmembert-base-3entities | Precision | 0.973 | 0.955 | 0.886 | 0 | X | X |
| | Recall | 0.983 | 0.960 | 0.923 | 0 | X | X |
| | F1 | 0.978 | 0.957 | 0.904 | 0 | X | X |
| NERmembert-base-4entities | Precision | 0.973 | 0.951 | 0.888 | 0.850 | 0.993 | 0.984 |
| | Recall | 0.983 | 0.964 | 0.918 | 0.781 | 0.993 | 0.984 |
| | F1 | 0.978 | 0.958 | 0.903 | 0.814 | 0.993 | 0.984 |
| NERmembert-large-4entities (this model) | Precision | 0.977 | 0.961 | 0.896 | 0.872 | 0.993 | 0.986 |
| | Recall | 0.987 | 0.966 | 0.943 | 0.798 | 0.995 | 0.986 |
| | F1 | 0.982 | 0.964 | 0.919 | 0.834 | 0.994 | 0.986 |
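
The per-entity F1 rows above are consistent with the standard definition of F1 as the harmonic mean of precision and recall, F1 = 2PR / (P + R). A minimal sketch that recomputes this model's entries from the table above; because the published precision and recall are already rounded to three decimals, a recomputed value can differ from the reported one in the last digit. The names `f1_score` and `REPORTED` are illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) for NERmembert-large-4entities,
# copied from the table above.
REPORTED = {
    "PER": (0.977, 0.987, 0.982),
    "LOC": (0.961, 0.966, 0.964),
    "ORG": (0.896, 0.943, 0.919),
    "MISC": (0.872, 0.798, 0.834),
}

for label, (precision, recall, reported) in REPORTED.items():
    recomputed = f1_score(precision, recall)
    print(f"{label}: recomputed {recomputed:.3f}, reported {reported:.3f}")
```

The same check applies to any row of the detailed tables; the gap between precision and recall on MISC is what pulls its F1 below the other entity types.
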

F1 score per entity type:

| Model | PER | LOC | ORG | MISC |
|---|---|---|---|---|
| Jean-Baptiste/camembert-ner | 0.940 | 0.761 | 0.723 | 0.560 |
| cmarkea/distilcamembert-base-ner | 0.921 | 0.748 | 0.694 | 0.530 |
| NERmembert-base-3entities | 0.960 | 0.887 | 0.877 | 0 |
| NERmembert-base-4entities | 0.960 | 0.890 | 0.867 | 0.852 |
| NERmembert-large-4entities (this model) | 0.969 | 0.919 | 0.904 | 0.864 |

| Model | Metrics | PER | LOC | ORG | MISC | O | Overall |
|---|---|---|---|---|---|---|---|
| Jean-Baptiste/camembert-ner | Precision | 0.908 | 0.717 | 0.753 | 0.620 | 0.936 | 0.889 |
| | Recall | 0.975 | 0.811 | 0.696 | 0.511 | 0.938 | 0.889 |
| | F1 | 0.940 | 0.761 | 0.723 | 0.560 | 0.937 | 0.889 |
| cmarkea/distilcamembert-base-ner | Precision | 0.885 | 0.738 | 0.737 | 0.589 | 0.928 | 0.881 |
| | Recall | 0.960 | 0.759 | 0.655 | 0.482 | 0.939 | 0.881 |
| | F1 | 0.921 | 0.748 | 0.694 | 0.530 | 0.934 | 0.881 |
| NERmembert-base-3entities | Precision | 0.957 | 0.894 | 0.876 | 0 | X | X |
| | Recall | 0.962 | 0.880 | 0.878 | 0 | X | X |
| | F1 | 0.960 | 0.887 | 0.877 | 0 | X | X |
| NERmembert-base-4entities | Precision | 0.954 | 0.893 | 0.851 | 0.849 | 0.979 | 0.954 |
| | Recall | 0.967 | 0.887 | 0.883 | 0.855 | 0.974 | 0.954 |
| | F1 | 0.960 | 0.890 | 0.867 | 0.852 | 0.977 | 0.954 |
| NERmembert-large-4entities (this model) | Precision | 0.964 | 0.922 | 0.904 | 0.856 | 0.981 | 0.961 |
| | Recall | 0.975 | 0.917 | 0.904 | 0.872 | 0.976 | 0.961 |
| | F1 | 0.969 | 0.919 | 0.904 | 0.864 | 0.978 | 0.961 |

F1 score per entity type:

| Model | PER | LOC | ORG | MISC |
|---|---|---|---|---|
| Jean-Baptiste/camembert-ner | 0.962 | 0.934 | 0.888 | 0.419 |
| cmarkea/distilcamembert-base-ner | 0.972 | 0.938 | 0.884 | 0.430 |
| NERmembert-base-3entities | 0.985 | 0.973 | 0.938 | 0 |
| NERmembert-base-4entities | 0.985 | 0.973 | 0.938 | 0.770 |
| NERmembert-large-4entities (this model) | 0.987 | 0.976 | 0.948 | 0.790 |

| Model | Metrics | PER | LOC | ORG | MISC | O | Overall |
|---|---|---|---|---|---|---|---|
| Jean-Baptiste/camembert-ner | Precision | 0.931 | 0.893 | 0.827 | 0.725 | 0.979 | 0.966 |
| | Recall | 0.994 | 0.980 | 0.959 | 0.295 | 0.990 | 0.966 |
| | F1 | 0.962 | 0.934 | 0.888 | 0.419 | 0.984 | 0.966 |
| cmarkea/distilcamembert-base-ner | Precision | 0.954 | 0.908 | 0.817 | 0.705 | 0.977 | 0.967 |
| | Recall | 0.991 | 0.969 | 0.963 | 0.310 | 0.990 | 0.967 |
| | F1 | 0.972 | 0.938 | 0.884 | 0.430 | 0.984 | 0.967 |
| NERmembert-base-3entities | Precision | 0.974 | 0.965 | 0.910 | 0 | X | X |
| | Recall | 0.995 | 0.981 | 0.968 | 0 | X | X |
| | F1 | 0.985 | 0.973 | 0.938 | 0 | X | X |
| NERmembert-base-4entities | Precision | 0.976 | 0.961 | 0.910 | 0.829 | 0.991 | 0.983 |
| | Recall | 0.994 | 0.985 | 0.967 | 0.719 | 0.993 | 0.983 |
| | F1 | 0.985 | 0.973 | 0.938 | 0.770 | 0.992 | 0.983 |
| NERmembert-large-4entities (this model) | Precision | 0.979 | 0.967 | 0.922 | 0.852 | 0.991 | 0.985 |
| | Recall | 0.996 | 0.986 | 0.974 | 0.736 | 0.994 | 0.985 |
| | F1 | 0.987 | 0.976 | 0.948 | 0.790 | 0.993 | 0.985 |

F1 score per entity type:

| Model | PER | LOC | ORG | MISC |
|---|---|---|---|---|
| Jean-Baptiste/camembert-ner | 0.986 | 0.966 | 0.938 | 0.938 |
| cmarkea/distilcamembert-base-ner | 0.983 | 0.964 | 0.925 | 0.926 |
| NERmembert-base-3entities | 0.970 | 0.945 | 0.878 | 0 |
| NERmembert-base-4entities | 0.970 | 0.945 | 0.876 | 0.872 |
| NERmembert-large-4entities (this model) | 0.975 | 0.953 | 0.896 | 0.893 |

| Model | Metrics | PER | LOC | ORG | MISC | O | Overall |
|---|---|---|---|---|---|---|---|
| Jean-Baptiste/camembert-ner | Precision | 0.986 | 0.962 | 0.925 | 0.943 | 0.998 | 0.992 |
| | Recall | 0.987 | 0.969 | 0.951 | 0.933 | 0.997 | 0.992 |
| | F1 | 0.986 | 0.966 | 0.938 | 0.938 | 0.998 | 0.992 |
| cmarkea/distilcamembert-base-ner | Precision | 0.982 | 0.964 | 0.910 | 0.942 | 0.997 | 0.991 |
| | Recall | 0.985 | 0.963 | 0.940 | 0.910 | 0.998 | 0.991 |
| | F1 | 0.983 | 0.964 | 0.925 | 0.926 | 0.997 | 0.991 |
| NERmembert-base-3entities | Precision | 0.971 | 0.947 | 0.866 | 0 | X | X |
| | Recall | 0.969 | 0.943 | 0.891 | 0 | X | X |
| | F1 | 0.970 | 0.945 | 0.878 | 0 | X | X |
| NERmembert-base-4entities | Precision | 0.970 | 0.944 | 0.872 | 0.878 | 0.996 | 0.986 |
| | Recall | 0.969 | 0.947 | 0.880 | 0.866 | 0.996 | 0.986 |
| | F1 | 0.970 | 0.945 | 0.876 | 0.872 | 0.996 | 0.986 |
| NERmembert-large-4entities (this model) | Precision | 0.975 | 0.957 | 0.872 | 0.901 | 0.997 | 0.989 |
| | Recall | 0.975 | 0.949 | 0.922 | 0.884 | 0.997 | 0.989 |
| | F1 | 0.975 | 0.953 | 0.896 | 0.893 | 0.997 | 0.989 |