---
license: eupl-1.1
datasets:
- EuropeanParliament/cellar_eurovoc
language:
- en
metrics:
  - type: f1         
    value: 0.8345 
    name: micro F1
    args:
      threshold: 0.46
  - type: NDCG@3
    value: 0.8819
    name: NDCG@3
  - type: NDCG@5         
    value: 0.8689 
    name: NDCG@5
  - type: NDCG@10         
    value: 0.8780 
    name: NDCG@10
tags:
- eurovoc
pipeline_tag: text-classification

widget:
- text: "The Union condemns the continuing grave human rights violations by the Myanmar armed forces, including torture, sexual and gender-based violence, the persecution of civil society actors, human rights defenders and journalists, and attacks on the civilian population, including ethnic and religious minorities."
 
---

# Eurovoc Multilabel Classifier 🇪🇺

[EuroVoc](https://op.europa.eu/fr/web/eu-vocabularies) is a large, multidisciplinary, multilingual hierarchical thesaurus (available in the 24 languages of the 🇪🇺) with more than 7000 classes covering the activities of EU institutions.
Given the number of legal documents produced every day and the huge mass of pre-existing documents to be classified, high-quality automated or semi-automated classification methods are most welcome in this domain.

This model, based on the BERT deep neural network architecture, was trained on more than 3,200,000 documents for this task and is used in a production environment through a Hugging Face Inference Endpoint.
The model supports the 24 languages of the European Union.


## Examples

In English 🇬🇧 :

```
text = "The Union condemns the continuing grave human rights violations by the Myanmar armed forces, including torture, sexual and gender-based violence, the persecution of civil society actors, human rights defenders and journalists, and attacks on the civilian population, including ethnic and religious minorities."


human rights	0.984
ethnic group	0.9743
Burma/Myanmar	0.9727
protection of minorities	0.9586
religious discrimination	0.6038
ethnic discrimination	0.5834
political violence	0.5828
```

In French 🇫🇷: 

```
text = "En juillet 2023, la Commission a présenté un paquet de propositions pour l'écologisation du transport de marchandises. Parmi les trois propositions, l'une porte sur l'amélioration de l'utilisation des capacités de l'infrastructure ferroviaire. Le texte proposé comprend des modifications des règles relatives à la planification et à la répartition des capacités d'infrastructure ferroviaire, actuellement couvertes par la directive 2012/34/UE et le règlement (UE) n° 913/2010. L'objectif de ces modifications est de permettre une gestion plus efficace des capacités de l'infrastructure ferroviaire et du trafic, afin d'améliorer la qualité des services et d'optimiser l'utilisation du réseau ferroviaire, d'accueillir des volumes de trafic plus importants et de veiller à ce que le secteur des transports contribue à la décarbonisation."

transport infrastructure 0.998161256313324
rail network 0.9951391220092773
common transport policy 0.9791265726089478
transport market 0.9368429780006409
trans-European network 0.9098047614097595
high-speed transport 0.4887568950653076
carriage of goods 0.4874659776687622
```

In German 🇩🇪: 

```
text = "Am 14. September 2022 schlug die Kommission eine Verordnung zum Verbot von Produkten, die unter Einsatz von Zwangsarbeit, einschließlich Kinderarbeit, hergestellt wurden, auf dem Binnenmarkt der Europäischen Union (EU) vor. Der Vorschlag bezieht sich auf alle Produkte, die auf dem EU-Markt angeboten werden, unabhängig davon, ob sie in der EU für den Inlandsverbrauch oder für die Ausfuhr hergestellt oder eingeführt werden. Er gilt für Produkte aller Art, einschließlich ihrer Bestandteile, aus allen Sektoren und Branchen. Die EU-Mitgliedstaaten wären für die Durchsetzung der Bestimmungen zuständig, und ihre nationalen Behörden könnten Produkte, die unter Einsatz von Zwangsarbeit hergestellt wurden, vom EU-Markt nehmen. Die Zollbehörden würden solche Produkte an den EU-Grenzen identifizieren und aufhalten. "

goods and services 0.9618138670921326
single market 0.9268659949302673
market approval 0.6425430774688721
export restriction 0.5231644511222839
EU Member State 0.4724983870983124
free movement of goods 0.38777536153793335
electronic commerce 0.31897953152656555
```

In Bulgarian 🇧🇬:

```
text = "В тази кратка бележка се обобщава проучването, в което се оценяват предизвикателствата, възможностите и средносрочните перспективи пред млечния сектор в ЕС в светлината на премахването на квотите за мляко. Проучването се фокусира върху структурните промени в сектора, динамиката на пазара на млечни продукти, необходимостта от екологична устойчивост и устойчивостта на селските райони. Разгледани са и специфичните проблеми на млечните региони в неравностойно положение. Докладът предлага политически препоръки за разглеждане от Европейския парламент с цел ефективно подпомагане на млечното животновъдство и поддържане на селските общности, като същевременно се отговори на изискванията за устойчивост на сектора."

reform of the CAP 0.38253700733184814
milk 0.35211247205734253
milk product 0.2761436402797699
agricultural quota 0.24940797686576843
dairy production 0.2132476419210434
EU Member State 0.09408465027809143
```


## Architecture

![architecture](architecture.png)

This classification model is built on top of [EUBERT](https://huggingface.co/EuropeanParliament/EUBERT), with a classification output covering 7331 Eurovoc labels.

With less than 100 million parameters, it can be deployed on commodity hardware without GPU acceleration (around 200 ms per inference for 2000 characters).

Training parameters:
- Number of epochs: 16
- Batch size: 10
- Max length: 512
- Learning rate: 5e-05

## Usage


```python
from eurovoc import EurovocTagger
model = EurovocTagger.from_pretrained("EuropeanParliament/eurovoc_eu")
```
See also the source code.

## Metrics


Evaluated on the Eurovoc dataset, version 23.08, using stratified 90/10 splits for training/test and for training/validation:


| Metric     | Value      | Threshold Value |
|------------|------------|-----------------|
| Micro F1   | 0.8345     | 0.46            |
| NDCG@3     | 0.8819     | -               |
| NDCG@5     | 0.8689     | -               |
| NDCG@10    | 0.8780     | -               |
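
For readers who want to run their own evaluation, the sketch below shows one common way to compute the two kinds of metric reported above: micro F1 after applying a score threshold, and NDCG@k with binary relevance. It is an illustration only, not the evaluation code used for this model; the helper names (`micro_f1`, `ndcg_at_k`, `labels_above`) and the toy documents and scores are assumptions made for the example.

```python
import math


def micro_f1(true_sets, pred_sets):
    """Micro-averaged F1: pool every (document, label) decision, then compute F1."""
    tp = fp = fn = 0
    for true, pred in zip(true_sets, pred_sets):
        tp += len(true & pred)   # labels predicted and correct
        fp += len(pred - true)   # labels predicted but wrong
        fn += len(true - pred)   # gold labels that were missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def ndcg_at_k(true_labels, ranked_labels, k):
    """NDCG@k with binary relevance: a label is either in the gold set or not."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, label in enumerate(ranked_labels[:k])
        if label in true_labels
    )
    ideal_hits = min(k, len(true_labels))
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg else 0.0


def labels_above(scores, threshold):
    """Turn per-label scores into a predicted label set."""
    return {label for label, score in scores.items() if score >= threshold}


# Toy data (hypothetical, not taken from the evaluation set).
gold = [{"human rights", "Burma/Myanmar"}, {"rail network"}]
scores = [
    {"human rights": 0.98, "Burma/Myanmar": 0.40, "ethnic group": 0.20},
    {"rail network": 0.99, "transport market": 0.60},
]

predicted = [labels_above(s, threshold=0.46) for s in scores]
f1 = micro_f1(gold, predicted)

ranked = [sorted(s, key=s.get, reverse=True) for s in scores]
ndcg3 = sum(ndcg_at_k(g, r, 3) for g, r in zip(gold, ranked)) / len(gold)

print(f"micro F1 = {f1:.4f}, NDCG@3 = {ndcg3:.4f}")
```

Micro F1 depends on the threshold because the threshold decides which labels count as predicted, whereas NDCG@k only looks at the ranking of the scores, which is why the table lists no threshold for it.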

These values are higher than the previously published state of the art in the field; see the following publications:

- Ilias Chalkidis, Emmanouil Fergadiotis, Prodromos Malakasiotis, Nikolaos Aletras, and Ion Androutsopoulos. 2019. [Extreme Multi-Label Legal Text Classification: A Case Study in EU Legislation](https://arxiv.org/abs/1905.10892). In Proceedings of the Natural Legal Language Processing Workshop 2019, pages 78–87, Minneapolis, Minnesota. Association for Computational Linguistics.
- I. Chalkidis, M. Fergadiotis, P. Malakasiotis, and I. Androutsopoulos. 2019. [Large-Scale Multi-Label Text Classification on EU Legislation](https://arxiv.org/abs/1906.02192). In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019), short papers, Florence, Italy.
- Andrei-Marius Avram, Vasile Pais, and Dan Ioan Tufis. 2021. [PyEuroVoc: A Tool for Multilingual Legal Document Classification with EuroVoc Descriptors](https://arxiv.org/abs/2108.01139). In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), pages 92–101, Held Online. INCOMA Ltd.
- Zein Shaheen, Gerhard Wohlgenannt, and Erwin Filtz. [Large Scale Legal Text Classification Using Transformer Models](https://arxiv.org/pdf/2010.12871.pdf).


These results make this model the de facto new reference in the domain.
As the model is open, we encourage you to carry out your own evaluations and share them on the [discussion forum](https://huggingface.co/EuropeanParliament/eurovoc_eu/discussions).


## Inference Endpoint


### Payload example 

```json 
{
  "inputs": "The Union condemns the continuing grave human rights violations by the Myanmar armed forces, including torture, sexual and gender-based violence, the persecution of civil society actors, human rights defenders and journalists, and attacks on the civilian population, including ethnic and religious minorities. ",
  "topk": 10,
  "threshold": 0.16
}

```

Result:

```json
{"results": [{"label": "international sanctions", "score": 0.9994925260543823},
             {"label": "economic sanctions", "score": 0.9991770386695862},
             {"label": "natural person", "score": 0.9591936469078064},
             {"label": "EU restrictive measure", "score": 0.8388392329216003},
             {"label": "legal person", "score": 0.45630475878715515},
             {"label": "Burma/Myanmar", "score": 0.43375277519226074}]}
```

Only six results are returned, because the score of the next label is below the 0.16 threshold.

Default values: topk = 5 and threshold = 0.16.
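
The interaction between `topk` and `threshold` can be sketched as follows. This is a minimal client-side illustration, assuming that labels are ranked by score, that labels scoring below `threshold` are dropped, and that at most `topk` labels are kept; it is not the endpoint's actual implementation. The helpers `build_payload` and `select_labels` are hypothetical names, whether the threshold comparison is strict is an assumption, and the last entry in `scores` is an invented low-scoring label added only to show the cut-off.

```python
import json


def build_payload(text, topk=5, threshold=0.16):
    """Serialize a request body with the fields shown in the payload example."""
    return json.dumps({"inputs": text, "topk": topk, "threshold": threshold})


def select_labels(scores, topk=5, threshold=0.16):
    """Rank labels by score, drop those below the threshold, keep at most topk."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    kept = [(label, score) for label, score in ranked if score >= threshold]
    return [{"label": label, "score": score} for label, score in kept[:topk]]


# Scores copied from the result above, plus one hypothetical label below 0.16.
scores = {
    "international sanctions": 0.9994925260543823,
    "economic sanctions": 0.9991770386695862,
    "natural person": 0.9591936469078064,
    "EU restrictive measure": 0.8388392329216003,
    "legal person": 0.45630475878715515,
    "Burma/Myanmar": 0.43375277519226074,
    "hypothetical label": 0.10,  # invented for illustration only
}

payload = build_payload("The Union condemns ...", topk=10, threshold=0.16)
with_topk_10 = select_labels(scores, topk=10, threshold=0.16)
with_defaults = select_labels(scores)

print(payload)
print(len(with_topk_10), "labels with topk=10;", len(with_defaults), "with the defaults")
```

With `topk` set to 10 the threshold is the binding limit, so fewer than ten labels come back; with the default `topk` of 5 the list is cut by `topk` first.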


## Author(s)

Sébastien Campion <sebastien.campion@europarl.europa.eu>