---
license: bigscience-openrail-m
widget:
- text: >-
    We will restore funding to the Global Environment Facility and the
    Intergovernmental Panel on Climate Change.
---

## Model description
An xlm-roberta-large model fine-tuned on ~1.6 million annotated statements from the [Manifesto Corpus](https://manifesto-project.wzb.eu/information/documents/corpus) (version 2023a).
The model can be used to categorize any type of text into 56 different political topics according to the Manifesto Project's coding scheme ([Handbook 4](https://manifesto-project.wzb.eu/coding_schemes/mp_v4)).
It works for all languages the xlm-roberta model is pretrained on ([overview](https://github.com/facebookresearch/fairseq/tree/main/examples/xlmr#introduction)), but it will perform best for the 38 languages contained in the Manifesto Corpus:

||||||
|------|------|------|------|------|
|armenian|bosnian|bulgarian|catalan|croatian|
|czech|danish|dutch|english|estonian|
|finnish|french|galician|georgian|german|
|greek|hebrew|hungarian|icelandic|italian|
|japanese|korean|latvian|lithuanian|macedonian|
|montenegrin|norwegian|polish|portuguese|romanian|
|russian|serbian|slovak|slovenian|spanish|
|swedish|turkish|ukrainian| | |

## How to use

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("manifesto-project/manifestoberta-xlm-roberta-56policy-topics-sentence-2023-1-1")
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")

sentence = "We will restore funding to the Global Environment Facility and the Intergovernmental Panel on Climate Change, to support critical climate science research around the world"

inputs = tokenizer(sentence,
                   return_tensors="pt",
                   max_length=200,  # input was limited to 200 tokens during fine-tuning
                   padding="max_length",
                   truncation=True
                   )

logits = model(**inputs).logits

probabilities = torch.softmax(logits, dim=1).tolist()[0]
probabilities = {model.config.id2label[index]: round(probability * 100, 2) for index, probability in enumerate(probabilities)}
probabilities = dict(sorted(probabilities.items(), key=lambda item: item[1], reverse=True))
print(probabilities)
# {'501 - Environmental Protection: Positive': 67.28, '411 - Technology and Infrastructure': 15.19, '107 - Internationalism: Positive': 13.63, '416 - Anti-Growth Economy: Positive': 2.02...

predicted_class = model.config.id2label[logits.argmax().item()]
print(predicted_class)
# 501 - Environmental Protection: Positive
```
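
If only the few most likely topics are of interest, the ranking step can be factored into a small helper. The following is a minimal sketch in plain Python: `softmax` and `top_k_labels` are hypothetical names that are not part of the model or of `transformers`, and the toy logits and the three-entry label map are made up for illustration and are not real model output.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_labels(logits, id2label, k=3):
    """Return the k most probable (label, percent) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return [(id2label[i], round(p * 100, 2)) for i, p in ranked[:k]]

# Toy values for illustration only; NOT real model output.
toy_id2label = {
    0: "107 - Internationalism: Positive",
    1: "411 - Technology and Infrastructure",
    2: "501 - Environmental Protection: Positive",
}
toy_logits = [1.0, 1.1, 2.6]

for label, percent in top_k_labels(toy_logits, toy_id2label, k=2):
    print(f"{percent:6.2f}%  {label}")
```

With the real model, the same helper should work on `logits[0].tolist()` together with `model.config.id2label`.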


## Model Performance

The model was evaluated on a test set of 199,046 annotated manifesto statements.

### Overall

| Model | Accuracy | Top2_Acc | Top3_Acc | Precision | Recall | F1_Macro | MCC | Cross-Entropy |
|-------|:--------:|:--------:|:--------:|:---------:|:------:|:--------:|:---:|:-------------:|
| [Sentence Model](https://huggingface.co/manifesto-project/manifestoberta-xlm-roberta-56policy-topics-sentence-2023-1-1) | 0.57 | 0.73 | 0.81 | 0.49 | 0.43 | 0.45 | 0.55 | 1.5 |
| [Context Model](https://huggingface.co/manifesto-project/manifestoberta-xlm-roberta-56policy-topics-context-2023-1-1) | 0.64 | 0.81 | 0.88 | 0.54 | 0.52 | 0.53 | 0.62 | 1.15 |
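
The Top2_Acc and Top3_Acc columns are not defined in this card. Assuming they follow the usual top-k convention (a prediction counts as correct when the gold label is among the k highest-scoring classes), the computation can be sketched as below. `top_k_accuracy` is a hypothetical helper and the scores and gold labels are toy data, not taken from the evaluation set.

```python
def top_k_accuracy(score_rows, gold_ids, k):
    """Share of rows whose gold class index is among the k highest scores."""
    hits = 0
    for scores, gold in zip(score_rows, gold_ids):
        # Class indices ordered from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if gold in ranked[:k]:
            hits += 1
    return hits / len(gold_ids)

# Toy data for illustration only: 4 statements, 3 classes.
toy_scores = [
    [0.70, 0.20, 0.10],
    [0.35, 0.45, 0.20],
    [0.20, 0.30, 0.50],
    [0.50, 0.10, 0.40],
]
toy_gold = [0, 0, 1, 1]

for k in (1, 2, 3):
    print(f"top-{k} accuracy: {top_k_accuracy(toy_scores, toy_gold, k):.2f}")
```

Under this assumption, plain accuracy is the k=1 case, so the values can only grow or stay equal as k increases, which matches the pattern in the table.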