metadata

license: bigscience-openrail-m
widget:
  - text: >-
      We will restore funding to the Global Environment Facility and the
      Intergovernmental Panel on Climate Change.

Model description

An xlm-roberta-large model fine-tuned on ~1.7 million annotated statements from the Manifesto Corpus (version 2024a). The model can be used to categorize any type of text into 56 political topics according to the Manifesto Project's coding scheme (Handbook 4). It works for all languages that xlm-roberta was pretrained on, but it will perform best for the 38 languages contained in the Manifesto Corpus:


armenian	bosnian	bulgarian	catalan	croatian
czech	danish	dutch	english	estonian
finnish	french	galician	georgian	german
greek	hebrew	hungarian	icelandic	italian
japanese	korean	latvian	lithuanian	macedonian
montenegrin	norwegian	polish	portuguese	romanian
russian	serbian	slovak	slovenian	spanish
swedish	turkish	ukrainian

How to use

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("manifesto-project/manifestoberta-xlm-roberta-56policy-topics-sentence-2024-1-1")
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")

sentence = "We will restore funding to the Global Environment Facility and the Intergovernmental Panel on Climate Change, to support critical climate science research around the world"

inputs = tokenizer(sentence,
                   return_tensors="pt",
                   max_length=200,  # inputs were limited to 200 tokens during fine-tuning
                   padding="max_length",
                   truncation=True
                   )

# Inference only, so no gradients need to be tracked
with torch.no_grad():
    logits = model(**inputs).logits

probabilities = torch.softmax(logits, dim=1).tolist()[0]
probabilities = {model.config.id2label[index]: round(probability * 100, 2) for index, probability in enumerate(probabilities)}
probabilities = dict(sorted(probabilities.items(), key=lambda item: item[1], reverse=True))
print(probabilities)
# {'501 - Environmental Protection: Positive': 67.56, '411 - Technology and Infrastructure': 14.03, '107 - Internationalism: Positive': 13.58, '416 - Anti-Growth Economy: Positive': 2.24...

predicted_class = model.config.id2label[logits.argmax().item()]
print(predicted_class)
# 501 - Environmental Protection: Positive
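
When only the few most likely categories are needed, the full probability dictionary can be reduced to a short ranked list. The following is a minimal sketch and not part of the original model card: `top_k_topics` is a hypothetical helper that takes one row of raw logits as a plain list and an `id2label` mapping, applies a softmax in pure Python, and splits each label into its numeric code and name. It assumes labels follow the `'501 - Environmental Protection: Positive'` pattern seen in the example output. The logits and the three-entry mapping in the demo are made up for illustration.

```python
import math


def top_k_topics(logits, id2label, k=3):
    """Return the k most probable topics as (code, name, probability) tuples.

    Hypothetical helper: `logits` is one row of raw model outputs as a plain
    list of floats, `id2label` maps class indices to label strings.
    """
    # Numerically stable softmax: subtract the maximum before exponentiating.
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    probabilities = [value / total for value in exps]

    # Rank class indices by probability, highest first.
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: probabilities[i],
                    reverse=True)

    results = []
    for index in ranked[:k]:
        # Assumes labels look like "<code> - <name>"; otherwise the code is empty.
        code, separator, name = id2label[index].partition(" - ")
        if not separator:
            code, name = "", id2label[index]
        results.append((code, name, probabilities[index]))
    return results


# Made-up logits and a three-entry mapping, for illustration only.
demo_id2label = {
    0: "107 - Internationalism: Positive",
    1: "411 - Technology and Infrastructure",
    2: "501 - Environmental Protection: Positive",
}
demo_logits = [1.0, 0.5, 3.0]

for code, name, probability in top_k_topics(demo_logits, demo_id2label, k=2):
    print(f"{code}\t{name}\t{probability:.2%}")
```

With the model loaded as in the example above, the call would be `top_k_topics(logits[0].tolist(), model.config.id2label, k=3)`.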

Model Performance

The model was evaluated on a test set of 200,920 annotated manifesto statements.

Overall

	Accuracy	Top2_Acc	Top3_Acc	Precision	Recall	F1_Macro	MCC	Cross-Entropy
Sentence Model	0.57	0.73	0.81	0.48	0.43	0.45	0.55	1.47
Context Model	0.64	0.81	0.88	0.55	0.52	0.53	0.63	1.15
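
Top2_Acc and Top3_Acc presumably count a prediction as correct when the true category is among the two or three highest-scoring categories. The sketch below illustrates that common definition. It is an assumption about how the columns are defined, not the authors' evaluation code, and `top_k_accuracy` as well as the toy scores are hypothetical.

```python
def top_k_accuracy(scores, gold, k):
    """Share of examples whose gold class index is among the k highest scores.

    `scores` is a list of per-example score lists (logits or probabilities),
    `gold` is the matching list of true class indices.
    """
    if len(scores) != len(gold):
        raise ValueError("scores and gold must have the same length")
    if not scores:
        raise ValueError("need at least one example")

    hits = 0
    for row, label in zip(scores, gold):
        # Indices of the k largest scores in this row.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += label in top_k
    return hits / len(scores)


# Toy data: four examples, three classes (made up for illustration).
toy_scores = [
    [0.7, 0.2, 0.1],  # gold class 0 is ranked first
    [0.5, 0.3, 0.2],  # gold class 1 is ranked second
    [0.6, 0.3, 0.1],  # gold class 2 is ranked third
    [0.1, 0.2, 0.7],  # gold class 2 is ranked first
]
toy_gold = [0, 1, 2, 2]

print(top_k_accuracy(toy_scores, toy_gold, k=1))  # 0.5
print(top_k_accuracy(toy_scores, toy_gold, k=2))  # 0.75
print(top_k_accuracy(toy_scores, toy_gold, k=3))  # 1.0
```

Under this definition top-k accuracy can only grow with k, which matches the ordering of Accuracy, Top2_Acc, and Top3_Acc in the table.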

Citation

Please cite the model as follows:

Burst, Tobias / Lehmann, Pola / Franzmann, Simon / Al-Gaddooa, Denise / Ivanusch, Christoph / Regel, Sven / Riethmüller, Felicia / Weßels, Bernhard / Zehnter, Lisa (2024): manifestoberta. Version 56topics.sentence.2024.1.1. Berlin: Wissenschaftszentrum Berlin für Sozialforschung (WZB) / Göttingen: Institut für Demokratieforschung (IfDem). https://doi.org/10.25522/manifesto.manifestoberta.56topics.sentence.2024.1.1

@misc{Burst:2024,
  Address = {Berlin / Göttingen},
  Author = {Burst, Tobias AND Lehmann, Pola AND Franzmann, Simon AND Al-Gaddooa, Denise AND Ivanusch, Christoph AND Regel, Sven AND Riethmüller, Felicia AND Weßels, Bernhard AND Zehnter, Lisa},
  Publisher = {Wissenschaftszentrum Berlin für Sozialforschung / Göttinger Institut für Demokratieforschung},
  Title = {manifestoberta. Version 56topics.sentence.2024.1.1},
  doi = {10.25522/manifesto.manifestoberta.56topics.sentence.2024.1.1},
  url = {https://doi.org/10.25522/manifesto.manifestoberta.56topics.sentence.2024.1.1},
  Year = {2024},