metadata
language:
  - en
  - es
  - ja
  - el
widget:
  - text: It is great to see athletes promoting awareness for climate change.
datasets:
  - cardiffnlp/tweet_topic_multi
  - cardiffnlp/tweet_topic_multilingual
license: mit
metrics:
  - f1
pipeline_tag: text-classification

tweet-topic-base-multilingual

This model is based on the cardiffnlp/twitter-xlm-roberta-base language model, which was trained on ~198M multilingual tweets, and is fine-tuned for multi-label topic classification in English, Spanish, Japanese, and Greek.

The model is trained using the TweetTopic and X-Topic datasets (see the main EMNLP 2024 reference paper).

Labels:

0: arts_&_culture            5: fashion_&_style   10: learning_&_educational  15: science_&_technology
1: business_&_entrepreneurs  6: film_tv_&_video   11: music                   16: sports
2: celebrity_&_pop_culture   7: fitness_&_health  12: news_&_social_concern   17: travel_&_adventure
3: diaries_&_daily_life      8: food_&_dining     13: other_hobbies           18: youth_&_student_life
4: family                    9: gaming            14: relationships

Full classification example

from transformers import AutoModelForSequenceClassification, TFAutoModelForSequenceClassification
from transformers import AutoTokenizer
import numpy as np
from scipy.special import expit

    
MODEL = "cardiffnlp/tweet-topic-base-multilingual"
tokenizer = AutoTokenizer.from_pretrained(MODEL)

# PT
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
class_mapping = model.config.id2label

text = "It is great to see athletes promoting awareness for climate change."
tokens = tokenizer(text, return_tensors='pt')
output = model(**tokens)

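# Take the logits of the first (and only) example as a NumPy array
scores = output[0][0].detach().numpy()
# Multi-label setup: apply a sigmoid to each logit independently
scores = expit(scores)
# Assign every label whose probability is at least 0.5
predictions = (scores >= 0.5) * 1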


# TF
#tf_model = TFAutoModelForSequenceClassification.from_pretrained(MODEL)
#class_mapping = tf_model.config.id2label
#text = "It is great to see athletes promoting awareness for climate change."
#tokens = tokenizer(text, return_tensors='tf')
#output = tf_model(**tokens)
#scores = output[0][0].numpy()
#scores = expit(scores)
#predictions = (scores >= 0.5) * 1

# Map to classes
for i in range(len(predictions)):
  if predictions[i]:
    print(class_mapping[i])

Output:

news_&_social_concern
sports
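
The thresholding step in the example above can also be wrapped in a small helper that returns the probabilities alongside the labels, which may help when tuning the cut-off. The following is a minimal sketch, not part of the original model card: `decode_multilabel`, `toy_id2label`, and `toy_logits` are hypothetical names, and the logits are made-up values rather than real model output. It uses only the standard library, so the sigmoid is written out by hand instead of calling `expit`.

```python
import math


def decode_multilabel(logits, id2label, threshold=0.5):
    """Return (label, probability) pairs whose sigmoid score meets the threshold."""
    selected = []
    for idx, logit in enumerate(logits):
        # Per-label sigmoid, the same role expit plays in the example above
        prob = 1.0 / (1.0 + math.exp(-logit))
        if prob >= threshold:
            selected.append((id2label[idx], prob))
    # Highest-probability topics first
    return sorted(selected, key=lambda pair: pair[1], reverse=True)


# Toy inputs for illustration only (assumed values, not model output)
toy_id2label = {0: "arts_&_culture", 1: "news_&_social_concern", 2: "sports"}
toy_logits = [-2.0, 1.5, 0.3]

result = decode_multilabel(toy_logits, toy_id2label)
for label, prob in result:
    print(f"{label}: {prob:.3f}")
```

With real model output, `logits` would be `output[0][0].detach().tolist()` and `id2label` would be `model.config.id2label`. The 0.5 threshold mirrors the example above; whether a different value works better for a given language is something to check on held-out data.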

Results on X-Topic

Metric    English  Spanish  Japanese  Greek
Macro-F1  55.4     48.5     50.8      41.3
Micro-F1  63.5     63.3     57.8      69.8
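
For readers unfamiliar with the two metrics, the sketch below shows how macro- and micro-averaged F1 are commonly defined for multi-label predictions. It is an illustration under assumptions, not the evaluation script behind the numbers above: the function names and toy matrices are hypothetical, and treating an undefined F1 (no true or predicted positives) as 0 is a convention chosen here.

```python
def f1_from_counts(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    # Assumed convention: F1 is 0 when a label has no positives at all
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(y_true, y_pred):
    """Compute (macro_f1, micro_f1) for binary indicator matrices (lists of rows)."""
    n_labels = len(y_true[0])
    counts = []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        counts.append((tp, fp, fn))
    # Macro: average the per-label F1 scores, so every label weighs the same
    macro = sum(f1_from_counts(*c) for c in counts) / n_labels
    # Micro: pool the counts over all labels, so frequent labels dominate
    micro = f1_from_counts(
        sum(c[0] for c in counts),
        sum(c[1] for c in counts),
        sum(c[2] for c in counts),
    )
    return macro, micro


# Toy data: 3 examples, 3 labels (made-up values)
y_true = [[1, 0, 1], [0, 1, 0], [1, 0, 0]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 1, 0]]

macro, micro = macro_micro_f1(y_true, y_pred)
print(f"Macro-F1: {macro:.4f}  Micro-F1: {micro:.4f}")
```

Because macro-F1 weighs rare and frequent topics equally while micro-F1 is dominated by frequent ones, the two scores can diverge noticeably, as they do in the table above.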

BibTeX entry and citation info

TBA