antypasd's picture
Update README.md
0f13b53
|
raw
history blame
2.24 kB

tweet-topic-19-single

This is a roBERTa-base model trained on ~90m tweets until the end of 2019 (see here), and finetuned for single-label topic classification on a corpus of 6,997 tweets. The original roBERTa-base model can be found here and the original reference paper is TweetEval. This model is suitable for English.

Labels:

  • 0 -> arts_&_culture;
  • 1 -> business_&_entrepreneurs;
  • 2 -> pop_culture;
  • 3 -> daily_life;
  • 4 -> sports_&_gaming;
  • 5 -> science_&_technology

Full classification example

from transformers import AutoModelForSequenceClassification, TFAutoModelForSequenceClassification
from transformers import AutoTokenizer
import numpy as np
from scipy.special import softmax

    
MODEL = f"cardiffnlp/tweet-topic-19-single"
tokenizer = AutoTokenizer.from_pretrained(MODEL)

# PT
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
class_mapping = model.config.id2label

text = "Tesla stock is on the rise!"
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)

scores = output[0][0].detach().numpy()
scores = softmax(scores)

# TF
#model = TFAutoModelForSequenceClassification.from_pretrained(MODEL)
#class_mapping = model.config.id2label
#text = "Tesla stock is on the rise!"
#encoded_input = tokenizer(text, return_tensors='tf')
#output = model(**encoded_input)
#scores = output[0][0]
#scores = softmax(scores)


ranking = np.argsort(scores)
ranking = ranking[::-1]
for i in range(scores.shape[0]):
    l = class_mapping[ranking[i]]
    s = scores[ranking[i]]
    print(f"{i+1}) {l} {np.round(float(s), 4)}")

Output:

1) business_&_entrepreneurs 0.8575
2) science_&_technology 0.0604
3) pop_culture 0.0295
4) daily_life 0.0217
5) sports_&_gaming 0.0154
6) arts_&_culture 0.0154