This is a RoBERTa-base model pre-trained on 168.86M tweets posted up to the end of September 2022 and fine-tuned for single-label topic classification on a corpus of 6,997 tweets. The original RoBERTa-base model can be found here. The model is intended for English text.
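Labels (six classes): arts_&_culture, business_&_entrepreneurs, daily_life, pop_culture, science_&_technology, sports_&_gaming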
from transformers import AutoModelForSequenceClassification, TFAutoModelForSequenceClassification
from transformers import AutoTokenizer
import numpy as np
from scipy.special import softmax

# Hugging Face Hub ID of the single-label topic classifier
MODEL = "cardiffnlp/tweet-topic-latest-single"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
# PyTorch
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
class_mapping = model.config.id2label  # maps class index -> label name
text = "Tesla stock is on the rise!"
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
# output[0] holds the logits; take the first (and only) row of the batch
scores = output[0][0].detach().numpy()
scores = softmax(scores)  # convert logits to probabilities
# TensorFlow (uncomment to use instead of the PyTorch code above)
#model = TFAutoModelForSequenceClassification.from_pretrained(MODEL)
#class_mapping = model.config.id2label
#text = "Tesla stock is on the rise!"
#encoded_input = tokenizer(text, return_tensors='tf')
#output = model(**encoded_input)
#scores = output[0][0].numpy()
#scores = softmax(scores)
# Rank class indices from highest to lowest probability
ranking = np.argsort(scores)
ranking = ranking[::-1]
for i in range(scores.shape[0]):
    label = class_mapping[ranking[i]]
    score = scores[ranking[i]]
    print(f"{i+1}) {label} {np.round(float(score), 4)}")
Output:
1) business_&_entrepreneurs 0.8929
2) sports_&_gaming 0.0478
3) science_&_technology 0.0185
4) daily_life 0.0178
5) arts_&_culture 0.0128
6) pop_culture 0.0102
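The post-processing in the example (softmax over the logits, then sorting by descending probability) can be factored into a small helper that does not depend on the framework used for inference. The sketch below is an illustration only: `rank_labels` is a hypothetical name rather than part of `transformers`, the logits and the index-to-label order in the usage lines are made-up values, and only NumPy is assumed.

```python
import numpy as np


def rank_labels(logits, id2label):
    """Return (label, probability) pairs sorted from most to least likely.

    `logits` is a 1-D sequence of raw scores for one input text and
    `id2label` maps class index to label name, as in `model.config.id2label`.
    """
    logits = np.asarray(logits, dtype=np.float64)
    # Subtract the maximum before exponentiating for numerical stability.
    shifted = logits - logits.max()
    probs = np.exp(shifted) / np.exp(shifted).sum()
    # argsort is ascending, so reverse it to put the highest score first.
    order = np.argsort(probs)[::-1]
    return [(id2label[int(i)], float(probs[i])) for i in order]


# Made-up logits and label order, for illustration only.
example_id2label = {0: "arts_&_culture", 1: "business_&_entrepreneurs", 2: "daily_life"}
example_logits = [0.1, 3.2, -0.5]

for rank, (label, prob) in enumerate(rank_labels(example_logits, example_id2label), start=1):
    print(f"{rank}) {label} {prob:.4f}")
```

With a real model, the helper would presumably be called with the logits row from the example above and `model.config.id2label`; converting each index with `int()` keeps the dictionary lookup independent of NumPy integer types.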