|
--- |
|
language: |
|
- en |
|
- es |
|
- ja |
|
- el |
|
widget: |
|
- text: It is great to see athletes promoting awareness for climate change. |
|
datasets: |
|
- cardiffnlp/tweet_topic_multi |
|
- cardiffnlp/tweet_topic_multilingual |
|
license: mit |
|
metrics: |
|
- f1 |
|
pipeline_tag: text-classification |
|
--- |
|
|
|
# tweet-topic-base-multilingual |
|
|
|
This model is based on the [cardiffnlp/twitter-xlm-roberta-base](https://huggingface.co/cardiffnlp/twitter-xlm-roberta-base) language model, trained on ~198M multilingual tweets and fine-tuned for multi-label topic classification in English, Spanish, Japanese, and Greek.
|
|
|
The model is trained on the [TweetTopic](https://huggingface.co/datasets/cardiffnlp/tweet_topic_multi) and [X-Topic](https://huggingface.co/datasets/cardiffnlp/tweet_topic_multilingual) datasets (see the main [EMNLP 2024 reference paper](https://arxiv.org/abs/2410.03075)).
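
Both training datasets are hosted on the Hugging Face Hub and can be inspected directly; a minimal sketch, assuming the `datasets` library is installed (the per-language config name `"en"` for X-Topic is an assumption, check the dataset card for the exact configs):

```python
from datasets import load_dataset

# TweetTopic (English) multi-label topic dataset
tweet_topic = load_dataset("cardiffnlp/tweet_topic_multi")

# X-Topic (multilingual); the "en" config name is an assumption
x_topic = load_dataset("cardiffnlp/tweet_topic_multilingual", "en")

print(tweet_topic)
print(x_topic)
```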
|
|
|
|
|
|
|
<b>Labels</b>: |
|
|
|
|
|
| <span style="font-weight:normal">0: arts_&_culture</span> | <span style="font-weight:normal">5: fashion_&_style</span> | <span style="font-weight:normal">10: learning_&_educational</span> | <span style="font-weight:normal">15: science_&_technology</span> | |
|
|-----------------------------|---------------------|----------------------------|--------------------------| |
|
| 1: business_&_entrepreneurs | 6: film_tv_&_video | 11: music | 16: sports | |
|
| 2: celebrity_&_pop_culture | 7: fitness_&_health | 12: news_&_social_concern | 17: travel_&_adventure | |
|
| 3: diaries_&_daily_life | 8: food_&_dining | 13: other_hobbies | 18: youth_&_student_life | |
|
| 4: family | 9: gaming | 14: relationships | | |
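
The same id-to-label mapping is stored in the model configuration, so it can be recovered programmatically rather than copied from the table above; a minimal sketch:

```python
from transformers import AutoConfig

# Load only the configuration to read the label mapping
config = AutoConfig.from_pretrained("cardiffnlp/tweet-topic-base-multilingual")
for idx, label in sorted(config.id2label.items()):
    print(idx, label)
```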
|
|
|
|
|
## Full classification example |
|
|
|
```python |
|
from transformers import AutoModelForSequenceClassification, TFAutoModelForSequenceClassification |
|
from transformers import AutoTokenizer |
|
import numpy as np |
|
from scipy.special import expit |
|
|
|
|
|
MODEL = "cardiffnlp/tweet-topic-base-multilingual"
|
tokenizer = AutoTokenizer.from_pretrained(MODEL) |
|
|
|
# PT |
|
model = AutoModelForSequenceClassification.from_pretrained(MODEL) |
|
class_mapping = model.config.id2label |
|
|
|
text = "It is great to see athletes promoting awareness for climate change." |
|
tokens = tokenizer(text, return_tensors='pt') |
|
output = model(**tokens) |
|
|
|
scores = output[0][0].detach().numpy() |
|
scores = expit(scores) |
|
predictions = (scores >= 0.5) * 1 |
|
|
|
|
|
# TF |
|
#tf_model = TFAutoModelForSequenceClassification.from_pretrained(MODEL) |
|
#class_mapping = tf_model.config.id2label |
|
#text = "It is great to see athletes promoting awareness for climate change." |
|
#tokens = tokenizer(text, return_tensors='tf') |
|
#output = tf_model(**tokens) |
|
#scores = output[0][0] |
|
#scores = expit(scores) |
|
#predictions = (scores >= 0.5) * 1 |
|
|
|
# Map to classes |
|
for i in range(len(predictions)): |
|
if predictions[i]: |
|
print(class_mapping[i]) |
|
|
|
``` |
|
Output: |
|
|
|
``` |
|
news_&_social_concern |
|
sports |
|
``` |
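
Alternatively, the model can be used through the `transformers` pipeline API; a minimal sketch that mirrors the 0.5 threshold above (the explicit `function_to_apply="sigmoid"` is an assumption to force independent per-label scores in the multi-label setup):

```python
from transformers import pipeline

pipe = pipeline(
    "text-classification",
    model="cardiffnlp/tweet-topic-base-multilingual",
    top_k=None,                   # return scores for every label
    function_to_apply="sigmoid",  # multi-label: independent sigmoid per label
)

text = "It is great to see athletes promoting awareness for climate change."
scores = pipe(text)[0]

# Keep labels whose score clears the 0.5 threshold used in the example above
for item in scores:
    if item["score"] >= 0.5:
        print(item["label"], round(item["score"], 3))
```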
|
|
|
## Results on X-Topic |
|
| | English | Spanish | Japanese | Greek | |
|
|--------------|---------|---------|----------|-------| |
|
| **Macro-F1** | 55.4 | 48.5 | 50.8 | 41.3 | |
|
| **Micro-F1** | 63.5 | 63.3 | 57.8 | 69.8 | |
|
|
|
|
|
|
|
|
|
|
|
|
|
## BibTeX entry and citation info |
|
|
|
TBA |