|
--- |
|
tags: |
|
- sentence-transformers |
|
- transformers |
|
- SetFit |
|
- News |
|
datasets:

- KnutJaegersberg/News_topics_IPTC_codes_long
|
pipeline_tag: text-classification |
|
--- |
|
|
|
|
|
# IPTC topic classifier (multilingual) |
|
|
|
A SetFit model fit on 166 downsampled multilingual IPTC Subject labels, whose lowest-level keyword descriptions were concatenated into artificial sentences, to predict the mid-level news categories.

The purpose of this classifier is to support corpus exploration as a weak labeler, since the representations of these label descriptions are only approximations of real documents from those topics.
|
The dataset I used to train the model is based on this file: |
|
https://huggingface.co/datasets/KnutJaegersberg/News_topics_IPTC_codes_long |
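
A minimal sketch of how a SetFit model like this could be trained on that dataset is shown below. The base sentence-transformers checkpoint, the split name, the column names, and the SetFitTrainer settings are assumptions for illustration, not the exact recipe used for this model.

```python
from datasets import load_dataset
from sentence_transformers.losses import CosineSimilarityLoss
from setfit import SetFitModel, SetFitTrainer

# Load the keyword-sentence dataset (split name is an assumption).
dataset = load_dataset("KnutJaegersberg/News_topics_IPTC_codes_long", split="train")
dataset = dataset.train_test_split(test_size=0.1, seed=42)

# Any multilingual sentence-transformers checkpoint could serve as the base.
model = SetFitModel.from_pretrained(
    "sentence-transformers/paraphrase-multilingual-mpnet-base-v2"
)

# SetFitTrainer expects "text" and "label" columns; rename or pass column_mapping if they differ.
trainer = SetFitTrainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    loss_class=CosineSimilarityLoss,
    metric="accuracy",
    num_iterations=20,  # number of contrastive pairs generated per sample
)
trainer.train()
print(trainer.evaluate())
```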
|
|
|
Evaluation results:

- Accuracy on the highest-level labels: 0.9779412

- Accuracy / F1 / MCC on the mid-level labels: 0.6992481 / 0.6666667 / 0.6992617
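
For reference, metrics like these can be computed with scikit-learn roughly as follows; the label arrays are placeholders and the macro averaging for F1 is an assumption, since the exact evaluation script is not part of this card.

```python
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

# Placeholder arrays of mid-level label ids (true vs. predicted).
y_true = [101, 102, 102, 103, 101]
y_pred = [101, 102, 103, 103, 101]

print("accuracy:", accuracy_score(y_true, y_pred))
print("F1 (macro, an assumption):", f1_score(y_true, y_pred, average="macro"))
print("MCC:", matthews_corrcoef(y_true, y_pred))
```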
|
|
|
More interestingly, I used the Kaggle dataset of Huffington Post headlines and manually selected 15 overlapping high-level categories to evaluate performance.
|
https://www.kaggle.com/datasets/rmisra/news-category-dataset |
|
|
|
While an MCC of 0.1968043 on this dataset does not sound as good as before, the mistakes can often be read as re-interpretations: for example, news on arrests was categorized as entertainment in the Huffington Post dataset, while the classifier put it into the crime category.
|
My current impression is that the system is useful for its intended purpose.
|
|
|
|
|
|
|
The numeric categories can be joined with the labels by using this table: |
|
https://huggingface.co/datasets/KnutJaegersberg/IPTC-topic-classifier-labels |
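
A quick sketch of how predicted numeric codes might be mapped to readable label names using that table; the split name and the column names ("topic_id", "topic_label") are hypothetical, so check the dataset card for the actual schema.

```python
from datasets import load_dataset
from setfit import SetFitModel

model = SetFitModel.from_pretrained("KnutJaegersberg/IPTC-classifier-ml")

# Column names "topic_id" and "topic_label" are assumptions; adjust to the actual table.
label_table = load_dataset("KnutJaegersberg/IPTC-topic-classifier-labels", split="train")
id2label = {row["topic_id"]: row["topic_label"] for row in label_table}

preds = model(["Police arrest suspect after downtown robbery"])
print([id2label.get(int(p), p) for p in preds])
```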
|
|
|
|
|
It looks like the Hugging Face inference widget to the right does not yet handle SetFit models; I can't do anything about that.
|
|
|
|
|
Use it like any other SetFit model:
|
|
|
```python
from setfit import SetFitModel

# Download from Hub
model = SetFitModel.from_pretrained("KnutJaegersberg/IPTC-classifier-ml")

# Run inference
preds = model([
    "Rachel Dolezal Faces Felony Charges For Welfare Fraud",
    "Elon Musk just got lucky",
    "The hype on AI is different from the hype on other tech topics",
])
```