---
tags:
- sentence-transformers
- transformers
- SetFit
- News
datasets: KnutJaegersberg/News_topics_IPTC_codes_long
pipeline_tag: text-classification
---


# IPTC topic classifier (multilingual)

A SetFit model trained on 166 downsampled multilingual IPTC Subject labels to predict the mid-level news categories. For each label, the keywords of the lowest hierarchy level were concatenated into artificial sentences.
The purpose of this classifier is to support corpus exploration as a weak labeler: the representations of these keyword descriptions only approximate real documents on those topics, so predictions should be read as rough labels.
The dataset I used to train the model is based on this file: 
https://huggingface.co/datasets/KnutJaegersberg/News_topics_IPTC_codes_long

Accuracy on the highest-level labels in eval: 0.9779412

Accuracy / F1 / MCC on the mid-level labels in eval: 0.6992481 / 0.6666667 / 0.6992617
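For readers who want to check figures like these on their own predictions, here is a minimal pure-Python sketch of accuracy and the multiclass Matthews correlation coefficient (MCC). It is not the evaluation script used for this model, and the toy labels are invented for illustration. F1 is left out because the averaging behind the reported number is not stated here.

```python
from collections import Counter
from math import sqrt


def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the reference label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def multiclass_mcc(y_true, y_pred):
    """Multiclass Matthews correlation coefficient, computed from label counts."""
    s = len(y_true)                                    # number of samples
    c = sum(t == p for t, p in zip(y_true, y_pred))    # number of correct predictions
    true_counts = Counter(y_true)                      # how often each class truly occurs
    pred_counts = Counter(y_pred)                      # how often each class is predicted
    labels = set(true_counts) | set(pred_counts)
    cov_tp = c * s - sum(true_counts[k] * pred_counts[k] for k in labels)
    cov_tt = s * s - sum(true_counts[k] ** 2 for k in labels)
    cov_pp = s * s - sum(pred_counts[k] ** 2 for k in labels)
    denom = sqrt(cov_tt * cov_pp)
    # By convention, return 0.0 when the coefficient is undefined.
    return cov_tp / denom if denom else 0.0


# Toy labels, invented for illustration only.
y_true = ["crime", "sport", "crime", "politics"]
y_pred = ["crime", "sport", "politics", "politics"]
print(accuracy(y_true, y_pred), multiclass_mcc(y_true, y_pred))
```

MCC accounts for how often each class occurs and is predicted, which is why it can sit well below accuracy when the label distribution is skewed.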

More interestingly, I took the Kaggle dataset of Huffington Post headlines and manually selected 15 overlapping high-level categories to evaluate performance.
https://www.kaggle.com/datasets/rmisra/news-category-dataset

An MCC of 0.1968043 on this dataset does not sound as good as the figures above, but many of the mistakes can be seen as reinterpretations. For example, news on arrests that was categorized as entertainment in the Huffington Post dataset was put into the crime category by the classifier.
My current impression is that this system is useful for its intended purpose.



The numeric categories can be joined with the labels by using this table: 
https://huggingface.co/datasets/KnutJaegersberg/IPTC-topic-classifier-labels
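As a rough sketch of that join, the snippet below builds an id-to-label lookup from a CSV export and applies it to numeric predictions. The column names `category_id` and `label`, the helper names, and the rows themselves are hypothetical placeholders; check the linked table for its actual schema before using this.

```python
import csv
import io

# Hypothetical excerpt of the label table. The real column names and
# values in the linked dataset may differ.
LABEL_TABLE_CSV = """category_id,label
0,example topic A
1,example topic B
2,example topic C
"""


def load_id_to_label(csv_text, id_column="category_id", label_column="label"):
    """Build a dict that maps numeric category ids to label strings."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {int(row[id_column]): row[label_column] for row in reader}


def attach_labels(predicted_ids, id_to_label, unknown="<unknown>"):
    """Replace each numeric prediction with its label, keeping the input order."""
    return [id_to_label.get(int(i), unknown) for i in predicted_ids]


id_to_label = load_id_to_label(LABEL_TABLE_CSV)
print(attach_labels([1, 0, 7], id_to_label))
```

Ids missing from the table fall back to a placeholder string instead of raising, which keeps a long labeling run from aborting on a single unexpected id.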


It looks like the Hugging Face "try it out" API box to the right does not yet handle SetFit models; I can't do anything about that.


Use it like any other SetFit model:

```python
from setfit import SetFitModel

# Download the model from the Hub
model = SetFitModel.from_pretrained("KnutJaegersberg/IPTC-classifier-ml")

# Run inference on a list of headlines; predictions are numeric category ids
preds = model([
    "Rachel Dolezal Faces Felony Charges For Welfare Fraud",
    "Elon Musk just got lucky",
    "The hype on AI is different from the hype on other tech topics",
])
```
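Since the classifier is meant for exploring corpora as a weak labeler, one simple use is to count predicted categories over a collection of texts. The sketch below assumes the predictor is any callable that takes a list of strings and returns one numeric category id per text. The function `topic_distribution` and the stub predictor are hypothetical helpers for illustration, not part of SetFit.

```python
from collections import Counter


def topic_distribution(texts, predict, batch_size=32):
    """Count predicted category ids over a corpus, one batch at a time."""
    counts = Counter()
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        # Cast to int so tensor or array scalars are counted as plain ids.
        counts.update(int(p) for p in predict(batch))
    return counts


def stub_predict(batch):
    """Stand-in for the real model: maps each text to its length modulo 3."""
    return [len(text) % 3 for text in batch]


headlines = ["a", "bb", "ccc", "dddd"]
print(topic_distribution(headlines, stub_predict, batch_size=2))
```

With the real model loaded, passing `model` in place of `stub_predict` should work in the same way, since the model is called directly on a list of strings above.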