|
--- |
|
tags: |
|
- sentence-transformers |
|
- transformers |
|
- SetFit |
|
- News |
|
datasets:

- KnutJaegersberg/News_topics_IPTC_codes_long
|
pipeline_tag: text-classification |
|
--- |
|
|
|
|
|
# IPTC topic classifier (multilingual) |
|
|
|
A SetFit model fit on 166 downsampled multilingual IPTC Subject labels, whose lowest-level keyword descriptions were concatenated into artificial sentences, to predict the mid-level news categories.

The purpose of this classifier is to support corpus exploration as a weak labeler, since the representations of these label descriptions are only approximations of real documents from those topics.
|
The dataset I used to train the model is based on this file: |
|
https://huggingface.co/datasets/KnutJaegersberg/News_topics_IPTC_codes_long |
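
A minimal sketch of how a SetFit model like this could be trained on that dataset is shown below. The base sentence-transformers checkpoint, the split name, the column names, and the SetFitTrainer settings are assumptions for illustration, not the exact recipe used for this model.

```python
from datasets import load_dataset
from sentence_transformers.losses import CosineSimilarityLoss
from setfit import SetFitModel, SetFitTrainer

# Load the keyword-sentence dataset (split name is an assumption).
dataset = load_dataset("KnutJaegersberg/News_topics_IPTC_codes_long", split="train")
dataset = dataset.train_test_split(test_size=0.1, seed=42)

# Any multilingual sentence-transformers checkpoint could serve as the base.
model = SetFitModel.from_pretrained(
    "sentence-transformers/paraphrase-multilingual-mpnet-base-v2"
)

# SetFitTrainer expects "text" and "label" columns; rename or pass column_mapping if they differ.
trainer = SetFitTrainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    loss_class=CosineSimilarityLoss,
    metric="accuracy",
    num_iterations=20,  # number of contrastive pairs generated per sample
)
trainer.train()
print(trainer.evaluate())
```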
|
|
|
Evaluation results:

- Accuracy on the highest-level labels: 0.9779412

- Accuracy / F1 / MCC on the mid-level labels: 0.6992481 / 0.6666667 / 0.6992617
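
For reference, metrics like these can be computed with scikit-learn roughly as follows; the label arrays are placeholders and the macro averaging for F1 is an assumption, since the exact evaluation script is not part of this card.

```python
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

# Placeholder arrays of mid-level label ids (true vs. predicted).
y_true = [101, 102, 102, 103, 101]
y_pred = [101, 102, 103, 103, 101]

print("accuracy:", accuracy_score(y_true, y_pred))
print("F1 (macro, an assumption):", f1_score(y_true, y_pred, average="macro"))
print("MCC:", matthews_corrcoef(y_true, y_pred))
```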
|
|
|
More interestingly, I used the Kaggle dataset of Huffington Post headlines and manually selected 15 overlapping high-level categories to evaluate performance.
|
https://www.kaggle.com/datasets/rmisra/news-category-dataset |
|
|
|
While an MCC of 0.1968043 on this dataset does not sound as good as before, the mistakes can often be read as re-interpretations: for example, news on arrests was categorized as entertainment in the Huffington Post dataset, while the classifier put it into the crime category.
|
My current impression is that the system is useful for its intended purpose.
|
|
|
|
|
|
|
The numeric categories can be joined with the labels by using this table: |
|
https://huggingface.co/datasets/KnutJaegersberg/IPTC-topic-classifier-labels |
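
A quick sketch of how predicted numeric codes might be mapped to readable label names using that table; the split name and the column names ("topic_id", "topic_label") are hypothetical, so check the dataset card for the actual schema.

```python
from datasets import load_dataset
from setfit import SetFitModel

model = SetFitModel.from_pretrained("KnutJaegersberg/IPTC-classifier-ml")

# Column names "topic_id" and "topic_label" are assumptions; adjust to the actual table.
label_table = load_dataset("KnutJaegersberg/IPTC-topic-classifier-labels", split="train")
id2label = {row["topic_id"]: row["topic_label"] for row in label_table}

preds = model(["Police arrest suspect after downtown robbery"])
print([id2label.get(int(p), p) for p in preds])
```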
|
|
|
|
|
It looks like the Hugging Face inference widget to the right does not yet handle SetFit models; I can't do anything about that.
|
|
|
|
|
Use it like any other SetFit model:
|
|
|
```python
from setfit import SetFitModel

# Download from Hub
model = SetFitModel.from_pretrained("KnutJaegersberg/IPTC-classifier-ml")

# Run inference
preds = model([
    "Rachel Dolezal Faces Felony Charges For Welfare Fraud",
    "Elon Musk just got lucky",
    "The hype on AI is different from the hype on other tech topics",
])
```