	---
	tags:
	- sklearn
	- text-classification
	language:
	- nl
	metrics:
	- accuracy
	- hamming-loss
	---


	# Model card for NOS Drug-Related Text Classification on Telegram
	The NOS editorial team is investigating drug-related messages on Telegram. Thousands of Telegram messages have been labeled as drug-related or not, along with details about the specific type of drugs and the delivery method. This labeled data is used to train a model that can automatically label millions more messages.

	## Methodology
	First, a logistic regression model was trained for binary classification (drug-related or not). The text was converted to numeric features with scikit-learn's `TfidfVectorizer`, which weights each term by its term frequency-inverse document frequency (TF-IDF). This representation lets the model learn which words are associated with each class. The model achieved 97% accuracy on the test set.
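
As a rough illustration of this first step, the sketch below chains a `TfidfVectorizer` and a `LogisticRegression` in a scikit-learn pipeline. The messages and labels are made up for the example, and default hyperparameters are assumed; the actual training data and settings are not described in this card.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy, made-up messages (assumption): 1 = drug-related ad, 0 = not drug-related.
messages = [
    "coke en xtc te koop, 1 gram 50 euro",
    "wiet en hasj, snelle bezorging",
    "xtc pillen bulk prijs",
    "oude kleding te koop, stuur een berichtje",
    "fiets te koop, ophalen in Utrecht",
    "wij repareren ook schoenen en kleding",
]
is_drug_ad = [1, 1, 1, 0, 0, 0]

# TF-IDF features feed directly into a logistic regression classifier.
binary_clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
binary_clf.fit(messages, is_drug_ad)

# Inspect the TF-IDF matrix: one row per message, one column per vocabulary term.
tfidf = binary_clf.named_steps["tfidfvectorizer"]
X = tfidf.transform(messages)
print(X.shape)

# Predict on a new, unseen message.
prediction = binary_clf.predict(["xtc te koop"])
print(prediction)
```

With only six training messages the prediction itself is not meaningful; the point is the shape of the pipeline.
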
	To handle messages that can carry several labels at once, the model was extended with a `MultiOutputClassifier`. This makes it possible to associate a single message with multiple categories, such as "soft drugs", "hard drugs", and "medicines". The labels were one-hot encoded, so each label becomes its own binary (0/1) column.
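
A minimal sketch of the multi-label setup, assuming a `MultiOutputClassifier` wrapped around `LogisticRegression` and fed by TF-IDF features. The toy messages, the three-label subset, and the default hyperparameters are assumptions for illustration, not the configuration used for the published model.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import make_pipeline

# Illustrative subset of labels (assumption); the published model uses more.
label_names = ["softdrugs", "harddrugs", "medicijnen"]

# Toy, made-up messages with one binary indicator column per label.
messages = [
    "coke en xtc te koop",
    "wiet en hasj, snelle bezorging",
    "oxycodon en xanax per post",
    "wiet, coke en xanax op voorraad",
    "oude kleding te koop",
]
Y = np.array([
    [0, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
    [1, 1, 1],
    [0, 0, 0],
])

# MultiOutputClassifier fits one independent classifier per label column.
multi_clf = make_pipeline(
    TfidfVectorizer(),
    MultiOutputClassifier(LogisticRegression()),
)
multi_clf.fit(messages, Y)

# The prediction is again an indicator matrix: one row per message.
pred = multi_clf.predict(["wiet en coke te koop"])
decoded = [name for name, flag in zip(label_names, pred[0]) if flag == 1]
print(pred, decoded)
```

Because each label gets its own classifier, a message can receive any combination of labels, including none.
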
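	Performance was evaluated with Hamming loss, a metric suited to multi-label classification that measures the fraction of individual label assignments that are wrong. The model reached a Hamming loss of 0.04, meaning that on average 96% of the label assignments per message are correct.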
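
To make the metric concrete, the sketch below computes Hamming loss on a small, made-up set of true and predicted label matrices and compares it with exact-match accuracy. The numbers are illustrative only and unrelated to the reported 0.04.

```python
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss

# Made-up indicator matrices: 4 messages, 3 labels each.
y_true = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 1, 0],
])
y_pred = np.array([
    [1, 0, 0],
    [0, 1, 1],  # one extra label predicted
    [0, 0, 1],
    [1, 0, 0],  # one label missed
])

# Hamming loss: fraction of individual label slots that are wrong (2 of 12 here).
loss = hamming_loss(y_true, y_pred)
manual = np.mean(y_true != y_pred)

# Exact-match accuracy is stricter: every label of a message must be correct.
exact_match = accuracy_score(y_true, y_pred)

print(f"Hamming loss: {loss:.3f}")
print(f"Per-label accuracy: {1 - loss:.3f}")
print(f"Exact-match accuracy: {exact_match:.3f}")
```

Per-label accuracy (1 minus the Hamming loss) is more lenient than exact-match accuracy, because a message with one wrong label still earns credit for its correct ones.
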

	### Tools used to train the model
	- Python
	- scikit-learn
	- pandas
	- numpy

	### How to Get Started with the Model

	Use the code below to get started with the model.

	```python
	from joblib import load

	# load the model
	clf = load('model.joblib')

	# make some predictions

	text_messages = [
	"""
	Oud kleding te koop! Stuur een berichtje
	We repareren ook!
	""",

	"""
	COKE/XTC
	* 1Gram = €50
	* 5Gram = €230
	"""]

	mapping = {0:"bezorging", 1:"bulk", 2:"designer", 3:"drugsad", 4:"geendrugsad", 5:"harddrugs", 6:"medicijnen", 7: "pickup", 8: "post", 9:"softdrugs"}

	labels = []

	for message in clf.predict(text_messages):
	    label = []
	    for idx, labeled in enumerate(message):
	        if labeled == 1:
	            label.append(mapping[idx])
	    labels.append(label)

	print(labels)

	```

	## Details
	- Shared by Dutch Public Broadcasting Foundation (NOS)
	- Model type: text-classification
	- Language: Dutch
	- License: Creative Commons Attribution Non Commercial No Derivatives 4.0