---
tags:
- sklearn
- text-classification
language:
- nl
metrics:
- accuracy
- hamming-loss
---

# Model card for NOS Drug-Related Text Classification on Telegram

The NOS editorial team is conducting an investigation into drug-related messages on Telegram. Thousands of Telegram messages have been labeled as drug-related content (or not), along with details on the specific type of drugs and the delivery method. These labeled messages are used to train a model that scales up the labeling and automatically labels millions more messages.

## Methodology

First, a logistic regression model was trained for binary classification (drug-related or not). The text was converted to numeric features with scikit-learn's `TfidfVectorizer`, which weights each term by its term frequency-inverse document frequency (TF-IDF). This representation lets the model learn which terms are indicative of each class. The model achieved 97% accuracy on the test set.
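
A minimal sketch of this kind of binary setup is shown below. It is not the NOS training code: the toy messages, the names `train_texts`, `train_labels`, and `binary_clf`, and the default hyperparameters are all assumptions made for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; the real labeled Telegram messages are not reproduced here.
train_texts = [
    "coke en xtc te koop, 1 gram 50 euro",
    "wiet bezorging vanavond mogelijk",
    "oude kleding te koop, stuur een berichtje",
    "fiets te koop, ophalen in Utrecht",
]
train_labels = [1, 1, 0, 0]  # 1 = drug-related, 0 = not drug-related

# TF-IDF turns each message into a sparse numeric vector;
# logistic regression then learns one weight per term.
binary_clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
binary_clf.fit(train_texts, train_labels)

# The pipeline accepts raw text at prediction time.
predictions = binary_clf.predict(["xtc bezorging", "kleding ophalen"])
probabilities = binary_clf.predict_proba(["xtc bezorging", "kleding ophalen"])
print(predictions)
print(probabilities)
```

Wrapping both steps in one pipeline means the fitted vocabulary travels with the classifier, so a single saved object can be called on raw messages.
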

To handle tasks where one message can carry several labels at once, the model was extended with a `MultiOutputClassifier`. This covers messages that belong to multiple categories, such as "soft drugs", "hard drugs", and "medicines". One-hot encoding was used to transform the labels into a multi-label target.
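
The multi-label extension could look roughly like the sketch below. It assumes `MultiLabelBinarizer` for the label encoding, which is one common way to get a 0/1 column per label and may differ from what was actually used; the toy messages and the names `texts`, `tags`, and `multi_clf` are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy data: each message has one or more tags.
texts = [
    "coke en xtc te koop, bezorging mogelijk",
    "wiet en hasj, alleen pickup",
    "xtc pillen per post verstuurd",
    "wiet bezorging in heel Amsterdam",
]
tags = [
    ["harddrugs", "bezorging"],
    ["softdrugs", "pickup"],
    ["harddrugs", "post"],
    ["softdrugs", "bezorging"],
]

# Encode the tag lists as a 0/1 matrix with one column per label.
# (Assumption: the model card only says "one-hot encoding".)
binarizer = MultiLabelBinarizer()
Y = binarizer.fit_transform(tags)

# MultiOutputClassifier fits one independent logistic regression per column.
multi_clf = make_pipeline(
    TfidfVectorizer(),
    MultiOutputClassifier(LogisticRegression()),
)
multi_clf.fit(texts, Y)

pred = multi_clf.predict(["xtc bezorging", "hasj pickup"])
# Map the 0/1 rows back to label names.
print(binarizer.inverse_transform(pred))
```

Because each label gets its own classifier, a message can receive any combination of labels, including none.
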

Performance was evaluated with Hamming loss, a metric suited to multi-label classification: it is the fraction of individual label assignments that are wrong. The model reached a Hamming loss of 0.04, meaning that on average 96% of the individual label predictions are correct.
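
To make the metric concrete, the sketch below computes Hamming loss on a small made-up example. The matrices `y_true` and `y_pred` are invented for illustration and are not the model's test data; they are chosen so that 1 of 25 label assignments is wrong.

```python
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss

# Made-up ground truth: 5 messages x 5 labels.
y_true = np.array([
    [1, 0, 1, 0, 0],
    [0, 1, 0, 0, 1],
    [1, 0, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 1, 0, 0, 0],
])

# Predictions that differ from the truth in exactly one cell.
y_pred = y_true.copy()
y_pred[2, 3] = 0

loss = hamming_loss(y_true, y_pred)            # 1 wrong cell / 25 cells
per_label_accuracy = 1 - loss                  # share of correct label assignments
subset_accuracy = accuracy_score(y_true, y_pred)  # rows that match on every label

print(loss, per_label_accuracy, subset_accuracy)
```

Here the Hamming loss is 1/25 = 0.04, while only 4 of the 5 messages are labeled correctly on every label at once (subset accuracy 0.8). This is why per-label accuracy derived from Hamming loss is usually higher than exact-match accuracy.
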

### Tools used to train the model

- Python
- scikit-learn
- pandas
- numpy

### How to Get Started with the Model

Use the code below to get started with the model.

```python
from joblib import load

# Load the trained model; it is called on raw text below
clf = load('model.joblib')

# Two example messages (Dutch):
# 1. "Old clothing for sale! Send a message / We also do repairs!"
# 2. A price list for cocaine and XTC
text_messages = [
    """
    Oud kleding te koop! Stuur een berichtje
    We repareren ook!
    """,
    """
    COKE/XTC
    * 1Gram = €50
    * 5Gram = €230
    """,
]

# Output column index -> label name
# (bezorging = delivery, drugsad = drug ad, geendrugsad = no drug ad,
#  medicijnen = medicines, post = mail)
mapping = {
    0: "bezorging", 1: "bulk", 2: "designer", 3: "drugsad",
    4: "geendrugsad", 5: "harddrugs", 6: "medicijnen",
    7: "pickup", 8: "post", 9: "softdrugs",
}

# For each message, collect the names of all labels predicted as 1
labels = []
for message in clf.predict(text_messages):
    label = []
    for idx, labeled in enumerate(message):
        if labeled == 1:
            label.append(mapping[idx])
    labels.append(label)

print(labels)
```

## Details

- **Shared by:** Dutch Public Broadcasting Foundation (NOS)
- **Model type:** text-classification
- **Language:** Dutch
- **License:** Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 (CC BY-NC-ND 4.0)