---
tags:
- sklearn
- text-classification
language:
- nl
metrics:
- accuracy
- hamming-loss
---


# Model card for NOS Drug-Related Text Classification on Telegram
The NOS editorial team is investigating drug-related messages on Telegram. Thousands of Telegram messages have been labeled as drug-related (or not), along with details on the specific type of drugs and the delivery method. The labeled data is used to train a model that can automatically label millions more messages.

## Methodology
A Logistic Regression model was first trained for binary classification (drug-related or not). Text was converted to numeric features with scikit-learn's `TfidfVectorizer`, which weights each term by its term frequency-inverse document frequency (TF-IDF). This representation lets the model learn which terms are indicative of drug-related content. The model achieved 97% accuracy on the test set.
To handle messages that can carry several labels at once, the model was extended with a `MultiOutputClassifier`. This makes it possible to associate a single message with multiple categories, such as "soft drugs", "hard drugs", and "medicines". The labels were one-hot encoded, giving one binary indicator per category.
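
The setup described above can be sketched roughly as follows. This is a minimal illustration and not the actual training code: the toy messages, the tag lists, the use of `MultiLabelBinarizer` for the one-hot step, and the `max_iter` setting are all assumptions made for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy data (Dutch, like the real messages); the actual
# training set is not part of this model card.
texts = [
    "coke en xtc te koop, bezorging mogelijk",
    "wiet en hasj, ophalen in de stad",
    "oude kleding te koop",
    "fiets te koop, stuur een bericht",
    "xtc pillen per post",
    "wiet bezorging vanavond",
]
tags = [
    ["drugsad", "harddrugs", "bezorging"],
    ["drugsad", "softdrugs", "pickup"],
    ["geendrugsad"],
    ["geendrugsad"],
    ["drugsad", "harddrugs", "post"],
    ["drugsad", "softdrugs", "bezorging"],
]

# One-hot step (assumed): one binary column per label, sorted by name.
binarizer = MultiLabelBinarizer()
y = binarizer.fit_transform(tags)

# TF-IDF features feed one Logistic Regression per label column.
model = make_pipeline(
    TfidfVectorizer(),
    MultiOutputClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(texts, y)

# Predict a binary row per message and map it back to label names.
pred = model.predict(["xtc te koop"])
predicted_tags = binarizer.inverse_transform(pred)
print(predicted_tags)
```

`MultiOutputClassifier` fits an independent classifier for every label column, so each label is predicted without regard to the others.
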
Performance was evaluated with Hamming loss, a metric suited to multi-label classification that measures the fraction of individual label assignments that are wrong. The model reached a Hamming loss of 0.04, meaning that 96% of the individual label predictions were correct.
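
As a small worked example of how a Hamming loss of 0.04 arises, the sketch below uses made-up label matrices (5 messages, 5 labels) with a single wrong label out of 25. The arrays are hypothetical and do not come from the real test set.

```python
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss

# Hypothetical ground truth: rows are messages, columns are labels.
y_true = np.array([
    [1, 0, 0, 1, 0],
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0],
    [1, 0, 0, 0, 1],
])

# Predictions identical to the truth except for one missed label.
y_pred = y_true.copy()
y_pred[2, 2] = 0

# Fraction of individual label slots that are wrong: 1 / 25.
loss = hamming_loss(y_true, y_pred)
per_label_accuracy = 1 - loss

# Subset accuracy is stricter: a message only counts if every label matches.
subset_accuracy = accuracy_score(y_true, y_pred)

print(round(loss, 2), round(per_label_accuracy, 2), round(subset_accuracy, 2))
```

Per-label accuracy (1 minus the Hamming loss) is therefore more forgiving than exact-match accuracy, which in this toy case drops to 0.8 because one of the five messages has an incorrect label.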

### Tools used to train the model
- Python
- scikit-learn
- pandas
- numpy
    
### How to Get Started with the Model

Use the code below to get started with the model.

```python
from joblib import load

# load the model
clf = load('model.joblib') 

# make some predictions

text_messages = [
    """
    Oud kleding te koop! Stuur een berichtje
    We repareren ook!
    """,
             
    """
    COKE/XTC
    * 1Gram = €50
    * 5Gram = €230
    """]

# map each output column index to its label name
mapping = {0: "bezorging", 1: "bulk", 2: "designer", 3: "drugsad", 4: "geendrugsad",
           5: "harddrugs", 6: "medicijnen", 7: "pickup", 8: "post", 9: "softdrugs"}

# collect the names of all labels predicted for each message
labels = []
for prediction in clf.predict(text_messages):
    labels.append([mapping[idx] for idx, flag in enumerate(prediction) if flag == 1])

# print once, after all messages have been processed
print(labels)

```

## Details
- **Shared by:** Dutch Public Broadcasting Foundation (NOS)
- **Model type:** text-classification
- **Language:** Dutch
- **License:** Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 (CC BY-NC-ND 4.0)