Model card for NOS Drug-Related Text Classification on Telegram

The NOS editorial team is investigating drug-related messages on Telegram. Thousands of Telegram messages have been labeled as drug-related or not, with additional detail on the specific type of drugs and the delivery method. This labeled data is used to train a model that scales up the work by automatically labeling millions more messages.

Methodology

The primary model is a Logistic Regression classifier trained for binary classification. Text data was converted to numeric values with scikit-learn's TfidfVectorizer, which weights each term by its term frequency-inverse document frequency (TF-IDF). This transformation lets the model learn which words and word patterns are indicative of each class. The model achieved 97% accuracy on the test set.

To handle messages that can carry several labels at once, a MultiOutputClassifier was employed as an extension. This addresses the complexity of associating a text message with multiple categories such as "soft drugs," "hard drugs," and "medicines." One-hot encoding was used to turn the labels into binary indicator columns, one per label. Performance was evaluated with Hamming loss, a metric suited to multi-label classification that measures the fraction of individual label assignments that are wrong. The model reached a Hamming loss of 0.04, meaning that on average 96% of the individual label assignments are correct.
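To make the relation between a Hamming loss of 0.04 and "96% per label" concrete, here is a minimal worked sketch in plain Python. The indicator rows are hypothetical and only illustrate the arithmetic; they are not taken from the NOS data. The helper name `hamming_loss_manual` is likewise made up for this example.

```python
def hamming_loss_manual(y_true, y_pred):
    """Fraction of individual label slots that were predicted incorrectly."""
    total = 0
    wrong = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            total += 1
            wrong += int(t != p)
    return wrong / total


# Hypothetical ground truth: 5 messages x 10 labels (one column per label).
y_true = [
    [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
    [1, 0, 0, 1, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
]

# Hypothetical predictions: identical except for two wrong label slots.
y_pred = [row[:] for row in y_true]
y_pred[0][3] = 1  # one false positive
y_pred[3][8] = 0  # one missed label

# 2 wrong slots out of 50 gives a Hamming loss of 0.04.
loss = hamming_loss_manual(y_true, y_pred)

# Share of messages whose full label set is predicted exactly right.
exact_match = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(loss, exact_match)
```

Note that Hamming loss counts label slots, not messages: in this toy case 96% of the slots are right, yet only three of the five messages have a fully correct label set.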

Tools used to train the model

• Python
• scikit-learn
• pandas
• numpy
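
With the tools listed above, the setup described under Methodology could look roughly like the sketch below. It is not the NOS training code: the toy messages and tags are hypothetical, `MultiLabelBinarizer` is an assumed way of producing the indicator columns, and the hyperparameters are scikit-learn defaults apart from `max_iter`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import hamming_loss
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy data; the real training set is not part of this card.
texts = [
    "coke 1 gram 50 euro bezorging",
    "wiet en hasj ophalen in de stad",
    "xtc pillen per post verstuurd",
    "fiets te koop bel mij",
    "oude kleding te koop",
    "wiet bezorging vanavond",
]
tags = [
    ["harddrugs"],
    ["softdrugs"],
    ["harddrugs"],
    [],
    [],
    ["softdrugs"],
]

# Turn the tag lists into one 0/1 indicator column per label.
binarizer = MultiLabelBinarizer()
Y = binarizer.fit_transform(tags)

# TF-IDF features feed one logistic regression per label column.
pipeline = make_pipeline(
    TfidfVectorizer(),
    MultiOutputClassifier(LogisticRegression(max_iter=1000)),
)
pipeline.fit(texts, Y)

# Scored on the training texts only to keep the sketch short;
# a real evaluation would use a held-out test set.
pred = pipeline.predict(texts)
loss = hamming_loss(Y, pred)

print(list(binarizer.classes_))
print(loss)
```

`MultiOutputClassifier` fits an independent copy of the base estimator for each label column, so every column in the training data needs both positive and negative examples.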

How to Get Started with the Model

Use the code below to get started with the model.

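from joblib import load

# Load the trained model (scikit-learn must be installed to unpickle it)
clf = load('model.joblib')

# Two example messages in Dutch: an ad for second-hand clothing
# and a price list for cocaine/XTC
text_messages = [
    """
    Oud kleding te koop! Stuur een berichtje
    We repareren ook!
    """,

    """
    COKE/XTC
    * 1Gram = €50
    * 5Gram = €230
    """]

# Column index in the model output -> label name
mapping = {0: "bezorging", 1: "bulk", 2: "designer", 3: "drugsad", 4: "geendrugsad", 5: "harddrugs", 6: "medicijnen", 7: "pickup", 8: "post", 9: "softdrugs"}

# clf.predict returns one row of 0/1 flags per message;
# collect the names of the labels that are switched on
labels = []

for message in clf.predict(text_messages):
    label = []
    for idx, labeled in enumerate(message):
        if labeled == 1:
            label.append(mapping[idx])
    labels.append(label)

print(labels)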

Details

  • Shared by Dutch Public Broadcasting Foundation (NOS)
  • Model type: text-classification
  • Language: Dutch
  • License: Creative Commons Attribution Non Commercial No Derivatives 4.0