Spam detection of Tweets
This model classifies Tweets from X (formerly known as Twitter) into 'Spam' (1) or 'Quality' (0).
Training Dataset
This model was fine-tuned on UtkMl's Twitter Spam Detection dataset, with FacebookAI/xlm-roberta-large as the base model.
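For context, here is a minimal sketch of how a two-label classification head is typically attached to the base model for a fine-tune like this one. This is an assumption about the setup, not the author's published training script; only the label mapping (0 = Quality, 1 = Spam) comes from this card.

```python
# Hedged sketch: a typical starting point for fine-tuning xlm-roberta-large as a
# two-label classifier. Trainer and hyperparameter choices are omitted because
# the card does not publish them; only the label mapping comes from the card.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

base = "FacebookAI/xlm-roberta-large"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(
    base,
    num_labels=2,
    id2label={0: "Quality", 1: "Spam"},
    label2id={"Quality": 0, "Spam": 1},
)
```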
How to use the model
Here is some starter code you can use to detect spam tweets in a dataset of text-based tweets. It requires torch, datasets, tqdm, and transformers (plus pandas for the input DataFrame).
```python
import torch
from datasets import Dataset
from torch.utils.data import DataLoader
from tqdm.auto import tqdm
from transformers import AutoModelForSequenceClassification, AutoTokenizer

def classify_texts(df, text_col, model_path="cja5553/xlm-roberta-Twitter-spam-classification", batch_size=24):
    '''
    Classifies texts as either "Quality" or "Spam" using a pre-trained sequence classification model.

    Parameters:
    -----------
    df : pandas.DataFrame
        DataFrame containing the texts to classify.
    text_col : str
        Name of the column that contains the text data to be classified.
    model_path : str, default="cja5553/xlm-roberta-Twitter-spam-classification"
        Path to the pre-trained model for sequence classification.
    batch_size : int, optional, default=24
        Batch size for loading and processing data in batches. Adjust based on available GPU memory.

    Returns:
    --------
    pandas.DataFrame
        The original DataFrame with an additional column `spam_prediction`, containing
        the predicted labels ("Quality" or "Spam") for each text.
    '''
    # Run on GPU if available, otherwise fall back to CPU
    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Load the tokenizer and model
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForSequenceClassification.from_pretrained(model_path).to(device)
    model.eval()  # Set model to evaluation mode

    # Prepare the text data for classification
    df["text"] = df[text_col].astype(str)  # Ensure text is in string format

    # Convert the data to a Hugging Face Dataset and tokenize
    text_dataset = Dataset.from_pandas(df)

    def tokenize_function(example):
        return tokenizer(
            example["text"],
            padding="max_length",
            truncation=True,
            max_length=512,
        )

    text_dataset = text_dataset.map(tokenize_function, batched=True)
    text_dataset.set_format(type="torch", columns=["input_ids", "attention_mask"])

    # DataLoader for the text data
    text_loader = DataLoader(text_dataset, batch_size=batch_size)

    # Make predictions
    predictions = []
    with torch.no_grad():
        for batch in tqdm(text_loader):
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)

            # Forward pass
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            logits = outputs.logits
            preds = torch.argmax(logits, dim=-1).cpu().numpy()  # Get predicted labels
            predictions.extend(preds)

    # Map integer predictions (0/1) to human-readable labels
    id2label = {0: "Quality", 1: "Spam"}
    predicted_labels = [id2label[int(pred)] for pred in predictions]

    # Add predictions to the original DataFrame
    df["spam_prediction"] = predicted_labels
    return df

# `df` is your own DataFrame; "text_col" is the name of the column holding the tweets
spam_df_classification = classify_texts(df, "text_col")
print(spam_df_classification)
```
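For quick spot checks on a handful of tweets, the generic transformers text-classification pipeline is a lighter-weight alternative. A minimal sketch (the example tweet is invented, and the label names the pipeline returns come from the model's stored config, so they may appear as LABEL_0/LABEL_1 rather than "Quality"/"Spam"):

```python
# Minimal sketch using the generic text-classification pipeline.
# Returned label names come from the model config and may show as
# LABEL_0 (= Quality) / LABEL_1 (= Spam) rather than readable names.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="cja5553/xlm-roberta-Twitter-spam-classification",
)
print(classifier("Win a FREE iPhone now!!! Click the link in bio"))
```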
Metrics
Based on an 80-10-10 train-val-test split, the following results were obtained on the test set:
- Accuracy: 0.974555
- Precision: 0.97457
- Recall: 0.97455
- F1-Score: 0.97455
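For reference, metrics like these can be computed with scikit-learn, as in the sketch below. The `y_true`/`y_pred` values are toy placeholders, and `average="weighted"` is an assumption, since the card does not state how precision, recall, and F1 were averaged.

```python
# Hypothetical evaluation sketch: y_true / y_pred are toy placeholders, and
# average="weighted" is an assumption about how the card's metrics were averaged.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [0, 1, 1, 0, 1]  # ground-truth labels from the held-out test split
y_pred = [0, 1, 1, 0, 0]  # model predictions on the same split

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted"
)
print(f"Accuracy: {accuracy:.4f}, Precision: {precision:.4f}, "
      f"Recall: {recall:.4f}, F1: {f1:.4f}")
```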
Questions?
Contact me at alba@wustl.edu.