COVID-19 Misinformation Detection Tool for YouTube Videos

This model is a fine-tuned version of DeBERTa-v3-large, trained to detect COVID-19 misinformation in YouTube videos.

Model Description

Given a YouTube video's metadata (title, description, transcript, and tags), the model predicts one of three numeric labels: opposing COVID-19 misinformation (0), neutral information (1), or supporting COVID-19 misinformation (2).

To learn more about these labels, please refer to the paper: Algorithmic Behaviors Across Regions: A Geolocation Audit of YouTube Search for COVID-19 Misinformation between the United States and South Africa. The video dataset used to train and evaluate the model is available on GitHub.

Training Hyperparameters

The following hyperparameters were used during training:

OPTIMIZER: Adam optimizer with cross-entropy loss function
LEARNING_RATE = 5e-6
TRAIN_BATCH_SIZE = 4
WEIGHT_DECAY = 1e-04
VALIDATION_BATCH_SIZE = 4
TEST_BATCH_SIZE = 4
NUM_EPOCHS = 5
MIN_SAVE_EPOCH = 2
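
As a rough illustration, the settings above could be collected into a small configuration object. This is a minimal sketch, not the authors' training script: the names TrainConfig and build_optimizer_and_loss are hypothetical, and it assumes PyTorch's torch.optim.Adam and torch.nn.CrossEntropyLoss correspond to the optimizer and loss listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters copied from the list above (class name is hypothetical)."""
    learning_rate: float = 5e-6
    weight_decay: float = 1e-4
    train_batch_size: int = 4
    validation_batch_size: int = 4
    test_batch_size: int = 4
    num_epochs: int = 5
    # Assumption: checkpoints are only considered from this epoch onward.
    min_save_epoch: int = 2


def build_optimizer_and_loss(model, cfg: TrainConfig):
    """Create an Adam optimizer and a cross-entropy loss for `model`."""
    import torch  # imported lazily so the config can be inspected without PyTorch

    optimizer = torch.optim.Adam(
        model.parameters(),
        lr=cfg.learning_rate,
        weight_decay=cfg.weight_decay,
    )
    loss_fn = torch.nn.CrossEntropyLoss()
    return optimizer, loss_fn


cfg = TrainConfig()
```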

The dataset was split 80-10-10 into train (N=2180), validation (N=272), and test (N=273) sets. The model was fine-tuned on a single NVIDIA A40 GPU.
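
The split sizes follow from integer arithmetic on the 2,725 labeled videos (2180 + 272 + 273). The sketch below reproduces those sizes and shows one possible way to draw such a split. The shuffling procedure, the seed, and the function names are assumptions for illustration; the exact procedure used by the authors is described in the paper and repository, not here.

```python
import random


def split_sizes(n_total, train_pct=80, val_pct=10):
    """Return (train, validation, test) sizes; the remainder goes to test."""
    n_train = n_total * train_pct // 100
    n_val = n_total * val_pct // 100
    n_test = n_total - n_train - n_val
    return n_train, n_val, n_test


def split_indices(n_total, seed=0):
    """Shuffle row indices and cut them into three disjoint index lists."""
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)  # hypothetical seed, for repeatability
    n_train, n_val, _ = split_sizes(n_total)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


print(split_sizes(2725))  # (2180, 272, 273)
```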

How to Get Started with the Model

To get started, initialize the tokenizer and the model with the AutoTokenizer and AutoModelForSequenceClassification classes from the Hugging Face transformers library. For the tokenizer, set the "use_fast" parameter to False, max_len to 1024, padding to "max_length", and truncation to True. For the model, set the "num_labels" parameter to 3.
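
A minimal loading sketch under a few assumptions: MODEL_ID is a placeholder (this card's text does not state the model's Hub identifier), load_classifier is a hypothetical helper name, and the length, padding, and truncation settings are passed as max_length, padding, and truncation when the tokenizer is called, which is how current transformers releases expect them.

```python
# Placeholder: replace with this model's Hugging Face Hub ID or a local path.
MODEL_ID = "path-or-hub-id-of-this-model"

# Arguments applied each time the tokenizer is called on input text.
TOKENIZER_KWARGS = {
    "max_length": 1024,
    "padding": "max_length",
    "truncation": True,
    "return_tensors": "pt",
}


def load_classifier(model_id=MODEL_ID):
    """Load the slow tokenizer and the 3-label sequence classifier."""
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_id, num_labels=3
    )
    model.eval()  # disable dropout for inference
    return tokenizer, model
```

Calling tokenizer, model = load_classifier() downloads or reads the weights, so it needs the transformers, torch, and sentencepiece packages installed.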

Next, for each video in your dataset, concatenate its title, description, transcript, and tags in the following manner:

input = 'VIDEO TITLE: ' + title + '\nVIDEO DESCRIPTION: ' + description + '\nVIDEO TRANSCRIPT: ' + transcript + '\nVIDEO TAGS: ' + tags
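
The concatenation above can be wrapped in a small helper so that every video is formatted identically. This is a sketch: the function name build_input is hypothetical, and treating missing fields as empty strings is an assumption, not something the paper specifies.

```python
def build_input(title, description, transcript, tags):
    """Format one video's metadata in the structure the model was trained on."""
    # Assumption: a missing field (None) is treated as an empty string.
    title, description, transcript, tags = (
        "" if field is None else str(field)
        for field in (title, description, transcript, tags)
    )
    return (
        "VIDEO TITLE: " + title
        + "\nVIDEO DESCRIPTION: " + description
        + "\nVIDEO TRANSCRIPT: " + transcript
        + "\nVIDEO TAGS: " + tags
    )


example = build_input("My title", "My description", None, "covid, vaccine")
```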

Each video in your dataset should have its metadata formatted in the structure above. Finally, pass the formatted input through the tokenizer and feed the tokenized input into the model. The predicted label is the index of the largest logit in the model output:

_, pred_idx = outputs.logits.max(dim=1)
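
Putting the tokenization and prediction steps together, a batch of formatted inputs could be classified roughly as follows. This is a sketch under assumptions: predict and labels_from_indices are hypothetical helper names, the tokenizer and model are assumed to be already-loaded transformers objects, and the tokenizer settings are passed as call-time arguments.

```python
# Label names follow the Model Description section of this card.
ID2LABEL = {
    0: "opposing COVID-19 misinformation",
    1: "neutral information",
    2: "supporting COVID-19 misinformation",
}


def labels_from_indices(indices):
    """Map predicted class indices (0, 1, 2) to readable label names."""
    return [ID2LABEL[int(i)] for i in indices]


def predict(texts, tokenizer, model):
    """Tokenize formatted inputs and return one label name per input."""
    import torch  # imported lazily; required only when running the model

    encoded = tokenizer(
        list(texts),
        max_length=1024,
        padding="max_length",
        truncation=True,
        return_tensors="pt",
    )
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(**encoded)
    # Index of the largest logit along the class dimension.
    _, pred_idx = outputs.logits.max(dim=1)
    return labels_from_indices(pred_idx.tolist())
```

Long transcripts are truncated to 1024 tokens by these settings, so text beyond that point does not influence the prediction.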

Training Data

The video dataset used to train and evaluate the model is available on GitHub.

The dataset was annotated by Amazon Mechanical Turk (AMT) workers and the paper's authors. Please refer to the paper for more information on the training data and its annotation process.

Each video in the dataset was labeled with one of the following 7 classes: "Opposing COVID-19 Misinformation (-1)," "Neutral COVID-19 Information (0)," "Supporting COVID-19 Misinformation (1)," "On the COVID-19 origins in Wuhan, China (2)," "Irrelevant (3)," "Video in a language other than English (4)," and "URL not accessible (5)." However, as explained in the paper, we consolidated the 7 classes into 3 classes based on each video's stance on COVID-19 misinformation: supporting, neutral, and opposing (see subsection "Consolidating from 5-classes to 3-classes" in the paper for more information).

Since the classifier's pred_idx can only be non-negative, we adjusted the 3-point annotation labels for the classifier by adding one. Thus, the classifier will output the following label values: opposing COVID-19 misinformation (0), neutral (1), and supporting COVID-19 misinformation (2).
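
The shift between the 3-point annotation scale (-1, 0, 1) and the classifier's output indices (0, 1, 2) can be written out explicitly. The function names below are hypothetical and only illustrate the "add one" adjustment described above.

```python
LABEL_OFFSET = 1  # classifier label = annotation label + 1


def to_classifier_label(annotation_label):
    """Convert a 3-point annotation label (-1, 0, 1) to a class index (0, 1, 2)."""
    if annotation_label not in (-1, 0, 1):
        raise ValueError(f"unexpected annotation label: {annotation_label}")
    return annotation_label + LABEL_OFFSET


def to_annotation_label(classifier_label):
    """Convert a predicted class index (0, 1, 2) back to the annotation scale."""
    if classifier_label not in (0, 1, 2):
        raise ValueError(f"unexpected classifier label: {classifier_label}")
    return classifier_label - LABEL_OFFSET


print([to_classifier_label(a) for a in (-1, 0, 1)])  # [0, 1, 2]
```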

Results

On the test set, the model achieved 0.85 on each of three metrics: accuracy, weighted F1-score, and macro F1-score.

Citation

If you use this model or the dataset from the GitHub repository in your research, please cite our work:

@misc{jung2024algorithmicbehaviorsregionsgeolocation,
      title={Algorithmic Behaviors Across Regions: A Geolocation Audit of YouTube Search for COVID-19 Misinformation between the United States and South Africa}, 
      author={Hayoung Jung and Prerna Juneja and Tanushree Mitra},
      year={2024},
      eprint={2409.10168},
      archivePrefix={arXiv},
      primaryClass={cs.CY},
      url={https://arxiv.org/abs/2409.10168}, 
}