---
license: apache-2.0
language:
  - bn
library_name: transformers
pipeline_tag: fill-mask
---

# BanglaClickBERT

This repository contains BanglaClickBERT, a further pretrained version of BanglaBERT designed specifically for clickbait detection in Bengali (Bangla) news headlines. The model continues pretraining with the Masked Language Model (MLM) objective to gain contextual understanding of clickbait-style language and improve its ability to identify clickbait content. Its pretraining data, collected from clickbait-prone news websites, consists of 1 million unlabeled Bangla news headlines, which helps the model adapt to a variety of contexts.

## Uses

```python
from transformers import AutoModelForPreTraining, AutoTokenizer
import torch

# Load the pretrained model and its tokenizer.
model = AutoModelForPreTraining.from_pretrained("samanjoy2/banglaclickbert_base")
tokenizer = AutoTokenizer.from_pretrained("samanjoy2/banglaclickbert_base")

# An original sentence and a perturbed ("fake") variant of it.
original_sentence = "আমি কৃতজ্ঞ কারণ আপনি আমার জন্য অনেক কিছু করেছেন।"
fake_sentence = "আমি হতাশ কারণ আপনি আমার জন্য অনেক কিছু করেছেন।"

# Tokenize the perturbed sentence and score each token with the model.
fake_tokens = tokenizer.tokenize(fake_sentence)
fake_inputs = tokenizer.encode(fake_sentence, return_tensors="pt")
discriminator_outputs = model(fake_inputs).logits
# Round each token's logit to 0 or 1 (1 = token the model flags as replaced).
predictions = torch.round((torch.sign(discriminator_outputs) + 1) / 2)

# Print the tokens and the corresponding predictions (skipping [CLS]/[SEP]).
print("".join("%7s" % token for token in fake_tokens))
print("-" * 50)
print("".join("%7s" % int(p) for p in predictions.squeeze().tolist()[1:-1]))
print("-" * 50)
```

### Direct Use

BanglaClickBERT can be directly used for clickbait detection in Bengali (Bangla) news headlines. Its primary intended use is to help identify and filter out clickbait content from news articles, websites, or other textual sources written in the Bengali language. This can be valuable for news organizations, social media platforms, or anyone interested in promoting accurate and trustworthy information.
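
For illustration only, here is a minimal sketch of how the model could be fine-tuned as a binary clickbait classifier with Hugging Face Transformers; the example headlines, labels, and hyperparameters are placeholders and not part of this repository.

```python
# Hypothetical sketch: fine-tune BanglaClickBERT as a binary clickbait classifier.
# The headlines and labels below are placeholders; supply your own labeled dataset.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("samanjoy2/banglaclickbert_base")
model = AutoModelForSequenceClassification.from_pretrained(
    "samanjoy2/banglaclickbert_base",
    num_labels=2,  # 0 = not clickbait, 1 = clickbait
)

headlines = ["placeholder headline 1", "placeholder headline 2"]  # Bangla headlines
labels = torch.tensor([0, 1])                                     # matching labels

# Tokenize the batch and run a single training step with the built-in loss.
inputs = tokenizer(headlines, padding=True, truncation=True, return_tensors="pt")
outputs = model(**inputs, labels=labels)
outputs.loss.backward()

# At inference time, the predicted class is the argmax over the two logits.
with torch.no_grad():
    pred = model(**inputs).logits.argmax(dim=-1)
```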

## Bias, Risks, and Limitations

- **Data Bias:** Because the model's pretraining data was collected from clickbait-prone news websites, the model may exhibit biases present in those sources. This can make it more sensitive to certain types of clickbait and less accurate at detecting others. Efforts should be made to mitigate and address this data bias.

- **Contextual Limitations:** BanglaClickBERT is designed for clickbait detection in Bengali news headlines and may not perform as effectively in other contexts or languages. It may not be suitable for detecting clickbait in non-Bengali languages or in different cultural contexts.

- **False Positives and Negatives:** Like any clickbait detection model, BanglaClickBERT may produce false positives (genuine content incorrectly identified as clickbait) and false negatives (clickbait content that goes undetected). Users and organizations should be aware of these limitations and consider additional checks.

- **Evolving Clickbait Techniques:** Clickbait techniques constantly evolve, so the model may not immediately identify new or sophisticated clickbait strategies. Continuous model updates and monitoring are necessary to keep pace with changing tactics.

- **Limited Context:** The model processes individual headlines and may not consider the broader context of the full article or website. Some clickbait relies on the content of the article itself, which headline analysis alone may not fully address.

## Training Details

### Training Data

We collected a diverse set of 1 million clickbait-prone news headlines from various online sources. The headlines were chosen to cover a wide range of clickbait styles and topics, such as lifestyle, entertainment, business, and viral videos, ensuring the model's adaptability to different contexts.

### Training Procedure

The model is pretrained on the unlabeled data as a Masked Language Model (MLM) using the Transformer architecture. The MLM objective randomly masks words or tokens in the input and trains the model to predict the missing tokens from the context provided by the surrounding tokens. During pretraining, the model learns the linguistic patterns, context, and features of the Bangla language; the large amount of unlabeled data is crucial for this general language understanding.
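
As a rough sketch of this procedure (not the exact training script used for BanglaClickBERT), continued MLM pretraining with Hugging Face Transformers typically looks like the following; the base checkpoint, data path, and hyperparameters are assumptions.

```python
# Hypothetical sketch of continued MLM pretraining on unlabeled headlines.
# The base checkpoint, data path, and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("csebuetnlp/banglabert")
model = AutoModelForMaskedLM.from_pretrained("csebuetnlp/banglabert")

# One unlabeled headline per line in a plain-text file (placeholder path).
dataset = load_dataset("text", data_files={"train": "clickbait_headlines.txt"})
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=64),
    batched=True,
    remove_columns=["text"],
)

# Randomly mask a fraction of tokens; the model learns to predict them from context.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="banglaclickbert_mlm", num_train_epochs=1),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```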

#### Speeds, Sizes, Times

BanglaClickBERT is a BERT-based model built on the foundational architecture of BERT (Bidirectional Encoder Representations from Transformers) with 12 transformer encoder layers.
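
These dimensions can be confirmed from the published configuration, for example:

```python
from transformers import AutoConfig

# Inspect the model configuration (standard Transformers config fields).
config = AutoConfig.from_pretrained("samanjoy2/banglaclickbert_base")
print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads)
```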

## Citation

If you use this model, please cite the following paper:

@inproceedings{}