---
language: en
thumbnail: https://bagdeabhishek.github.io/twitterAnalysis_files/networkfin.jpg
tags:
- India
- politics
- tweets
- BJP
- Congress
- AAP
- pytorch
- gpt2
- lm-head
- text-generation
license: apache-2.0
datasets:
- Twitter
- IndianPolitics
---
# Indian Political Tweets LM

## Model description
Note: This model is based on GPT2. If you want a bigger model based on GPT2-medium and fine-tuned on the same data, please take a look at the IndianPoliticalTweetsLMMedium model.

This is a GPT2 language model with an LM head, fine-tuned on tweets crawled from handles that belong predominantly to Indian politics. For more information about the crawled data, you can go through this blog post.
## Intended uses & limitations

This fine-tuned model can be used to generate tweets related to Indian politics.

#### How to use
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

# AutoModelForCausalLM replaces the deprecated AutoModelWithLMHead and
# loads the same GPT-2 model with its LM head.
tokenizer = AutoTokenizer.from_pretrained("bagdaebhishek/IndianPoliticalTweetsLM")
model = AutoModelForCausalLM.from_pretrained("bagdaebhishek/IndianPoliticalTweetsLM")

text_generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

init_sentence = "India will always be"
print(text_generator(init_sentence))
```
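The pipeline's defaults work, but for tweet-style output it can help to sample several short candidates. The parameter values below are illustrative, not settings used by the author:

```python
# Sample a few tweet-length candidates (all values are illustrative).
outputs = text_generator(
    init_sentence,
    max_length=60,           # tweets are short, so cap the generated length
    do_sample=True,          # sample instead of greedy decoding
    top_k=50,
    top_p=0.95,
    num_return_sequences=3,  # produce several candidates to choose from
)
for out in outputs:
    print(out["generated_text"])
```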
#### Limitations and bias

- The tweets used to train the model were not manually labelled, so the generated text may not always be in English. I cleaned the data to remove non-English tweets, but the model may still generate "Hinglish" text, so no assumptions should be made about the language of the output.
- I took care to remove tweets from handles that are not very influential, but since the data was not curated by hand there may be artefacts like "-sent via NamoApp" in the generated text.
- Like any language model trained on real-world data, this model exhibits biases that are unfortunately part of the political discourse on Twitter. Please keep this in mind while using its output.
## Training data

I used the pre-trained gpt2 model from the Huggingface transformers repository and fine-tuned it on a custom dataset crawled from Twitter. The method used to identify the political handles is described in detail in a blog post. I used tweets from both the Pro-BJP and Anti-BJP clusters mentioned in the blog.
## Training procedure

For pre-processing, I removed tweets from handles that are not very influential in their cluster. I did this by computing the eigenvector centrality of each handle in the Twitter graph and pruning handles whose centrality fell below a certain threshold, which was set manually after experimenting with different values.
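The card doesn't include the pruning code; below is a minimal sketch of the idea using networkx, where the graph edges and the 0.01 threshold are placeholders (the actual threshold was tuned manually):

```python
import networkx as nx

# Undirected interaction graph between Twitter handles; the edges here
# are placeholders for the real crawled graph.
G = nx.Graph()
G.add_edges_from([
    ("handle_a", "handle_b"),
    ("handle_b", "handle_c"),
    ("handle_c", "handle_a"),
])

# Score every handle by eigenvector centrality and keep only the
# handles above a manually tuned threshold (0.01 is a placeholder).
centrality = nx.eigenvector_centrality(G, max_iter=1000)
THRESHOLD = 0.01
influential_handles = {h for h, c in centrality.items() if c >= THRESHOLD}
```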
I then separated the tweets from these handles by language and trained the LM on the English tweets from both clusters.
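The card doesn't name the language-identification tool used for this split; the sketch below assumes the langdetect package:

```python
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

def keep_english(tweets):
    """Return only the tweets that langdetect classifies as English."""
    english = []
    for tweet in tweets:
        try:
            if detect(tweet) == "en":
                english.append(tweet)
        except LangDetectException:
            # Very short or emoji-only tweets can fail detection; drop them.
            continue
    return english
```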
#### Hardware

- GPU: GTX 1080Ti
- CPU: Ryzen 3900x
- RAM: 32GB
This model took roughly 36 hours to fine-tune.
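The training script itself isn't included in this card. As a rough sketch, the fine-tuning run described above could be set up with the transformers Trainer API as follows; the file path, epoch count, and batch size are placeholders, not the author's actual settings:

```python
from transformers import (
    AutoModelForCausalLM, AutoTokenizer, DataCollatorForLanguageModeling,
    TextDataset, Trainer, TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# One cleaned English tweet per line; "tweets_en.txt" is a placeholder path.
train_dataset = TextDataset(
    tokenizer=tokenizer, file_path="tweets_en.txt", block_size=128
)
# GPT-2 is a causal LM, so the masked-LM objective is disabled.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

training_args = TrainingArguments(
    output_dir="gpt2-indian-political-tweets",  # placeholder output directory
    num_train_epochs=3,                         # placeholder schedule
    per_device_train_batch_size=4,              # placeholder batch size
)
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
)
trainer.train()
trainer.save_model()
```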