daspartho committed on
Commit
d5d5ca7
1 Parent(s): 15c4d2b

Update README.md

  1. README.md +15 -0
README.md CHANGED
@@ -1,3 +1,18 @@
  ---
  license: apache-2.0
  ---

  ---
+ language: en
  license: apache-2.0
  ---
+
+ An NLP model that predicts the subreddit of a post from its title.
+
+ ### Training
+
+ DistilBERT is fine-tuned on [subreddit-posts](https://huggingface.co/datasets/daspartho/subreddit-posts), a dataset of the titles of the top 1000 posts from the top 125 subreddits.
+
+ For the steps used to build the model, see the [model](https://github.com/daspartho/predict-subreddit/blob/main/model.ipynb) notebook in the GitHub repo, or open it in [Colab](https://colab.research.google.com/github/daspartho/predict-subreddit/blob/main/model.ipynb).
+
+ ### Limitations and bias
+
+ - The model is trained only on the top 125 subreddits ([for reference](http://redditlist.com/)), so it can only categorise posts within those subreddits.
+ - Some subreddits use a specific format for post titles. For example, titles in [r/todayilearned](https://www.reddit.com/r/todayilearned) start with "TIL", so the model becomes biased towards "TIL" --> r/todayilearned. This bias can be removed by cleaning the dataset of these specific terms.
+ - In some subreddits, such as [r/gifs](https://www.reddit.com/r/gifs/), the title of the post doesn't matter much, so the model performs poorly on them.
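
The second limitation suggests cleaning format-specific terms out of the dataset. A minimal sketch of what that might look like is below. It is not taken from the author's notebook: the `FORMAT_MARKERS` list and the `strip_format_marker` helper are hypothetical names introduced here for illustration, and only the "TIL" marker comes from the README text.

```python
import re

# Hypothetical list of subreddit-specific title markers. Only "TIL" is
# mentioned in the README; any others would need to be found by inspecting
# the dataset.
FORMAT_MARKERS = ["TIL"]

# Match a marker only at the start of the title and only as a whole word,
# then swallow any separator characters (spaces, colon, comma, hyphen).
_MARKER_RE = re.compile(
    r"^\s*(?:" + "|".join(re.escape(m) for m in FORMAT_MARKERS) + r")\b[\s:,\-]*",
    flags=re.IGNORECASE,
)


def strip_format_marker(title: str) -> str:
    """Remove a leading format marker (e.g. "TIL") from a post title."""
    return _MARKER_RE.sub("", title, count=1).strip()


if __name__ == "__main__":
    for raw in ["TIL that honey never spoils", "Tilting at windmills"]:
        print(repr(raw), "->", repr(strip_format_marker(raw)))
```

The word-boundary check keeps ordinary words that merely begin with the same letters (such as "Tilting") untouched. Whether removing the marker helps or hurts accuracy on r/todayilearned would have to be checked by retraining, which this sketch does not do.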