Update README.md
README.md
CHANGED
@@ -1,3 +1,18 @@
 ---
+language: en
 license: apache-2.0
 ---
+
+An NLP model that predicts a post's subreddit from its title.
+
+### Training
+
+DistilBERT is fine-tuned on [subreddit-posts](https://huggingface.co/datasets/daspartho/subreddit-posts), a dataset of titles of the top 1000 posts from the top 125 subreddits.
+
+For the steps used to build the model, see the [model](https://github.com/daspartho/predict-subreddit/blob/main/model.ipynb) notebook in the GitHub repo, or open it in [Colab](https://colab.research.google.com/github/daspartho/predict-subreddit/blob/main/model.ipynb).
+
+### Limitations and bias
+
+- Since the model is trained on the top 125 subreddits ([for reference](http://redditlist.com/)), it can only categorise posts within those subreddits.
+- Some subreddits have a specific format for their post titles, like [r/todayilearned](https://www.reddit.com/r/todayilearned), where post titles start with "TIL", so the model becomes biased towards "TIL" --> r/todayilearned. This bias can be removed by cleaning the dataset of these specific terms.
+- In some subreddits, like [r/gifs](https://www.reddit.com/r/gifs/), the title of the post doesn't matter much, so the model performs poorly on them.
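
The README's second limitation suggests cleaning format-specific terms such as "TIL" out of the dataset. A minimal sketch of what that cleaning step might look like, assuming titles are plain Python strings. `FORMAT_PREFIXES` and `strip_format_prefix` are hypothetical names for illustration and do not come from the linked notebook; dropping a following "that" is also an assumption about how such titles are usually phrased.

```python
import re

# Hypothetical list of subreddit-specific title markers.
# Only "TIL" is mentioned in the README; extend as needed.
FORMAT_PREFIXES = [
    r"TIL",
]

# Match a marker at the start of the title as a whole word, followed by
# optional separators and (by assumption) an optional "that".
_PREFIX_RE = re.compile(
    r"^\s*(?:" + "|".join(FORMAT_PREFIXES) + r")\b[\s:,\-]*(?:that\s+)?",
    flags=re.IGNORECASE,
)


def strip_format_prefix(title: str) -> str:
    """Remove a leading marker such as "TIL" from a post title."""
    cleaned = _PREFIX_RE.sub("", title, count=1).strip()
    # Keep the original title if stripping would leave nothing.
    return cleaned or title.strip()


titles = [
    "TIL that honey never spoils",
    "TIL: octopuses have three hearts",
    "Tilting at windmills: a history",  # not a marker, left untouched
    "My cat learned to open doors",
]
cleaned_titles = [strip_format_prefix(t) for t in titles]
```

If a step like this were used, it would probably need to run on titles both before fine-tuning and at inference time, so the model sees the same input format in both places.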