Add normalization steps
README.md
CHANGED
@@ -34,7 +34,11 @@ python create_config.py --name_or_path gpt2-medium --params '{"vocab_size": 4200
 Steps:
 
 - [ ] Remove stretched words such as ســــــــــلام
+
 - [ ] Remove links and user mentions (such as @jane_doe)
-
+
+- [ ] Remove Telegram or Instagram advertisements and posts (the whole record)
+
 - [ ] Remove advertisement records
-
+
+- [ ] Remove stray words (or the whole record) that show up as an individual record but are really just the tags from the end of a post (such as بلاب ... بلاب ... ورزشی، خبری، سیاسی، اجتماعی، خانوده)
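The checklist above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the repository's implementation: every name here (`normalize`, `is_advertisement`, `is_tag_list`, `clean_records`, `AD_MARKERS`) is hypothetical, the ad markers are placeholder substrings, and the tag-list test is a crude heuristic. Note also that this sketch strips the tatweel character (U+0640) from stretched words instead of dropping the word, which is one possible reading of the first step.

```python
import re

TATWEEL = "\u0640"  # Arabic-script elongation character used to stretch words
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
MENTION_RE = re.compile(r"@\w+")

# Assumption: placeholder substrings that mark Telegram/Instagram promotion.
AD_MARKERS = ("t.me/", "instagram.com/")


def normalize(text):
    """Strip links, user mentions, and tatweel, then collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.replace(TATWEEL, "")
    return " ".join(text.split())


def is_advertisement(text):
    """Assumed heuristic: a record is an ad if it contains any ad marker."""
    lowered = text.lower()
    return any(marker in lowered for marker in AD_MARKERS)


def is_tag_list(text, min_tags=3):
    """Assumed heuristic: several short comma-separated pieces and nothing else."""
    pieces = [p.strip() for p in re.split(r"[،,]", text) if p.strip()]
    return len(pieces) >= min_tags and all(len(p.split()) <= 2 for p in pieces)


def clean_records(records):
    """Drop ad and tag-only records, normalize the rest, skip empty results."""
    cleaned = []
    for record in records:
        # Check for ads on the raw text, before links are stripped.
        if is_advertisement(record) or is_tag_list(record):
            continue
        text = normalize(record)
        if text:
            cleaned.append(text)
    return cleaned
```

Checking for advertisements before normalizing matters in this sketch, because the link that identifies an ad is removed by `normalize`. A real pipeline would likely need a richer ad detector and a tag heuristic tuned on the actual data.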