LoneWolfgang
committed on
Update README.md
README.md
CHANGED
@@ -19,7 +19,7 @@ This model is recommended for Japanese SNS tasks, like [sentiment analysis](http
 
 ## Training Data
 
-The Twitter API was used to collect
+The Twitter API was used to collect Japanese tweets from June 2022 to April 2023.
 
 N-gram based deduplication was used to reduce spam content and improve the diversity of the training corpus.
 The refined training corpus was 28 million tweets.
@@ -29,16 +29,4 @@ The refined training corpus was 28 million tweets.
 The vocabulary was prepared using the [WordPieceTrainer](https://huggingface.co/docs/tokenizers/api/trainers) with the Twitter training corpus.
 It shares 60% of its vocabulary with Japanese BERT.
 
-The
-
-### Model Description
-
-<!-- Provide a longer summary of what this model is. -->
-
-This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
-
-- **Developed by:** Jordan Wolfgang Klein, as a Master's candidate at the University of Malta.
-- **Model type:** BERT
-- **Language(s) (NLP):** Japanese
-- **License:** [More Information Needed]
-- **Finetuned from model [optional]:**
+The vocabulary includes colloquialisms, neologisms, emoji and kaomoji expressions that are common on Twitter.
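
The card does not spell out how the n-gram deduplication was applied, so the sketch below is only illustrative: it assumes character 3-grams and a Jaccard-similarity threshold for flagging near-duplicates, neither of which is documented for this model.

```python
# Illustrative near-duplicate filter: character 3-grams + Jaccard similarity.
# The n-gram size, threshold, and pairwise strategy are assumptions, not the
# documented pipeline.

def char_ngrams(text: str, n: int = 3) -> set:
    """Set of character n-grams in a tweet."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two n-gram sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def deduplicate(tweets, threshold: float = 0.8):
    """Keep a tweet only if it is not near-identical to any tweet already kept."""
    kept, kept_grams = [], []
    for tweet in tweets:
        grams = char_ngrams(tweet)
        if all(jaccard(grams, seen) < threshold for seen in kept_grams):
            kept.append(tweet)
            kept_grams.append(grams)
    return kept

# Spam-like near-duplicates collapse to a single copy.
print(deduplicate(["今日もいい天気☀", "今日もいい天気☀!", "限定セール実施中"]))
```

At the scale of 28 million tweets, an approximate scheme such as MinHash/LSH would normally replace the pairwise comparison shown here.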
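For reference, a vocabulary like the one described above could be trained with the linked [WordPieceTrainer](https://huggingface.co/docs/tokenizers/api/trainers). The vocabulary size, special tokens, pre-tokenizer, and file name below are placeholders rather than values taken from this model, and Japanese corpora are often pre-segmented with a morphological analyzer (e.g., MeCab) before WordPiece training, a detail the card does not cover.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

# Build an empty WordPiece tokenizer and train its vocabulary on the corpus.
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()  # Japanese text usually needs morphological pre-segmentation first

trainer = WordPieceTrainer(
    vocab_size=32_000,  # placeholder; the actual vocabulary size is not stated here
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)

# "tweets.txt" is a hypothetical file with one deduplicated tweet per line.
tokenizer.train(files=["tweets.txt"], trainer=trainer)
tokenizer.save("tokenizer.json")
```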
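The 60% overlap figure can be checked roughly by comparing the two tokenizers' vocabularies. The repo ids below are placeholders: neither this model's Hub id nor the specific Japanese BERT checkpoint used for the comparison is stated here.

```python
from transformers import AutoTokenizer

# Placeholder repo ids: substitute this model's Hub id and the Japanese BERT
# checkpoint the 60% figure refers to.
twitter_bert = AutoTokenizer.from_pretrained("your-namespace/bert-for-japanese-twitter")
japanese_bert = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-v2")

twitter_vocab = set(twitter_bert.get_vocab())
japanese_vocab = set(japanese_bert.get_vocab())

shared = len(twitter_vocab & japanese_vocab) / len(twitter_vocab)
print(f"Shared vocabulary: {shared:.0%}")
```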