Update README.md
README.md
@@ -45,7 +45,7 @@ model = AutoModelForTokenClassification.from_pretrained(model_name)
 The design keeps the following in mind:
 
 - No morphological analyzer at inference time
--
+- Tokens do not cross word boundaries (dictionary: `unidic-cwj-202302`)
 - Easy to use with Hugging Face
 - A vocabulary size that is not too large
 
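For reference, the points about dropping the morphological analyzer and being easy to use with Hugging Face boil down to the usage pattern below. This is a minimal sketch: the repo id is an assumed placeholder, and `AutoModelForTokenClassification` simply mirrors the call shown in the hunk context above.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Assumed repo id, used here only for illustration.
model_name = "globis-university/deberta-v3-japanese-base"

# No external morphological analyzer (e.g. MeCab) is needed at inference time:
# the bundled tokenizer consumes raw Japanese text directly.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

print(tokenizer.tokenize("推論時に形態素解析器は不要です。"))
```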
@@ -70,8 +70,8 @@ Although the original DeBERTa V3 is characterized by a large vocabulary size, wh
 | WikiBooks | 2023/07; [cl-tohoku's method](https://github.com/cl-tohoku/bert-japanese/blob/main/make_corpus_wiki.py) | 43MB | x2 |
 | Aozora Bunko | 2023/07; [globis-university/aozorabunko-clean](https://huggingface.co/globis-university/aozorabunko-clean) | 496MB | x4 |
 | CC-100 | ja | 90GB | x1 |
-| mC4 | ja; extracted 10
-| OSCAR 2023 | ja; extracted
+| mC4 | ja; extracted 10%, with Wikipedia-like focus via [DSIR](https://arxiv.org/abs/2302.03169) | 91GB | x1 |
+| OSCAR 2023 | ja; extracted 10%, with Wikipedia-like focus via [DSIR](https://arxiv.org/abs/2302.03169) | 26GB | x1 |
 
 # Training parameters
 - Number of devices: 8
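The rightmost column of the table is an upsampling multiplier for each corpus. As an illustrative sketch only (not the repository's actual preprocessing pipeline), a mix with those multipliers could be assembled as below; the file names are hypothetical, and the DSIR-based 10% extraction for mC4 and OSCAR 2023 is assumed to have been done upstream.

```python
import random

# Hypothetical plain-text corpus files, one document per line.
corpora = [
    ("wikibooks_ja.txt", 2),              # x2
    ("aozorabunko_clean.txt", 4),         # x4
    ("cc100_ja.txt", 1),                  # x1
    ("mc4_ja_dsir_subset.txt", 1),        # x1, DSIR-selected subset (assumed prepared upstream)
    ("oscar2023_ja_dsir_subset.txt", 1),  # x1, DSIR-selected subset (assumed prepared upstream)
]

def build_pretraining_mix(corpora, seed=42):
    """Repeat each corpus by its multiplier, then shuffle the combined documents."""
    docs = []
    for path, multiplier in corpora:
        with open(path, encoding="utf-8") as f:
            corpus_docs = [line.rstrip("\n") for line in f if line.strip()]
        docs.extend(corpus_docs * multiplier)  # upsample by simple repetition
    random.Random(seed).shuffle(docs)
    return docs
```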