Update README.md
README.md
@@ -45,7 +45,7 @@ model = AutoModelForTokenClassification.from_pretrained(model_name)
 The design keeps the following in mind:
 
 - No morphological analyzer at inference time
--
+- Tokens do not cross word boundaries (dictionary: `unidic-cwj-202302`)
 - Easy to use with Hugging Face
 - A vocabulary size that is not too large
 
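For reference, the points about dropping the morphological analyzer and being easy to use with Hugging Face boil down to the usage pattern below. This is a minimal sketch: the repo id is an assumed placeholder, and `AutoModelForTokenClassification` simply mirrors the call shown in the hunk context above.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Assumed repo id, used here only for illustration.
model_name = "globis-university/deberta-v3-japanese-base"

# No external morphological analyzer (e.g. MeCab) is needed at inference time:
# the bundled tokenizer consumes raw Japanese text directly.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

print(tokenizer.tokenize("推論時に形態素解析器は不要です。"))
```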
@@ -70,8 +70,8 @@ Although the original DeBERTa V3 is characterized by a large vocabulary size, wh
 | WikiBooks | 2023/07; [cl-tohoku's method](https://github.com/cl-tohoku/bert-japanese/blob/main/make_corpus_wiki.py) | 43MB | x2 |
 | Aozora Bunko | 2023/07; [globis-university/aozorabunko-clean](https://huggingface.co/globis-university/aozorabunko-clean) | 496MB | x4 |
 | CC-100 | ja | 90GB | x1 |
-| mC4 | ja; extracted 10
-| OSCAR 2023 | ja; extracted
+| mC4 | ja; extracted 10%, with Wikipedia-like focus via [DSIR](https://arxiv.org/abs/2302.03169) | 91GB | x1 |
+| OSCAR 2023 | ja; extracted 10%, with Wikipedia-like focus via [DSIR](https://arxiv.org/abs/2302.03169) | 26GB | x1 |
 
 # Training parameters
 - Number of devices: 8
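The rightmost column of the table is an upsampling multiplier for each corpus. As an illustrative sketch only (not the repository's actual preprocessing pipeline), a mix with those multipliers could be assembled as below; the file names are hypothetical, and the DSIR-based 10% extraction for mC4 and OSCAR 2023 is assumed to have been done upstream.

```python
import random

# Hypothetical plain-text corpus files, one document per line.
corpora = [
    ("wikibooks_ja.txt", 2),              # x2
    ("aozorabunko_clean.txt", 4),         # x4
    ("cc100_ja.txt", 1),                  # x1
    ("mc4_ja_dsir_subset.txt", 1),        # x1, DSIR-selected subset (assumed prepared upstream)
    ("oscar2023_ja_dsir_subset.txt", 1),  # x1, DSIR-selected subset (assumed prepared upstream)
]

def build_pretraining_mix(corpora, seed=42):
    """Repeat each corpus by its multiplier, then shuffle the combined documents."""
    docs = []
    for path, multiplier in corpora:
        with open(path, encoding="utf-8") as f:
            corpus_docs = [line.rstrip("\n") for line in f if line.strip()]
        docs.extend(corpus_docs * multiplier)  # upsample by simple repetition
    random.Random(seed).shuffle(docs)
    return docs
```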