---
license: apache-2.0
datasets:
- wikipedia
- mc4
- cc100
- oscar
language:
- ja
---

# japanese-large-lm-1.7b

This repository provides a 1.7B-parameter Japanese language model trained by [LINE Corporation](https://linecorp.com/ja/).

## How to use

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, set_seed

model = AutoModelForCausalLM.from_pretrained("line-corporation/japanese-large-lm-1.7b", torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained("line-corporation/japanese-large-lm-1.7b", use_fast=False)
generator = pipeline("text-generation", model=model, tokenizer=tokenizer, device=0)
set_seed(101)

text = generator(
    "おはようございます、今日の天気は",
    max_length=30,
    do_sample=True,
    pad_token_id=tokenizer.pad_token_id,
    num_return_sequences=5,
)

for t in text:
    print(t)

# [{'generated_text': 'おはようございます、今日の天気は雨模様ですね。梅雨のこの時期の ジメジメ、ムシムシはたまらないですねえ~。 皆さんもお'},
#  {'generated_text': 'おはようございます、今日の天気は快晴。 そして、朝8時15分には、 8月9日現在の、 月島・勝どき・'},
#  {'generated_text': 'おはようございます、今日の天気は曇りです。 朝起きたら雪がチラついていました。 日中も雪が舞い散るような天気です。 朝から寒いですね。'},
#  {'generated_text': 'おはようございます、今日の天気は雨です。昨日、天気が悪く洗濯物を干しにベランダに出た時に雨に降られ、風邪が悪化しそうです。今日洗濯'},
#  {'generated_text': 'おはようございます、今日の天気は晴天ですが涼しい1日です、気温は午後になり 若干下がる予報です。 6月も10日を'}]
```

## Model architecture

| Model | Vocab size | Architecture | Position type | Layers | Hidden dim | Attention heads |
| :---: | :--------: | :----------: | :-----------: | :----: | :--------: | :-------------: |
| 1.7B  | 51200      | GPT2         | Absolute      | 24     | 2304       | 24              |
| 3.6B  | 51200      | GPTNeoX      | RoPE          | 30     | 3072       | 32              |

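The "1.7B" in the model name can be sanity-checked against the table's dimensions. A rough back-of-the-envelope estimate for a GPT-2-style stack (token embeddings plus 24 transformer blocks; biases, layer norms, and position embeddings ignored for simplicity):

```python
# Rough GPT-2-style parameter count from the table's hyperparameters.
vocab, layers, d = 51200, 24, 2304

embed = vocab * d                       # token embedding matrix
per_block = 4 * d * d + 2 * 4 * d * d   # attention (QKV + output) + MLP (4x expansion)
total = embed + layers * per_block

print(f"{total / 1e9:.2f}B")  # ~1.65B, consistent with the 1.7B name
```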
## Training Corpus

Our training corpus consists of the Japanese portions of publicly available corpora such as C4, CC-100, and OSCAR.
We also incorporated web texts crawled by an in-house system.
The total size of our training corpus is about 650 GB.
The trained model achieves a perplexity of 8.57 on our internal validation set of Japanese C4.

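For reference, perplexity is the exponential of the mean per-token negative log-likelihood, so the reported 8.57 corresponds to an average cross-entropy of roughly 2.15 nats per token:

```python
import math

# Perplexity = exp(mean negative log-likelihood per token).
def perplexity(token_nlls):
    return math.exp(sum(token_nlls) / len(token_nlls))

# An average NLL of ~2.148 nats/token corresponds to PPL ~ 8.57.
print(round(perplexity([2.148] * 3), 2))  # 8.57
```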
## Tokenization

We use a SentencePiece tokenizer with a unigram language model and byte-fallback.
We do **not** apply pre-tokenization with a Japanese tokenizer.
Thus, users can feed raw sentences directly into the tokenizer.

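Byte-fallback means that characters absent from the vocabulary are decomposed into UTF-8 byte tokens instead of collapsing to an unknown token. A toy sketch of the idea (this is **not** the actual SentencePiece algorithm, and `TOY_VOCAB` is invented purely for illustration):

```python
# Toy illustration of byte-fallback: pieces missing from the vocabulary
# are emitted as UTF-8 byte tokens like <0xE2> instead of <unk>.
TOY_VOCAB = {"おはよう", "ござい", "ます", "今日", "の", "天気"}

def toy_tokenize(text):
    tokens = []
    i = 0
    while i < len(text):
        # Greedy longest-match against the toy vocabulary.
        for j in range(len(text), i, -1):
            if text[i:j] in TOY_VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Byte-fallback: emit each UTF-8 byte of the unknown character.
            tokens.extend(f"<0x{b:02X}>" for b in text[i].encode("utf-8"))
            i += 1
    return tokens

print(toy_tokenize("おはようございます"))  # ['おはよう', 'ござい', 'ます']
print(toy_tokenize("今日の天気☀"))        # '☀' falls back to byte tokens
```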
## License

[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)