sho-takase committed on
Commit
a5e3187
1 Parent(s): 4dbed22

Update README.md

Files changed (1)
  1. README.md +67 -67
README.md CHANGED
@@ -1,68 +1,68 @@
- ---
- license: mit
- language:
- - ja
- ---
-
- # Sarashina1-7B
-
- This repository provides Japanese language models trained by [SB Intuitions](https://www.sbintuitions.co.jp/).
-
-
- ## How to use
-
- ```
- import torch
- from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, set_seed
-
- model = AutoModelForCausalLM.from_pretrained("sbintuitions/sarashina1-7b", torch_dtype=torch.float16)
- tokenizer = AutoTokenizer.from_pretrained("sbintuitions/sarashina1-7b", use_fast=False)
- generator = pipeline("text-generation", model=model, tokenizer=tokenizer, device_map="auto")
- set_seed(123)
-
- text = generator(
-     "おはようございます、今日の天気は",
-     max_length=30,
-     do_sample=True,
-     pad_token_id=tokenizer.pad_token_id,
-     num_return_sequences=3,
- )
-
- for t in text:
-     print(t)
-
- # These examples are generated by sarashina1-7b parameters model
- # {'generated_text': 'おはようございます、今日の天気は晴れ!!最高気温は15度、最低気温は7度です。今日も1日頑張りましょー♪写真は、去年'}
- # {'generated_text': 'おはようございます、今日の天気は曇り:cloud:です。 雨予報なので、洗濯物は家の中へ。 :city_sunrise:の見える時間。 今日は'}
- # {'generated_text': 'おはようございます、今日の天気は、晴れ、気温も10度以上に上がるそうです、お日様が当たっていると15度くらいになると思います、朝の'}
- ```
-
- ## Configuration
-
- | Parameters | Vocab size | Trainning tokens | Architecture | Position type | Layers | Hidden dim | Attention heads |
- | :-----: | :-----------: | :-------------: | :----------- | :-----------: | :----: | :--------: | :-------------: |
- | [7B](https://huggingface.co/sbintuitions/sarashina1-7b) | 51200 | 1.0T | GPTNeoX | RoPE | 32 | 4096 | 32 |
- | [13B](https://huggingface.co/sbintuitions/sarashina1-13b) | 51200 | 1.0T | GPTNeoX | RoPE | 40 | 5120 | 40 |
- | [65B](https://huggingface.co/sbintuitions/sarashina1-65b) | 51200 | 800B | GPTNeoX | RoPE | 80 | 8192 | 64 |
-
- ## Training Corpus
-
- We used a Japanese portion of the [Common Crawl corpus](https://commoncrawl.org/), which is the largest Web corpus, as our training dataset.
- To clean the training corpus, we used [CCNet](https://github.com/facebookresearch/cc_net) and [HojiChar](https://github.com/HojiChar/HojiChar).
- After cleaning, our corpus contains about 550B tokens.
-
- ## Tokenization
-
- We use a [sentencepiece](https://github.com/google/sentencepiece) tokenizer with a unigram language model and byte-fallback.
- We do not apply pre-tokenization with Japanese tokenizer.
- Thus, a user may directly feed raw sentences into the tokenizer.
-
-
- ## Ethical Considerations and Limitations
- Sarashina1 has not been tuned to follow an instruction yet.
- Therefore, sarashina1 might generate some meaningless sequences, some inaccurate instances or biased/objectionable outputs.
- Before using sarashina1, we would like developers to tune models based on human preferences and safety considerations.
-
- ## License
-
  [MIT License](https://huggingface.co/sbintuitions/sarashina1-7b/blob/main/LICENSE)
 
+ ---
+ license: mit
+ language:
+ - ja
+ ---
+
+ # Sarashina1-7B
+
+ This repository provides Japanese language models trained by [SB Intuitions](https://www.sbintuitions.co.jp/).
+
+
+ ## How to use
+
+ ```
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, set_seed
+
+ model = AutoModelForCausalLM.from_pretrained("sbintuitions/sarashina1-7b", torch_dtype=torch.float16, device_map="auto")
+ tokenizer = AutoTokenizer.from_pretrained("sbintuitions/sarashina1-7b", use_fast=False)
+ generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
+ set_seed(123)
+
+ text = generator(
+     "おはようございます、今日の天気は",
+     max_length=30,
+     do_sample=True,
+     pad_token_id=tokenizer.pad_token_id,
+     num_return_sequences=3,
+ )
+
+ for t in text:
+     print(t)
+
+ # These examples are generated by sarashina1-7b parameters model
+ # {'generated_text': 'おはようございます、今日の天気は晴れ!!最高気温は15度、最低気温は7度です。今日も1日頑張りましょー♪写真は、去年'}
+ # {'generated_text': 'おはようございます、今日の天気は曇り:cloud:です。 雨予報なので、洗濯物は家の中へ。 :city_sunrise:の見える時間。 今日は'}
+ # {'generated_text': 'おはようございます、今日の天気は、晴れ、気温も10度以上に上がるそうです、お日様が当たっていると15度くらいになると思います、朝の'}
+ ```
+
+ ## Configuration
+
+ | Parameters | Vocab size | Trainning tokens | Architecture | Position type | Layers | Hidden dim | Attention heads |
+ | :-----: | :-----------: | :-------------: | :----------- | :-----------: | :----: | :--------: | :-------------: |
+ | [7B](https://huggingface.co/sbintuitions/sarashina1-7b) | 51200 | 1.0T | GPTNeoX | RoPE | 32 | 4096 | 32 |
+ | [13B](https://huggingface.co/sbintuitions/sarashina1-13b) | 51200 | 1.0T | GPTNeoX | RoPE | 40 | 5120 | 40 |
+ | [65B](https://huggingface.co/sbintuitions/sarashina1-65b) | 51200 | 800B | GPTNeoX | RoPE | 80 | 8192 | 64 |
+
+ ## Training Corpus
+
+ We used a Japanese portion of the [Common Crawl corpus](https://commoncrawl.org/), which is the largest Web corpus, as our training dataset.
+ To clean the training corpus, we used [CCNet](https://github.com/facebookresearch/cc_net) and [HojiChar](https://github.com/HojiChar/HojiChar).
+ After cleaning, our corpus contains about 550B tokens.
+
+ ## Tokenization
+
+ We use a [sentencepiece](https://github.com/google/sentencepiece) tokenizer with a unigram language model and byte-fallback.
+ We do not apply pre-tokenization with Japanese tokenizer.
+ Thus, a user may directly feed raw sentences into the tokenizer.
+
+
+ ## Ethical Considerations and Limitations
+ Sarashina1 has not been tuned to follow an instruction yet.
+ Therefore, sarashina1 might generate some meaningless sequences, some inaccurate instances or biased/objectionable outputs.
+ Before using sarashina1, we would like developers to tune models based on human preferences and safety considerations.
+
+ ## License
+
  [MIT License](https://huggingface.co/sbintuitions/sarashina1-7b/blob/main/LICENSE)
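
As a rough sanity check on the Configuration table in the README above, the model names can be related to the listed layer count, hidden dimension, and vocabulary size. The sketch below is not part of the repository: `approx_params` is a hypothetical helper, and it assumes a standard GPT-NeoX-style block (about 4·d² attention weights plus 8·d² MLP weights with a 4x expansion), untied input and output embeddings, and it ignores biases and layer norms.

```python
# Rough parameter estimate for a GPT-NeoX-style decoder.
# Assumptions (not stated in the README): untied input/output embeddings,
# 4x MLP expansion, biases and layer-norm parameters ignored.

def approx_params(layers: int, hidden: int, vocab: int) -> int:
    """Return an approximate parameter count from the table's columns."""
    per_layer = 12 * hidden * hidden   # 4*d^2 (attention) + 8*d^2 (MLP)
    embeddings = 2 * vocab * hidden    # input embedding + output projection
    return layers * per_layer + embeddings

# (layers, hidden dim, vocab size) copied from the Configuration table.
CONFIGS = {
    "7B": (32, 4096, 51200),
    "13B": (40, 5120, 51200),
    "65B": (80, 8192, 51200),
}

for name, (layers, hidden, vocab) in CONFIGS.items():
    estimate = approx_params(layers, hidden, vocab)
    print(f"{name}: ~{estimate / 1e9:.2f}B parameters")
```

Under these assumptions the estimates come out at roughly 6.86B, 13.11B, and 65.26B, which is consistent with the 7B, 13B, and 65B labels in the table.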