sho-takase committed on
Commit
bf3ad25
1 Parent(s): fdb7188

Update README.md

Files changed (1)
  1. README.md +70 -70
README.md CHANGED
@@ -1,71 +1,71 @@
- model = AutoModelForCausalLM.from_pretrained("sbintuitions/sarashina2-7b", torch_dtype=torch.bfloat16)
+ model = AutoModelForCausalLM.from_pretrained("sbintuitions/sarashina2-7b", torch_dtype=torch.bfloat16, device_map="auto")
  tokenizer = AutoTokenizer.from_pretrained("sbintuitions/sarashina2-7b", use_fast=False)
- generator = pipeline("text-generation", model=model, tokenizer=tokenizer, device_map="auto")
+ generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

The updated README.md in full:
---
license: mit
language:
- ja
- en
---

# Sarashina2-7B

This repository provides large language models trained by [SB Intuitions](https://www.sbintuitions.co.jp/).

## How to use

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, set_seed

# Load the model in bfloat16 and place it on available devices with device_map="auto".
model = AutoModelForCausalLM.from_pretrained("sbintuitions/sarashina2-7b", torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("sbintuitions/sarashina2-7b", use_fast=False)
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
set_seed(123)

# Sample three continuations of a Japanese prompt.
text = generator(
    "おはようございます、今日の天気は",
    max_length=30,
    do_sample=True,
    pad_token_id=tokenizer.pad_token_id,
    num_return_sequences=3,
)

for t in text:
    print(t)

# Example outputs generated by the sarashina2-7b model:
# {'generated_text': 'おはようございます、今日の天気は晴れです。ちょっと風が強い。\n昨日は、久しぶりにゆっくりとしていました。\n2週間位間があいてしまったかも、でもその間に'}
# {'generated_text': 'おはようございます、今日の天気は曇。朝は曇っていてどんよりしていましたね。昼からは晴れそうですが。気温は徐々に上昇しています。昨日は春らしい陽気でした。'}
# {'generated_text': 'おはようございます、今日の天気はくもり、少し寒気がします。 この土日に、家族で一泊二日で旅行に行ってきました。といっても、100キロ'}
```
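
Note that `device_map="auto"` relies on the [accelerate](https://github.com/huggingface/accelerate) library; if it is not already installed, run `pip install accelerate` first.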

## Configuration

| Parameters | Vocab size | Training tokens | Architecture | Position type | Layers | Hidden dim | Attention heads |
| :-----: | :-----------: | :-------------: | :------------ | :-----------: | :----: | :--------: | :-------------: |
| [7B](https://huggingface.co/sbintuitions/sarashina2-7b) | 102400 | 2.1T | Llama2 | RoPE | 32 | 4096 | 32 |
| [13B](https://huggingface.co/sbintuitions/sarashina2-13b) | 102400 | 2.1T | Llama2 | RoPE | 40 | 5120 | 40 |
| 70B (TBA) | | | | | | | |
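
These hyperparameters can also be read from the model's configuration at runtime. Below is a minimal check, assuming the standard `transformers` `AutoConfig` API and access to the Hugging Face Hub:

```python
from transformers import AutoConfig

# Download only the configuration file, not the model weights.
config = AutoConfig.from_pretrained("sbintuitions/sarashina2-7b")

# For a Llama-2-style model these attributes should mirror the table above.
print(config.vocab_size)           # 102400
print(config.num_hidden_layers)    # 32
print(config.hidden_size)          # 4096
print(config.num_attention_heads)  # 32
```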

## Training Corpus

For our Japanese training data, we used the Japanese portion of the [Common Crawl corpus](https://commoncrawl.org/), the largest publicly available Web corpus.
To clean the training corpus, we used [CCNet](https://github.com/facebookresearch/cc_net) and [HojiChar](https://github.com/HojiChar/HojiChar) (a toy sketch of this style of filtering is shown at the end of this section).
After cleaning, our Japanese training data contains about 1T tokens.

For our English training data, we extracted English documents from [SlimPajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B), excluding the Books3 subset because of copyright issues.
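
The snippet below is only a toy illustration of the kind of length- and language-based document filtering such cleaning pipelines apply; the thresholds and rules are invented for this example and are not the actual Sarashina2 settings (the real pipeline relies on CCNet and HojiChar).

```python
import re

# Hiragana, katakana, and common kanji ranges.
JA_CHARS = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")

# Illustrative thresholds only; not the real cleaning configuration.
MIN_CHARS = 200
MIN_JA_RATIO = 0.5

def keep_document(text: str) -> bool:
    """Keep a raw Web document only if it is long enough and mostly Japanese."""
    if len(text) < MIN_CHARS:
        return False
    ja_ratio = len(JA_CHARS.findall(text)) / len(text)
    return ja_ratio >= MIN_JA_RATIO

documents = [
    "おはようございます、今日の天気は晴れです。" * 20,  # mostly Japanese, long enough
    "Buy now! Limited offer!!!" * 20,                    # non-Japanese boilerplate
]
cleaned = [doc for doc in documents if keep_document(doc)]
print(len(cleaned))  # -> 1
```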

## Tokenization

We use a [sentencepiece](https://github.com/google/sentencepiece) tokenizer with a unigram language model and byte-fallback.
We do not apply pre-tokenization with a Japanese tokenizer.
Thus, raw sentences can be fed directly into the tokenizer, as in the example below.
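
A small check of this behaviour, assuming the tokenizer files can be fetched from the Hub:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sbintuitions/sarashina2-7b", use_fast=False)

# Raw Japanese text goes straight in: the unigram SentencePiece model segments it,
# and byte-fallback covers characters outside the 102,400-entry vocabulary.
text = "おはようございます、今日の天気は晴れです。"
ids = tokenizer.encode(text)
print(ids)
print(tokenizer.convert_ids_to_tokens(ids))
```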

## Ethical Considerations and Limitations

Sarashina2 has not been tuned to follow instructions yet.
Therefore, Sarashina2 may generate meaningless sequences, inaccurate content, or biased/objectionable outputs.
Before using Sarashina2, we would like developers to tune the models based on human preferences and safety considerations.

## License

[MIT License](https://huggingface.co/sbintuitions/sarashina2-7b/blob/main/LICENSE)