sho-takase committed on
Commit
d6eef98
1 Parent(s): 5921f5a

add readme

  1. README.md +65 -0
README.md CHANGED
@@ -1,3 +1,68 @@
1
  ---
2
  license: mit
3
  ---
1
  ---
2
  license: mit
3
+ language:
4
+ - ja
5
  ---
6
+
7
+ # Sarashina1-7B
8
+
9
+ This repository provides Japanese language models trained by [SB Intuitions](https://www.sbintuitions.co.jp/).
10
+
11
+
12
+ ## How to use
13
+
14
+ ```python
15
+ import torch
16
+ from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, set_seed
17
+
18
+ model = AutoModelForCausalLM.from_pretrained("sbintuitions/sarashina1-7b", torch_dtype=torch.float16)
19
+ tokenizer = AutoTokenizer.from_pretrained("sbintuitions/sarashina1-7b", use_fast=False)
20
+ generator = pipeline("text-generation", model=model, tokenizer=tokenizer, device=0)
21
+ set_seed(123)
22
+
23
+ text = generator(
24
+     "おはようございます、今日の天気は",
25
+     max_length=30,
26
+     do_sample=True,
27
+     pad_token_id=tokenizer.pad_token_id,
28
+     num_return_sequences=3,
29
+ )
30
+
31
+ for t in text:
32
+     print(t)
33
+
34
+ # Example outputs generated by the sarashina1-7b model
35
+ # {'generated_text': 'おはようございます、今日の天気は晴れ!!最高気温は15度、最低気温は7度です。今日も1日頑張りましょー♪写真は、去年'}
36
+ # {'generated_text': 'おはようございます、今日の天気は曇り:cloud:です。 雨予報なので、洗濯物は家の中へ。 :city_sunrise:の見える時間。 今日は'}
37
+ # {'generated_text': 'おはようございます、今日の天気は、晴れ、気温も10度以上に上がるそうです、お日様が当たっていると15度くらいになると思います、朝の'}
38
+ ```
39
+
40
+ ## Configuration
41
+
42
+ | Parameters | Vocab size | Training tokens | Architecture | Position type | Layers | Hidden dim | Attention heads |
43
+ | :-----: | :-----------: | :-------------: | :----------- | :-----------: | :----: | :--------: | :-------------: |
44
+ | [7B](https://huggingface.co/sbintuitions/sarashina1-7b) | 51200 | 1.0T | GPTNeoX | RoPE | 32 | 4096 | 32 |
45
+ | [13B](https://huggingface.co/sbintuitions/sarashina1-13b) | 51200 | 1.0T | GPTNeoX | RoPE | 40 | 5120 | 40 |
46
+ | [65B](https://huggingface.co/sbintuitions/sarashina1-65b) | 51200 | 800B | GPTNeoX | RoPE | 80 | 8192 | 64 |
47
+
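As a rough sanity check on the table above, the nominal model sizes can be approximately recovered from the layer count, hidden dimension, and vocabulary size. The sketch below assumes a standard decoder block with about 12·d² weights per layer (attention plus a 4x-wide MLP) and untied input and output embeddings, and it ignores biases and layer norms. These are simplifying assumptions rather than figures published by the authors, and the helper name `approx_params` is hypothetical.

```python
# Rough parameter-count estimate from the configuration table.
# Assumptions (not from the model card): each layer has ~12 * d^2 weights
# (4 * d^2 for attention, 8 * d^2 for a 4x-wide MLP), input and output
# embeddings are untied, and biases / layer norms are ignored.

VOCAB_SIZE = 51200

CONFIGS = {
    "7B": {"layers": 32, "hidden": 4096},
    "13B": {"layers": 40, "hidden": 5120},
    "65B": {"layers": 80, "hidden": 8192},
}


def approx_params(layers: int, hidden: int, vocab: int = VOCAB_SIZE) -> int:
    """Return an approximate weight count for a decoder-only transformer."""
    per_layer = 12 * hidden * hidden
    embeddings = 2 * vocab * hidden  # input embedding + output projection
    return layers * per_layer + embeddings


for name, cfg in CONFIGS.items():
    total = approx_params(cfg["layers"], cfg["hidden"])
    fp16_gib = total * 2 / 1024**3  # 2 bytes per weight in float16
    print(f"{name}: ~{total / 1e9:.1f}B parameters, ~{fp16_gib:.0f} GiB in float16")
```

Under these assumptions the estimates land within a few percent of the nominal 7B, 13B, and 65B sizes; the float16 figure covers weights only, not activations or the KV cache.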
48
+ ## Training Corpus
49
+
50
+ We used a Japanese portion of the [Common Crawl corpus](https://commoncrawl.org/), which is the largest Web corpus, as our training dataset.
51
+ To clean the training corpus, we used [CCNet](https://github.com/facebookresearch/cc_net) and [HojiChar](https://github.com/HojiChar/HojiChar).
52
+ After cleaning, our corpus contains about 550B tokens.
53
+
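The corpus size can be compared with the training-token counts in the configuration table. The sketch below simply divides one by the other; it assumes the cleaned corpus was the only training data, which this README suggests but does not state in detail, and the helper name `corpus_passes` is hypothetical.

```python
# Ratio of training tokens to cleaned-corpus tokens, using figures from this README.
# Assumption: the ~550B-token cleaned corpus is the only data source.

CORPUS_TOKENS = 550e9  # "about 550B tokens" after cleaning

TRAINING_TOKENS = {"7B": 1.0e12, "13B": 1.0e12, "65B": 800e9}


def corpus_passes(training_tokens: float, corpus_tokens: float = CORPUS_TOKENS) -> float:
    """Return training tokens divided by corpus size."""
    return training_tokens / corpus_tokens


for name, tokens in TRAINING_TOKENS.items():
    print(f"{name}: ~{corpus_passes(tokens):.2f} passes over the cleaned corpus")
```

Read this only as an order-of-magnitude indication, since "about 550B" is itself approximate.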
54
+ ## Tokenization
55
+
56
+ We use a [sentencepiece](https://github.com/google/sentencepiece) tokenizer with a unigram language model and byte-fallback.
57
+ We do not apply pre-tokenization with a Japanese tokenizer.
58
+ Thus, users can feed raw sentences directly into the tokenizer.
59
+
60
+
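To illustrate what byte-fallback means, here is a toy sketch of the fallback step only. It is not the sentencepiece implementation and does not perform unigram segmentation; the vocabulary and the helper name `byte_fallback` are made up for the example. The idea is that a piece missing from the vocabulary is emitted as one `<0xNN>` token per UTF-8 byte instead of a single unknown token.

```python
# Toy illustration of byte-fallback; NOT the sentencepiece implementation.
# The tiny vocabulary below is invented for this example.

TOY_VOCAB = {"おはよう", "ござい", "ます", "、", "今日", "の", "天気", "は"}


def byte_fallback(piece: str, vocab: set = TOY_VOCAB) -> list:
    """Return the piece itself if known, else one <0xNN> token per UTF-8 byte."""
    if piece in vocab:
        return [piece]
    return [f"<0x{b:02X}>" for b in piece.encode("utf-8")]


print(byte_fallback("天気"))  # known piece, kept as-is
print(byte_fallback("☀"))  # unknown piece, split into UTF-8 byte tokens
```

Because every string can be written as UTF-8 bytes, a tokenizer with this kind of fallback can encode any raw input without an unknown token.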
61
+ ## Ethical Considerations and Limitations
62
+ Sarashina1 has not yet been tuned to follow instructions.
63
+ Therefore, Sarashina1 might generate meaningless sequences, inaccurate content, or biased/objectionable outputs.
64
+ Before using Sarashina1, we ask developers to tune the model based on human preferences and safety considerations.
65
+
66
+ ## License
67
+
68
+ [MIT License](https://huggingface.co/sbintuitions/sarashina1-7b/blob/main/LICENSE)