yoonyoon committed
Commit a3482bc · 1 Parent(s): 5b96858
.ipynb_checkpoints/config-checkpoint.json DELETED
@@ -1,27 +0,0 @@
- {
-   "_name_or_path": "./yi-ko-6b/",
-   "architectures": [
-     "LlamaForCausalLM"
-   ],
-   "bos_token_id": 1,
-   "eos_token_id": 2,
-   "hidden_act": "silu",
-   "hidden_size": 4096,
-   "initializer_range": 0.02,
-   "intermediate_size": 11008,
-   "max_position_embeddings": 2048,
-   "model_type": "llama",
-   "num_attention_heads": 32,
-   "num_hidden_layers": 32,
-   "num_key_value_heads": 4,
-   "pad_token_id": 0,
-   "pretraining_tp": 1,
-   "rms_norm_eps": 1e-05,
-   "rope_scaling": null,
-   "rope_theta": 10000.0,
-   "tie_word_embeddings": false,
-   "torch_dtype": "bfloat16",
-   "transformers_version": "4.33.1",
-   "use_cache": true,
-   "vocab_size": 78464
- }
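For context, the deleted checkpoint file mirrors the model's published configuration. A minimal sketch of inspecting those fields with `transformers.AutoConfig` (assuming the gated `beomi/Yi-Ko-6B` repo is accessible; an access token may be required):

```python
from transformers import AutoConfig

# Load the Llama-style configuration from the Hub (same schema as the deleted checkpoint file).
config = AutoConfig.from_pretrained("beomi/Yi-Ko-6B")

print(config.model_type)   # "llama"
print(config.vocab_size)   # 78464, i.e. the expanded Korean/English vocabulary
# Grouped-query attention: 32 query heads share 4 key/value heads (8 queries per KV head).
print(config.num_attention_heads, config.num_key_value_heads)
```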
README.md DELETED
@@ -1,93 +0,0 @@
- ---
- extra_gated_heading: Access beomi/Yi-Ko-6B on Hugging Face
- extra_gated_button_content: Submit
- extra_gated_fields:
-   I agree to share my name, email address and username: checkbox
-   I confirm that I understand this project is for research purposes only, and confirm that I agree to follow the LICENSE of this model: checkbox
- language:
- - en
- - ko
- pipeline_tag: text-generation
- inference: false
- tags:
- - pytorch
- - Yi-Ko
- - 01-ai
- - Yi
- library_name: transformers
- license: other
- ---
-
- > Update @ 2023.12.03 Yi-Ko(KoEN)-6B achieved #1🥇 among Pretrained Models on the [Open Korean LLM Leaderboard](https://huggingface.co/spaces/upstage/open-ko-llm-leaderboard)! 🎉
-
- > Update @ 2023.12.01 Alpha release of the Yi-Ko(KoEN)-6B model 🎉
-
- # **beomi/Yi-Ko-6B**
-
- Yi-Ko series models serve as advanced iterations of the 01-ai/Yi models,
- benefiting from an expanded vocabulary and the inclusion of a Korean/English corpus in their further pretraining.
- Like their predecessors, Yi-Ko series models are generative text models ranging from 6 billion to 34 billion parameters.
- This repository hosts the **6B** pretrained version,
- which is tailored to the Hugging Face Transformers format.
- For access to the other models, consult the index below.
-
- ## Model Details
-
- **Model Developers** Junbum Lee (Beomi)
-
- **Variations** The Yi-Ko series will come in a range of parameter sizes: 6B and 34B variants.
-
- **Input** Models accept text input only.
-
- **Output** Models generate text only.
-
- **Model Architecture**
-
- Yi-Ko series models are auto-regressive language models that use an optimized transformer architecture based on Llama-2*.
-
- <small>*The Yi model architecture is based on Llama 2, so it can be loaded via the `LlamaForCausalLM` class in Hugging Face Transformers.</small>
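A minimal sketch of that loading path (the repo id comes from this card; the gated repo may require a Hugging Face access token, and the prompt and generation settings here are purely illustrative):

```python
import torch
from transformers import AutoTokenizer, LlamaForCausalLM

# Yi-Ko follows the Llama architecture, so the standard Llama classes apply.
model_id = "beomi/Yi-Ko-6B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = LlamaForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

inputs = tokenizer("안녕하세요, 오늘은", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```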
-
- |Model Name|Training Data|Params|Context Length|GQA|Trained Tokens|LR|Batch Size (per step)|
- |---|---|---|---|---|---|---|---|
- |Yi-Ko-6B|*A mix of Korean + English online data*|6B|4k|O|>60B|5e-5|2048|
-
- **Vocab Expansion**
-
- | Model Name | Vocabulary Size | Description |
- | --- | --- | --- |
- | Original Yi-Series | 64000 | SentencePiece BPE |
- | **Expanded Yi-Ko Series** | 78464 | SentencePiece BPE, with added Korean vocab and merges |
-
- **Tokenizing "안녕하세요, 오늘은 날씨가 좋네요.ㅎㅎ"**
-
- | Model | # of tokens | Tokens |
- | --- | --- | --- |
- | Original Yi-Series | 47 | `['<0xEC>', '<0x95>', '<0x88>', '<0xEB>', '<0x85>', '<0x95>', '하', '<0xEC>', '<0x84>', '<0xB8>', '<0xEC>', '<0x9A>', '<0x94>', ',', '▁', '<0xEC>', '<0x98>', '<0xA4>', '<0xEB>', '<0x8A>', '<0x98>', '은', '▁', '<0xEB>', '<0x82>', '<0xA0>', '<0xEC>', '<0x94>', '<0xA8>', '가', '▁', '<0xEC>', '<0xA2>', '<0x8B>', '<0xEB>', '<0x84>', '<0xA4>', '<0xEC>', '<0x9A>', '<0x94>', '.', '<0xE3>', '<0x85>', '<0x8E>', '<0xE3>', '<0x85>', '<0x8E>']` |
- | **Expanded Yi-Ko Series** | 10 | `['▁안녕', '하세요', ',', '▁오늘은', '▁날', '씨가', '▁좋네요', '.', 'ㅎ', 'ㅎ']` |
- | <small>*Same Korean vocab as the Llama-2-Ko series</small> | | |
-
- **Tokenizing "The Yi series models are large language models trained from scratch by developers at 01.AI."**
-
- | Model | # of tokens | Tokens |
- | --- | --- | --- |
- | Original Yi-Series | 21 | `['The', '▁Y', 'i', '▁series', '▁models', '▁are', '▁large', '▁language', '▁models', '▁trained', '▁from', '▁scratch', '▁by', '▁developers', '▁at', '▁', '0', '1', '.', 'AI', '.']` |
- | **Expanded Yi-Ko Series** | 21 | `['▁The', '▁Y', 'i', '▁series', '▁models', '▁are', '▁large', '▁language', '▁models', '▁trained', '▁from', '▁scratch', '▁by', '▁developers', '▁at', '▁', '0', '1', '.', 'AI', '.']` |
- | <small>*Same Korean vocab as the Llama-2-Ko series</small> | | <small>*Since the **Expanded Yi-Ko Series** prepends `▁` at the beginning of the text (to keep Korean tokenization consistent), the only difference in English tokenization is a negligible change to the first token.</small> |
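A hedged sketch of reproducing this comparison with `AutoTokenizer` (the use of `01-ai/Yi-6B` to stand in for the original Yi tokenizer is an assumption of this example, not stated in the card):

```python
from transformers import AutoTokenizer

korean = "안녕하세요, 오늘은 날씨가 좋네요.ㅎㅎ"
english = "The Yi series models are large language models trained from scratch by developers at 01.AI."

# NOTE: repo ids are illustrative; gated repos may require authentication.
for name in ["01-ai/Yi-6B", "beomi/Yi-Ko-6B"]:
    tok = AutoTokenizer.from_pretrained(name)
    for text in (korean, english):
        pieces = tok.tokenize(text)
        print(f"{name}: {len(pieces)} tokens -> {pieces}")
```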
-
- # **Model Benchmark**
-
- ## LM Eval Harness - Korean (polyglot branch)
-
- TBD
-
- ## LICENSE
-
- TBD
-
- ## Citation
-
- TBD
-
- ## Acknowledgement
-
- The training is supported by the [TPU Research Cloud](https://sites.research.google/trc/) program.