Qi Wang

Miniature Chinese Llama2 Basic Model


This is an ultra-mini model with approximately 58M parameters, built on the Llama2 architecture. The uploaded version is pre-trained only and has not yet undergone SFT; a post-SFT chat version will be released soon.

The goals of developing this ultra-mini model are:

  1. To practice the full process of pre-training a basic large language model from scratch.
  2. To provide a fast-deployable environment for the development of large parameter models, as loading large models can be very time-consuming and not conducive to rapid iterative development and debugging.
  3. To enable quick parameter tuning and the reproduction of various optimization algorithms on consumer-level graphics cards.

Training Data

We collected 429 Chinese online fantasy novels and converted them to plain text. Lines with fewer than 10 characters or more than 4096 characters were removed; the result serves as the base data for pre-training.

The organized txt file is 3.3 GB in size and contains 868M Chinese characters across 18M lines.
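The length-based cleaning step described above can be sketched as a simple filter (the function name and thresholds-as-parameters are illustrative, not taken from the repo):

```python
def clean_corpus(lines, min_len=10, max_len=4096):
    """Keep only lines whose character count falls within [min_len, max_len].

    Lines shorter than min_len (e.g. chapter markers, stray punctuation)
    and longer than max_len are dropped, as described in the README.
    """
    return [ln for ln in (l.strip() for l in lines)
            if min_len <= len(ln) <= max_len]
```

Run over each novel's lines before concatenating the files into the final training txt.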

Chinese Tokenizer

The model's tokenizer was also trained from scratch, without relying on any existing tokenizer.

Training Parameters:

  1. Maximum Sentence Length: 2657
  2. Vocabulary Size: 32000
  3. Normalization Rule: identity
  4. Character Coverage: 0.9995
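These settings map directly onto a SentencePiece training invocation, roughly as follows (a configuration sketch: the input path and model_prefix are assumptions, not stated in the README):

```python
import sentencepiece as spm

# Train a tokenizer from scratch on the novel corpus.
# "novels.txt" and "baby_llama2" are placeholder names (assumptions).
spm.SentencePieceTrainer.train(
    input="novels.txt",              # plain-text corpus, one line per row
    model_prefix="baby_llama2",      # writes baby_llama2.model / .vocab
    vocab_size=32000,
    max_sentence_length=2657,
    normalization_rule_name="identity",
    character_coverage=0.9995,
)
```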
                  Llama2   Baby Llama2
tokens            32000    65534
model_max_length  4096     4096

Example 1 (Chinese): 白日依山尽,黄河入海流。欲穷千里目,更上一层楼。

Llama2 tokens: ['▁', '白', '日', '<0xE4>', '<0xBE>', '<0x9D>', '山', '<0xE5>', '<0xB0>', '<0xBD>', ',', '黄', '河', '入', '海', '流', '。', '<0xE6>', '<0xAC>', '<0xB2>', '<0xE7>', '<0xA9>', '<0xB7>', '千', '里', '目', ',', '更', '上', '一', '<0xE5>', '<0xB1>', '<0x82>', '<0xE6>', '<0xA5>', '<0xBC>', '。']
Llama2 ids: [1, 29871, 30868, 30325, 231, 193, 160, 30329, 232, 179, 192, 30214, 31491, 30828, 30752, 30581, 31151, 30267, 233, 175, 181, 234, 172, 186, 31159, 30755, 30895, 30214, 31100, 30429, 30287, 232, 180, 133, 233, 168, 191, 30267]
Baby Llama2 tokens: ['▁白', '日', '依山', '尽', ',', '黄河', '入海', '流', '。', '欲', '穷', '千里', '目', ',', '更', '上一层', '楼', '。']
Baby Llama2 ids: [65534, 1764, 63106, 62484, 63203, 62793, 14729, 29082, 63130, 62795, 63920, 64266, 3271, 63038, 62793, 63007, 17116, 63636, 62795]

Example 2 (English): The primary use of LLaMA is research on large language models, including BERT, XLNet, and RoBERTa.

Llama2 tokens: ['▁The', '▁primary', '▁use', '▁of', '▁L', 'La', 'MA', '▁is', '▁research', '▁on', '▁large', '▁language', '▁models', ',', '▁including', '▁B', 'ERT', ',', '▁X', 'L', 'Net', ',', '▁and', '▁Ro', 'BER', 'T', 'a', '.']
Llama2 ids: [1, 450, 7601, 671, 310, 365, 5661, 1529, 338, 5925, 373, 2919, 4086, 4733, 29892, 3704, 350, 20161, 29892, 1060, 29931, 6779, 29892, 322, 1528, 13635, 29911, 29874, 29889]
Baby Llama2 tokens: ['▁T', 'h', 'e', '▁p', 'ri', 'm', 'ar', 'y', '▁', 'u', 'se', '▁o', 'f', '▁', '<0x4C>', '<0x4C>', 'a', 'M', 'A', '▁i', 's', '▁', 're', 'se', 'ar', 'ch', '▁o', 'n', '▁', 'l', 'ar', 'g', 'e', '▁', 'l', 'ang', 'ua', 'g', 'e', '▁m', 'od', 'e', 'ls', ',', '▁', 'in', 'c', 'lu', 'd', 'i', 'ng', '▁', '<0x42>', '<0x45>', '<0x52>', 'T', ',', '▁', 'X', '<0x4C>', '<0x4E>', 'e', 't', ',', '▁', 'an', 'd', '▁', '<0x52>', 'o', '<0x42>', '<0x45>', '<0x52>', 'T', 'a', '.']
Baby Llama2 ids: [65534, 14962, 63590, 64211, 27052, 16426, 63475, 13594, 64158, 62797, 63569, 11279, 13719, 65368, 62797, 81, 81, 63518, 64918, 64752, 24145, 63338, 62797, 44186, 11279, 13594, 9251, 13719, 63541, 62797, 64399, 13594, 64101, 64211, 62797, 64399, 37035, 36500, 64101, 64211, 2939, 11320, 64211, 53670, 62793, 62797, 18944, 63603, 14575, 64096, 63484, 1171, 62797, 71, 74, 87, 64760, 62793, 62797, 65257, 81, 83, 64211, 63073, 62793, 62797, 6604, 64096, 62797, 87, 63143, 71, 74, 87, 64760, 63518, 62801]

The Llama2 tokenizer has 32,000 tokens and is geared toward English text, while the Baby Llama2 tokenizer has 65,534 tokens and covers Chinese only.

The examples show that Baby Llama2 tokenizes Chinese text more compactly than standard Llama2, while its English tokenization is far weaker, frequently falling back to byte-level tokens.

Full Training Corpus Processing

Before full training, the corpus is vectorized. Using the newly trained tokenizer, the novel txt files are read line by line; each line is encoded, and an eos_token_id is appended to mark the line boundary. The resulting token ids are stored on disk as a two-dimensional np.uint16 array with shape [-1, max_sentence_length].
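A minimal sketch of this packing step (the helper name and the encode callback are illustrative, not from the repo; the tokenizer's encode method would be passed in as `encode`):

```python
import numpy as np

def pack_corpus(lines, encode, eos_id, seq_len):
    """Encode each line, append eos_id as a separator, concatenate,
    and reshape into a [-1, seq_len] np.uint16 array.

    Any trailing remainder shorter than seq_len is dropped. uint16 works
    because the vocabulary (65,534 tokens) fits in 16 bits.
    """
    ids = []
    for line in lines:
        ids.extend(encode(line))
        ids.append(eos_id)
    n = (len(ids) // seq_len) * seq_len          # truncate to a whole number of rows
    return np.asarray(ids[:n], dtype=np.uint16).reshape(-1, seq_len)
```

The packed array can then be written to disk with `arr.tofile(...)` and memory-mapped during training.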

Pre-training

Pre-training is done on a single machine with one RTX 3090. The model uses the Llama2 architecture, with the following training parameters:

  1. max_seq_len = 1024
  2. dim = 768
  3. n_heads = 12
  4. n_layers = 12
  5. n_kv_heads = 12
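The configuration above can be collected into a dataclass in the style of llama2.c's ModelArgs (a sketch: the vocab_size value is inferred from the tokenizer section, and the field layout here is an assumption rather than the repo's actual code):

```python
from dataclasses import dataclass

@dataclass
class ModelArgs:
    """Model hyperparameters for the ~58M-parameter Baby Llama2."""
    dim: int = 768          # hidden size
    n_layers: int = 12      # transformer blocks
    n_heads: int = 12       # attention heads
    n_kv_heads: int = 12    # equal to n_heads, i.e. no grouped-query attention
    vocab_size: int = 65534 # from the retrained Chinese tokenizer
    max_seq_len: int = 1024 # context length used in pre-training
```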

Demonstration

Huggingface Space For Baby Llama2

Citation

llama2.c

baby-llama2-chinese