---
license: apache-2.0
language:
- en
- zh
pipeline_tag: text-generation
tags:
- TransNormerLLM
---

<div align="center">
<h1>
  TransNormerLLM3 -- A Faster and Better LLM
</h1>
</div>

# Introduction

This official repository presents the TransNormerLLM3 model and releases its open-source weights at every 50 billion tokens processed during pre-training.

[TransNormerLLM](https://arxiv.org/abs/2307.14995), which evolved from [TransNormer](https://arxiv.org/abs/2210.10340), stands out as the first LLM built on a linear transformer architecture. It also distinguishes itself as the first non-Transformer LLM to exceed both traditional Transformers and other efficient architectures (such as RetNet and Mamba) in speed and performance.

> Update (Apr. 7): We plan to scale the sequence length in the pre-training stage to **10 million**: https://twitter.com/opennlplab/status/1776894730015789300

# TransNormerLLM3
- **TransNormerLLM3-15B** features **14.83 billion** parameters. It is structured with **42 layers**, includes **40 attention heads**, and has a total **embedding size of 5120**.
- **TransNormerLLM3-15B** is built entirely on **[Lightning Attention-2](http://arxiv.org/abs/2401.04658)**, which maintains a **stable TGS (tokens per GPU per second)** when training on **unlimited sequence lengths**, up to hard limits such as GPU memory; a minimal sketch of this block-wise scheme appears after the figure below.
- The **Tiktoken** tokenizer is used, with a total **vocabulary size** of about **100,000**.
- Our **training framework** integrates **[LASP](https://arxiv.org/abs/2404.02882) (Linear Attention Sequence Parallelism)**, enabling sequence parallelism within linear attention models.
- Our **training framework** also supports **[CO2](https://arxiv.org/abs/2401.16265)**, which introduces **local updates** and **asynchronous communication** into distributed data-parallel training, achieving **full overlap** of communication and computation.
<p align="center">
  <img src="./images/TransNormer3.jpg" width="65%" />
</p>
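The core idea behind Lightning Attention-2 is to split the sequence into blocks, handle the causal part of each block conventionally, and carry the contribution of all earlier blocks in a running key-value state, so compute and memory stay linear in sequence length. Below is a minimal PyTorch sketch of that block-wise split (decay factors, normalization, and the fused kernels are omitted); the function and shapes are illustrative assumptions, not the model's actual implementation.

```python
import torch

def block_linear_attention(q, k, v, block_size=64):
    """Causal linear attention computed block by block (simplified sketch).

    q, k, v: (batch, heads, seq_len, dim); seq_len must be divisible by
    block_size in this sketch.
    """
    b, h, n, d = q.shape
    out = torch.empty_like(v)
    # Running key-value state summarizing every block processed so far.
    kv = torch.zeros(b, h, d, v.shape[-1], dtype=q.dtype, device=q.device)
    causal = torch.tril(torch.ones(block_size, block_size, dtype=q.dtype, device=q.device))
    for s in range(0, n, block_size):
        qi = q[:, :, s:s + block_size]
        ki = k[:, :, s:s + block_size]
        vi = v[:, :, s:s + block_size]
        # Intra-block: ordinary masked attention inside the block.
        intra = (qi @ ki.transpose(-2, -1) * causal) @ vi
        # Inter-block: one matmul against the accumulated state covers
        # all earlier positions at once.
        inter = qi @ kv
        out[:, :, s:s + block_size] = intra + inter
        # Fold this block into the state before moving on.
        kv = kv + ki.transpose(-2, -1) @ vi
    return out
```

Because the state `kv` is a (dim x dim) matrix per head regardless of position, the memory for the recurrent part does not grow with sequence length, which is what keeps TGS stable as sequences get longer.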

### Pre-training Logbook
* Real-time tracking: https://api.wandb.ai/links/opennlplab/kip314lq
* Join the discussion: [discord](https://discord.gg/JEU3nTcWKC) <<<>>> [wechat group](https://github.com/OpenNLPLab/TransnormerLLM/blob/main/images/contact_me_qr.png)
> --23.12.25-- startup: [WeChat - Pre-training Launch](https://mp.weixin.qq.com/s/YjUY-uy89WkF75_-rBTuKw) <<<>>> [Twitter - Pre-training Commences](https://twitter.com/opennlplab/status/1739568669502611825) <<<>>> [YouTube Recording](https://t.co/wk7svS4o5r) <<<>>> [bilibili Replay](https://www.bilibili.com/video/BV11j411J7Dy)
> --24.01.02-- first week review: [WeChat - Week 1 Overview](https://mp.weixin.qq.com/s/zwGnZZI3itNPoxzzXkuU2w) <<<>>> [Twitter - Week 1 Review](https://twitter.com/opennlplab/status/1742187694078501038)
> --24.01.09-- second week review: [WeChat - Week 2 Overview](https://mp.weixin.qq.com/s/6D0qi-0aBier05OKuHfPEA) <<<>>> [Twitter - Week 2 Review](https://twitter.com/opennlplab/status/1744720007299523063)
> --24.01.15-- third week review: [WeChat - Week 3 Overview](https://mp.weixin.qq.com/s/EQg8evZ2cNtAk4HruwCXPA) <<<>>> [Twitter - Week 3 Review](https://twitter.com/opennlplab/status/1746920293069910190)
> --24.01.23-- fourth week review: [WeChat - Week 4 Overview](https://mp.weixin.qq.com/s/l7LrFGQKkPU38exUtSF4cw) <<<>>> [Twitter - Week 4 Review](https://twitter.com/opennlplab/status/1749821039360840001)
> --24.01.30-- fifth week review: [WeChat - Week 5 Overview](https://mp.weixin.qq.com/s/OgtQIb749IbX6y5C01bLFg) <<<>>> [Twitter - Week 5 Review](https://twitter.com/opennlplab/status/1752366090754425283)


# Released Weights

|  param  | token |                                                       Hugging Face                                                        | Model Scope | Wisemodel |
| :-----: | :---: | :-----------------------------------------------------------------------------------------------------------------------: | :---------: | :-------: |
| **15B** |  50B  |   ๐Ÿค—[step13000](https://huggingface.co/OpenNLPLab/TransNormerLLM3-15B-Intermediate-Checkpoints/tree/step13000-50Btokens)   |      ๐Ÿค–      |     ๐Ÿฏ     |
| **15B** | 100B  |  ๐Ÿค—[step26000](https://huggingface.co/OpenNLPLab/TransNormerLLM3-15B-Intermediate-Checkpoints/tree/step26000-100Btokens)   |      ๐Ÿค–      |     ๐Ÿฏ     |
| **15B** | 150B  |  ๐Ÿค—[step39000](https://huggingface.co/OpenNLPLab/TransNormerLLM3-15B-Intermediate-Checkpoints/tree/step39000-150Btokens)   |      ๐Ÿค–      |     ๐Ÿฏ     |
| **15B** | 200B  |  ๐Ÿค—[step52000](https://huggingface.co/OpenNLPLab/TransNormerLLM3-15B-Intermediate-Checkpoints/tree/step52000-200Btokens)   |      ๐Ÿค–      |     ๐Ÿฏ     |
| **15B** | 250B  |  ๐Ÿค—[step65000](https://huggingface.co/OpenNLPLab/TransNormerLLM3-15B-Intermediate-Checkpoints/tree/step65000-250Btokens)   |      ๐Ÿค–      |     ๐Ÿฏ     |
| **15B** | 300B  |  ๐Ÿค—[step78000](https://huggingface.co/OpenNLPLab/TransNormerLLM3-15B-Intermediate-Checkpoints/tree/step78000-300Btokens)   |      ๐Ÿค–      |     ๐Ÿฏ     |
| **15B** | 350B  |  ๐Ÿค—[step92000](https://huggingface.co/OpenNLPLab/TransNormerLLM3-15B-Intermediate-Checkpoints/tree/step92000-350Btokens)   |      ๐Ÿค–      |     ๐Ÿฏ     |
| **15B** | 400B  | ๐Ÿค—[step105000](https://huggingface.co/OpenNLPLab/TransNormerLLM3-15B-Intermediate-Checkpoints/tree/step105000-400Btokens)  |      ๐Ÿค–      |     ๐Ÿฏ     |
| **15B** | 450B  | ๐Ÿค—[step118000](https://huggingface.co/OpenNLPLab/TransNormerLLM3-15B-Intermediate-Checkpoints/tree/step118000-450Btokens)  |      ๐Ÿค–      |     ๐Ÿฏ     |
| **15B** | 500B  | ๐Ÿค—[step131000](https://huggingface.co/OpenNLPLab/TransNormerLLM3-15B-Intermediate-Checkpoints/tree/step131000-500Btokens)  |      ๐Ÿค–      |     ๐Ÿฏ     |
| **15B** | 550B  | ๐Ÿค—[step144000](https://huggingface.co/OpenNLPLab/TransNormerLLM3-15B-Intermediate-Checkpoints/tree/step144000-550Btokens)  |      ๐Ÿค–      |     ๐Ÿฏ     |
| **15B** | 600B  | ๐Ÿค—[step157000](https://huggingface.co/OpenNLPLab/TransNormerLLM3-15B-Intermediate-Checkpoints/tree/step157000-600Btokens)  |      ๐Ÿค–      |     ๐Ÿฏ     |
| **15B** | 650B  | ๐Ÿค—[step170000](https://huggingface.co/OpenNLPLab/TransNormerLLM3-15B-Intermediate-Checkpoints/tree/step170000-650Btokens)  |      ๐Ÿค–      |     ๐Ÿฏ     |
| **15B** | 700B  | ๐Ÿค—[step183000](https://huggingface.co/OpenNLPLab/TransNormerLLM3-15B-Intermediate-Checkpoints/tree/step183000-700Btokens)  |      ๐Ÿค–      |     ๐Ÿฏ     |
| **15B** | 750B  | ๐Ÿค—[step195500](https://huggingface.co/OpenNLPLab/TransNormerLLM3-15B-Intermediate-Checkpoints/tree/step195500-750Btokens)  |      ๐Ÿค–      |     ๐Ÿฏ     |
| **15B** | 800B  | ๐Ÿค—[step209000](https://huggingface.co/OpenNLPLab/TransNormerLLM3-15B-Intermediate-Checkpoints/tree/step209000-800Btokens)  |      ๐Ÿค–      |     ๐Ÿฏ     |
| **15B** | 850B  | ๐Ÿค—[step222000](https://huggingface.co/OpenNLPLab/TransNormerLLM3-15B-Intermediate-Checkpoints/tree/step222000-850Btokens)  |      ๐Ÿค–      |     ๐Ÿฏ     |
| **15B** | 900B  | ๐Ÿค—[step235000](https://huggingface.co/OpenNLPLab/TransNormerLLM3-15B-Intermediate-Checkpoints/tree/step235000-900Btokens)  |      ๐Ÿค–      |     ๐Ÿฏ     |
| **15B** | 950B  | ๐Ÿค—[step248000](https://huggingface.co/OpenNLPLab/TransNormerLLM3-15B-Intermediate-Checkpoints/tree/step248000-950Btokens)  |      ๐Ÿค–      |     ๐Ÿฏ     |
| **15B** | 1000B | ๐Ÿค—[step261000](https://huggingface.co/OpenNLPLab/TransNormerLLM3-15B-Intermediate-Checkpoints/tree/step261000-1000Btokens) |      ๐Ÿค–      |     ๐Ÿฏ     |
| **15B** | 1050B | ๐Ÿค—[step274000](https://huggingface.co/OpenNLPLab/TransNormerLLM3-15B-Intermediate-Checkpoints/tree/step274000-1050Btokens) |      ๐Ÿค–      |     ๐Ÿฏ     |
| **15B** | 1100B | ๐Ÿค—[step287000](https://huggingface.co/OpenNLPLab/TransNormerLLM3-15B-Intermediate-Checkpoints/tree/step287000-1100Btokens) |      ๐Ÿค–      |     ๐Ÿฏ     |
| **15B** | 1150B | ๐Ÿค—[step300000](https://huggingface.co/OpenNLPLab/TransNormerLLM3-15B-Intermediate-Checkpoints/tree/step300000-1150Btokens) |      ๐Ÿค–      |     ๐Ÿฏ     |
| **15B** | 1200B | ๐Ÿค—[step313500](https://huggingface.co/OpenNLPLab/TransNormerLLM3-15B-Intermediate-Checkpoints/tree/step313500-1200Btokens) |      ๐Ÿค–      |     ๐Ÿฏ     |
| **15B** | 1250B | ๐Ÿค—[step326000](https://huggingface.co/OpenNLPLab/TransNormerLLM3-15B-Intermediate-Checkpoints/tree/step326000-1250Btokens) |      ๐Ÿค–      |     ๐Ÿฏ     |
| **15B** | 1300B | ๐Ÿค—[step339500](https://huggingface.co/OpenNLPLab/TransNormerLLM3-15B-Intermediate-Checkpoints/tree/step339500-1300Btokens) |      ๐Ÿค–      |     ๐Ÿฏ     |
| **15B** | 1345B |     ๐Ÿค—[stage1](https://huggingface.co/OpenNLPLab/TransNormerLLM3-15B-Intermediate-Checkpoints/tree/stage1-1345Btokens)     |      ๐Ÿค–      |     ๐Ÿฏ     |
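
All intermediate checkpoints live as branches of a single Hugging Face repository, so they can also be enumerated programmatically. A small sketch using `huggingface_hub` (assuming a release that exposes the `list_repo_refs` helper):

```python
# List every checkpoint branch of the repository; each branch name encodes
# the training step and tokens seen (e.g. "step13000-50Btokens").
from huggingface_hub import list_repo_refs

refs = list_repo_refs("OpenNLPLab/TransNormerLLM3-15B-Intermediate-Checkpoints")
for branch in refs.branches:
    print(branch.name)
```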



Load a specific checkpoint by passing its branch name as the `revision` argument:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Select an intermediate checkpoint (here: 900B tokens) via `revision`.
tokenizer = AutoTokenizer.from_pretrained("OpenNLPLab/TransNormerLLM3-15B-Intermediate-Checkpoints", revision="step235000-900Btokens", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("OpenNLPLab/TransNormerLLM3-15B-Intermediate-Checkpoints", torch_dtype=torch.bfloat16, revision="step235000-900Btokens", device_map="auto", trust_remote_code=True)
```
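
Once loaded, the checkpoint can be used through the standard `transformers` generation API; the prompt and sampling settings below are purely illustrative:

```python
# Generate a short continuation with the model and tokenizer loaded above.
inputs = tokenizer("The three laws of motion state that", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.9)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```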

# Benchmark Results
The evaluations of all models are conducted using the official settings and the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) framework.
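For reference, a checkpoint can be scored the same way through the harness's Python API. This is a minimal sketch: it assumes an `lm_eval` release that exposes `simple_evaluate`, and the task list covers only a subset of the benchmarks in the table below.

```python
# Evaluate one intermediate checkpoint on a few of the benchmarks below.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args="pretrained=OpenNLPLab/TransNormerLLM3-15B-Intermediate-Checkpoints,revision=stage1-1345Btokens,trust_remote_code=True,dtype=bfloat16",
    tasks=["boolq", "piqa", "hellaswag", "winogrande"],
)
print(results["results"])
```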

| Model                   | P   | T      | BoolQ | PIQA  | HS    | WG    | ARC-e | ARC-c | OBQA  | C-Eval | MMLU  |
| ----------------------- | --- | ------ | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ------ | ----- |
| **TransNormerLLM3-15B** | 15  | 0.05   | 62.08 | 72.52 | 55.55 | 57.14 | 62.12 | 31.14 | 32.40 | 26.18  | 27.50 |
| **TransNormerLLM3-15B** | 15  | 0.10   | 63.98 | 74.70 | 61.09 | 61.33 | 65.95 | 34.64 | 35.60 | 25.38  | 27.40 |
| **TransNormerLLM3-15B** | 15  | 0.15   | 60.34 | 75.08 | 63.99 | 62.04 | 64.56 | 34.90 | 35.20 | 22.64  | 26.60 |
| **TransNormerLLM3-15B** | 15  | 0.20   | 52.05 | 74.48 | 64.72 | 62.75 | 66.16 | 35.15 | 36.80 | 27.25  | 30.80 |
| **TransNormerLLM3-15B** | 15  | 0.25   | 66.70 | 76.50 | 66.51 | 64.80 | 66.84 | 36.18 | 39.40 | 30.87  | 36.10 |
| **TransNormerLLM3-15B** | 15  | 0.30   | 67.00 | 76.50 | 67.17 | 64.40 | 66.29 | 36.77 | 38.80 | 33.99  | 37.60 |
| **TransNormerLLM3-15B** | 15  | 0.35   | 65.78 | 75.46 | 67.88 | 66.54 | 67.34 | 38.57 | 39.60 | 36.02  | 39.20 |
| **TransNormerLLM3-15B** | 15  | 0.40   | 67.34 | 75.24 | 68.51 | 66.22 | 68.94 | 40.10 | 39.20 | 36.91  | 41.10 |
| **TransNormerLLM3-15B** | 15  | 0.45   | 69.02 | 76.28 | 69.11 | 63.77 | 65.82 | 36.01 | 39.40 | 37.17  | 42.80 |
| **TransNormerLLM3-15B** | 15  | 0.50   | 66.15 | 77.09 | 69.75 | 65.11 | 68.56 | 35.84 | 39.60 | 39.81  | 42.00 |
| **TransNormerLLM3-15B** | 15  | 0.55   | 70.24 | 74.05 | 69.96 | 65.75 | 65.61 | 36.69 | 38.60 | 40.08  | 44.00 |
| **TransNormerLLM3-15B** | 15  | 0.60   | 74.34 | 75.68 | 70.44 | 66.22 | 69.36 | 38.40 | 38.40 | 41.05  | 45.30 |
| **TransNormerLLM3-15B** | 15  | 0.65   | 73.15 | 76.55 | 71.60 | 66.46 | 69.65 | 39.68 | 40.80 | 41.20  | 44.90 |
| **TransNormerLLM3-15B** | 15  | 0.70   | 73.79 | 78.18 | 73.26 | 67.56 | 71.21 | 43.60 | 40.80 | 43.46  | 47.00 |
| **TransNormerLLM3-15B** | 15  | 0.75   | 76.45 | 78.07 | 74.22 | 69.30 | 71.21 | 43.43 | 42.20 | 43.46  | 47.80 |
| **TransNormerLLM3-15B** | 15  | 0.80   | 76.97 | 78.84 | 74.95 | 69.85 | 72.14 | 43.52 | 41.20 | 45.21  | 49.41 |
| **TransNormerLLM3-15B** | 15  | 0.85   | 72.75 | 78.35 | 75.91 | 70.48 | 74.58 | 45.22 | 41.20 | 46.27  | 49.36 |
| **TransNormerLLM3-15B** | 15  | 0.90   | 76.09 | 77.91 | 76.49 | 70.88 | 72.14 | 42.92 | 40.20 | 45.70  | 50.15 |
| **TransNormerLLM3-15B** | 15  | 0.95   | 74.28 | 78.24 | 76.63 | 72.22 | 74.12 | 44.11 | 42.40 | 46.25  | 51.43 |
| **TransNormerLLM3-15B** | 15  | 1.00   | 74.62 | 79.16 | 77.35 | 72.22 | 73.86 | 45.14 | 43.40 | 47.90  | 51.65 |
| **TransNormerLLM3-15B** | 15  | 1.05   | 76.36 | 78.94 | 77.15 | 71.35 | 74.66 | 44.45 | 42.80 | 45.87  | 52.28 |
| **TransNormerLLM3-15B** | 15  | 1.10   | 76.88 | 78.73 | 77.62 | 70.88 | 74.41 | 45.48 | 42.80 | 49.78  | 53.01 |
| **TransNormerLLM3-15B** | 15  | 1.15   | 72.87 | 79.43 | 78.12 | 72.85 | 74.75 | 46.16 | 43.20 | 49.80  | 53.04 |
| **TransNormerLLM3-15B** | 15  | 1.20   | 79.48 | 78.67 | 78.45 | 72.93 | 75.42 | 44.37 | 43.60 | 49.33  | 53.80 |
| **TransNormerLLM3-15B** | 15  | 1.25   | 79.17 | 79.16 | 78.81 | 72.93 | 75.13 | 45.99 | 43.60 | 50.44  | 54.19 |
| **TransNormerLLM3-15B** | 15  | 1.30   | 78.41 | 79.00 | 78.39 | 71.90 | 74.33 | 45.05 | 42.80 | 52.24  | 54.41 |
| **TransNormerLLM3-15B** | 15  | stage1 | 78.75 | 79.27 | 78.33 | 71.35 | 75.97 | 46.42 | 45.00 | 50.25  | 54.50 |


> **P**: parameters (billions). **T**: training tokens seen (trillions). **BoolQ**: acc. **PIQA**: acc. **HS** (HellaSwag): acc_norm. **WG** (WinoGrande): acc. **ARC-e** (ARC-easy): acc. **ARC-c** (ARC-challenge): acc_norm. **OBQA** (OpenBookQA): acc_norm. **C-Eval**: 5-shot acc. **MMLU**: 5-shot acc.



# Acknowledgments and Citation

## Acknowledgments
Our project is developed based on the following open source projects:
- [tiktoken](https://github.com/openai/tiktoken) for the tokenizer.
- [metaseq](https://github.com/facebookresearch/metaseq) for training.
- [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) for evaluation.


## Citation
If you wish to cite our work, please use the following references:
```bibtex
@misc{qin2024transnormerllm,
      title={TransNormerLLM: A Faster and Better Large Language Model with Improved TransNormer}, 
      author={Zhen Qin and Dong Li and Weigao Sun and Weixuan Sun and Xuyang Shen and Xiaodong Han and Yunshen Wei and Baohong Lv and Xiao Luo and Yu Qiao and Yiran Zhong},
      year={2024},
      eprint={2307.14995},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@misc{qin2024lightning,
      title={Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models}, 
      author={Zhen Qin and Weigao Sun and Dong Li and Xuyang Shen and Weixuan Sun and Yiran Zhong},
      year={2024},
      eprint={2401.04658},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@misc{sun2024linear,
      title={Linear Attention Sequence Parallelism}, 
      author={Weigao Sun and Zhen Qin and Dong Li and Xuyang Shen and Yu Qiao and Yiran Zhong},
      year={2024},
      eprint={2404.02882},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
```

<p align="center">
  <img src="./images/lightning3-leopard.jpg" width="50%" />
  - OpenNLPLab @2024 -
</p>