BAAI
/

shunxing1234 committed on
Commit 4173554 · 1 Parent(s): 4142516

Update README.md

Files changed (1)
  1. README.md +17 -126
README.md CHANGED
@@ -2,144 +2,35 @@
  license: other
  ---
 
- # Aquila-7B
 
- ## Overview
- The Aquila language model inherits the architectural strengths of GPT-3 and LLaMA, replaces a batch of underlying operators with more efficient implementations, redesigns the tokenizer for Chinese-English bilingual support, and upgrades the BMTrain parallel training method. It is trained from scratch on high-quality Chinese and English corpora. Through data quality control and a range of training optimizations, it achieves better performance than other open-source models with a smaller dataset and a shorter training time. It is also the first large-scale open-source language model that supports Chinese-English bilingual knowledge, permits commercial use under its license, and complies with domestic data regulations.
 
- | Model | State | Commercial use? | GPU |
- | :---------------- | :------- | :-- | :-- |
- | Aquila-7B | Released | ✅ | Nvidia-A100 |
- | AquilaChat-7B | Released | ✅ | Nvidia-A100 |
- | AquilaCode-7B-NV | Released | ✅ | Nvidia-A100 |
- | AquilaCode-7B-TS | Released | ✅ | Tianshu-BI-V100 |
- | Aquila-33B | **Coming soon** | ✅ | Nvidia-A100 |
- | AquilaChat-33B | **Coming soon** | ✅ | Nvidia-A100 |
 
- We used a series of more efficient low-level operators to assist model training, including an approach based on [flash-attention](https://github.com/HazyResearch/flash-attention) with some intermediate computations replaced, as well as RMSNorm. On top of this, we upgraded [BMTrain](https://github.com/OpenBMB/BMTrain) for lightweight parallel training. It uses data parallelism, ZeRO (zero redundancy optimizer), optimizer offloading, checkpointing and operator fusion, and communication-computation overlap to optimize the training process.
 
- The tokenizer used by the Aquila model was trained from scratch by us and supports both Chinese and English. Its parameters are compared with those of other tokenizers in the table below.
 
- We sampled ten thousand examples each from English, Chinese, and code data, tokenized them with the different tokenizers, and counted the tokens in each sample. The averages are included in the table.
 
- | Model | Vocab size | Note | Avg tokens (English) | Avg tokens (Chinese) | Avg tokens (code) |
- | ----- | ---- | ----- | ---- | ----- | ---- |
- | GPT2 | 50257 | bpe | 1717 | 1764 | 2323 |
- | LLaMA | 32000 | sp(bpe) | 1805 | 1257 | 1970 |
- | Aquila | 100000 | bpe | 1575 | 477 | 1679 |
-
- ## Training data
- The Aquila-7B model was pretrained on the Pile, [RedPajama-Data-1T](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T), [Wikipedia](https://huggingface.co/datasets/wikipedia), [C4](https://huggingface.co/datasets/c4), the WuDao Chinese corpus, e-books, patents, encyclopedias, forums, GitHub data, and other sources. Details are given in the figure below.
- ![Screenshot](./img/data_dist.png)
-
- ## How to use
-
- ### 1. Pre-training
- #### Step 1: Modify Parameters
-
- * `cd /examples/aquila`
- * Configure the `hostfile` file; refer to [here](../../../docs/TUTORIAL_8_ENVIRONMENT_SETUP.md)
- * Configure the `bmtrain_mgpu.sh` file: change `SCRIPT_FILE` to `aquila_pretrain.py`
- * (Optional) Change parameters in `Aquila-pretrain.yaml`
-
- | Parameter | Type | Description |
- |--------------------------------|------------|-------------------------------------------------------|
- | batch_size | int | The number of samples drawn from the dataset in each training iteration. Generally, a larger batch size speeds up processing but consumes more memory |
- | gradient_accumulation_steps | int | The number of mini-batches over which gradients are accumulated before the model weights are updated. Mainly useful when GPU memory is limited: a small batch_size combined with gradient accumulation achieves the same effect as a large batch_size |
- | lr | float | The step size or rate at which the model updates its parameters during training. A learning rate that is too high may prevent the model from converging, while one that is too low may lead to long training times or getting stuck in a local optimum |
- | warm_up | float | The ratio between the initial learning rate and the original learning rate |
- | save_interval | int | The interval at which the model is saved, i.e., the number of training iterations between checkpoints. When training takes a long time, periodic saving prevents all progress from being lost to sudden interruptions or errors |
-
- * Our demo dataset is located in `../indexed_dataset/data/demo_text_document`. To use a different pre-training dataset, change the `data_prefix` parameter in `aquila_pretrain.py`.
- #### Step 2: Start training
- ```
- bash dist_trigger_docker.sh hostfile Aquila-pretrain.yaml aquila-7b [experiment_name]
- ```
- The following information will be output. Note that `NODES_NUM` should equal the number of nodes, and `LOGFILE` is the log file for the model run.
-
- ![Screenshot](./info.jpg)
-
- Before training starts successfully, you should see the following information (the specific parameters may differ):
-
- ![Screenshot](./info2.jpg)
-
- ### 2. Supervised Fine-tuning (SFT)
- #### Step 1: Modify Parameters
- * `cd /examples/aquila`
- * Configure the `hostfile` file; refer to [here](../../../docs/TUTORIAL_8_ENVIRONMENT_SETUP.md)
- * Configure the `bmtrain_mgpu.sh` file: change `SCRIPT_FILE` to `aquila_pretrain.py`
- * (Optional) Change parameters in `Aquila-pretrain.yaml`
-
- #### Step 2: Start SFT
- ```
- bash dist_trigger_docker.sh hostfile aquila-sft.yaml aquila-7b [experiment_name]
- ```
- The following information will be output. Note that `NODES_NUM` should equal the number of nodes, and `LOGFILE` is the log file for the model run.
-
- ![Screenshot](./info.jpg)
-
- Before training starts successfully, you should see the following information (the specific parameters may differ):
-
- ![Screenshot](./info2.jpg)
-
- ### 3. Inference
-
- ```python
- import os
- import torch
- from flagai.auto_model.auto_loader import AutoLoader
- from flagai.model.predictor.predictor import Predictor
- from flagai.data.tokenizer import Tokenizer
- import bminf
-
- # Directory that holds the downloaded checkpoints
- state_dict = "./checkpoints_in/"
- model_name = 'aquila-7b'
-
- # Load the model and its tokenizer
- loader = AutoLoader(
-     "lm",
-     model_dir=state_dict,
-     model_name=model_name,
-     use_cache=True)
- model = loader.get_model()
- tokenizer = loader.get_tokenizer()
-
- # Run inference in half precision on the GPU
- model.eval()
- model.half()
- model.cuda()
-
- predictor = Predictor(model, tokenizer)
-
- # Prompt: "Where is Beijing?"
- text = "北京在哪儿?"
- text = f'{text}'
- print(f"text is {text}")
- with torch.no_grad():
-     out = predictor.predict_generate_randomsample(text, out_max_length=200, temperature=0)
-     print(f"pred is {out}")
-
- ```
-
- ## License
-
- The Aquila-7B open-source model is released under the [BAAI Aquila Model Licence Agreement](linkhere). The original code is based on [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0)
 
 
  The Aquila-7B open-source model is licensed under the [BAAI Aquila Model Licence Agreement](linkhere). The source code is under [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0)
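
The tokenizer comparison in the removed README can be made concrete by turning its average token counts into relative figures. The following is a minimal sketch that uses only the numbers from that table. The names `AVG_TOKENS` and `relative_tokens` are illustrative assumptions and are not part of FlagAI.

```python
# Average token counts per sample, copied from the tokenizer table in the removed README.
AVG_TOKENS = {
    "GPT2":   {"english": 1717, "chinese": 1764, "code": 2323},
    "LLaMA":  {"english": 1805, "chinese": 1257, "code": 1970},
    "Aquila": {"english": 1575, "chinese": 477,  "code": 1679},
}


def relative_tokens(model: str, baseline: str) -> dict:
    """Return, per data type, the ratio of `model` tokens to `baseline` tokens.

    A ratio below 1.0 means `model` needs fewer tokens for the same samples.
    (Hypothetical helper for illustration only.)
    """
    return {
        kind: AVG_TOKENS[model][kind] / AVG_TOKENS[baseline][kind]
        for kind in AVG_TOKENS[model]
    }


if __name__ == "__main__":
    for baseline in ("GPT2", "LLaMA"):
        ratios = relative_tokens("Aquila", baseline)
        summary = ", ".join(f"{kind}: {ratio:.2f}x" for kind, ratio in ratios.items())
        print(f"Aquila vs {baseline} -> {summary}")
```

On these figures the Aquila tokenizer needs about 0.38 times as many tokens as LLaMA for the Chinese samples (477 versus 1257), while the gap for English and code is much smaller.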
 
  license: other
  ---
 
+ # Aquila
 
+ The Aquila language model is the first open-source language model that supports both Chinese and English knowledge, permits commercial use under its license, and complies with domestic data regulations.
 
+ - 🌟 **Supports open-source commercial licensing**. The source code of the Aquila series models is released under the [Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0), while the model weights are released under the [BAAI Aquila Model License Agreement](../../BAAI_Aquila_Model_License.pdf). Users may use the models for commercial purposes as long as they comply with the license terms.
 
+ - ✍️ **Possesses Chinese and English knowledge**. The Aquila series models are trained from scratch on high-quality Chinese and English corpora, with Chinese corpora accounting for about 40%, so that the models accumulate native Chinese world knowledge during pre-training rather than translated knowledge.
 
+ - 👮‍♀️ **Complies with domestic data regulations**. The Chinese corpora of the Aquila series models come from the Chinese datasets that BAAI has accumulated over the years, including Chinese internet data from over 10,000 sources (more than 99% of which are domestic), as well as high-quality Chinese literature and book data supported by authoritative domestic organizations. We will continue to accumulate high-quality, diverse datasets and incorporate them into subsequent training of the Aquila base models.
 
+ - 🎯 **Continuous improvements and open sourcing**. We will continue to improve the training data, optimize training methods, and enhance model performance, cultivating a flourishing "model tree" on a better base model, and we will keep updating the open-source versions.
 
+ Additional details of the Aquila models will be presented in the official technical report. Please stay tuned for updates on official channels, including the [FlagAI GitHub repository](https://github.com/FlagAI-Open/FlagAI/), [FlagAI's Zhihu account](https://www.zhihu.com/people/95-22-20-18) and [FlagAI's official technical communication group](https://github.com/FlagAI-Open/FlagAI/blob/master/wechat-qrcode.jpg).
 
+ | Model | Model Type | Description | File Path | Standalone Model Download | Status | GPUs Used |
+ | :----------------- | :----------------------- | :----------------------- | :----------------------- | :----------------------- | :-------------- | :----------- |
+ | Aquila-7B | Base model, 7 billion parameters | **Aquila Base Model** inherits the architectural design advantages of GPT-3 and LLaMA. It replaces a batch of underlying operators with more efficient implementations, redesigns the bilingual tokenizer, upgrades the BMTrain parallel training method, and achieves nearly 8 times the training efficiency of Megatron+DeepSpeed ZeRO-2. | [./examples/Aquila/Aquila-pretrain](https://github.com/FlagAI-Open/FlagAI/tree/master/examples/Aquila/Aquila-pretrain) | [Download Aquila-7B](http://model.baai.ac.cn/model-detail/100098) | Released | Nvidia-A100 |
+ | Aquila-33B | Base model, 33 billion parameters | Same as above | N/A | N/A | Coming soon | Nvidia-A100 |
+ | AquilaChat-7B | SFT model, fine-tuned with SFT and RL on top of Aquila-7B | **AquilaChat Dialog Model** supports fluent text dialogue and multiple language generation tasks, and lets AquilaChat call other models and tools through an extensible special-instruction specification. For example, calling BAAI's open-source **[AltDiffusion](https://github.com/FlagAI-Open/FlagAI/tree/master/examples/AltDiffusion-m18) multilingual text-to-image generation model** gives it fluent image generation capability. Combined with BAAI's **InstructFace multi-step controllable text-to-image model**, it can easily perform multi-step controllable editing of human face images. | [./examples/Aquila/Aquila-chat](https://github.com/FlagAI-Open/FlagAI/tree/master/examples/Aquila/Aquila-chat) | [Download AquilaChat-7B](https://model.baai.ac.cn/model-detail/100101) | Released | Nvidia-A100 |
+ | AquilaChat-33B | SFT model, fine-tuned with SFT and RL on top of Aquila-33B | Same as above | N/A | N/A | Coming soon | Nvidia-A100 |
+ | AquilaCode-7B-NV | Base model, "text-code" generation model, further pre-trained from Aquila-7B, trained on Nvidia chips | AquilaCode-7B achieves high performance with a small dataset and parameter count, and is currently the best open-source code model that supports both Chinese and English. It is trained on code data with compliant open-source licenses after high-quality filtering. AquilaCode-7B code models have been trained on both Nvidia and domestic chips. | [./examples/Aquila/Aquila-code](https://github.com/FlagAI-Open/FlagAI/tree/master/examples/Aquila/Aquila-code) | [Download AquilaCode-7B-NV](https://model.baai.ac.cn/model-detail/100102) | Released | Nvidia-A100 |
+ | AquilaCode-7B-TS | Base model, "text-code" generation model, further pre-trained from Aquila-7B, trained on Tianshu chips | Same as above | [./examples/Aquila/Aquila-code](https://github.com/FlagAI-Open/FlagAI/tree/master/examples/Aquila/Aquila-code) | [Download AquilaCode-7B-TS](https://model.baai.ac.cn/model-detail/100099) | Released | Tianshu-BI-V100 |
 
+ We will continue to release improved versions of the Aquila models as open source. To update, first delete the `model_pytorch.bin` file in the original directory, then download the new weights. Other usage remains unchanged. For more details, please refer to the **[Change Log](./changelog.md)**.
 
+ <br>If you have any questions, please refer to the [FAQ](https://github.com/FlagAI-Open/FlagAI/issues/371) first. If it does not resolve them, please submit an [issue](https://github.com/FlagAI-Open/FlagAI/issues) directly.
 
 
  The Aquila-7B open-source model is licensed under the [BAAI Aquila Model Licence Agreement](linkhere). The source code is under [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0)
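
The updated README asks users to delete the old weights file in the original directory before downloading a new release. A minimal sketch of that clean-up step follows. The helper `remove_stale_weights` and the example directory are hypothetical, and the default filename is taken from the README on the assumption that its stray backtick is a typo. Downloading the new weights is left out because the README does not specify a command for it.

```python
from pathlib import Path


def remove_stale_weights(model_dir: str, filename: str = "model_pytorch.bin") -> bool:
    """Delete the old weights file if it exists.

    Returns True if a file was removed and False if there was nothing to delete.
    (Hypothetical helper; not part of FlagAI.)
    """
    weights = Path(model_dir) / filename
    if weights.is_file():
        weights.unlink()
        return True
    return False


if __name__ == "__main__":
    # "./checkpoints_in/aquila-7b" is an assumed local path; adjust it to your setup.
    removed = remove_stale_weights("./checkpoints_in/aquila-7b")
    print("old weights removed" if removed else "no old weights found")
```

Returning a boolean instead of raising keeps the step safe to run repeatedly, for example at the start of an update script.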