|
--- |
|
tasks: |
|
|
|
- text-generation |
|
|
|
model_type: |
|
|
|
- gpt |
|
- llama |
|
|
|
domain: |
|
|
|
- nlp |
|
|
|
language: |
|
|
|
- en |
|
- zh |
|
|
|
|
tags: |
|
- transformer |
|
- 封神榜 |
|
--- |
|
# Ziya2-13B-Base |
|
|
|
- Main Page: [Fengshenbang](https://fengshenbang-lm.com/)
|
- Github: [Fengshenbang-LM](https://github.com/IDEA-CCNL/Fengshenbang-LM) |
|
|
|
|
|
# 姜子牙系列模型 Ziya Model Series
|
|
|
- [Ziya-LLaMA-13B-v1](https://huggingface.co/IDEA-CCNL/Ziya-LLaMA-13B-v1) |
|
- [Ziya-LLaMA-7B-Reward](https://huggingface.co/IDEA-CCNL/Ziya-LLaMA-7B-Reward) |
|
- [Ziya-LLaMA-13B-Pretrain-v1](https://huggingface.co/IDEA-CCNL/Ziya-LLaMA-13B-Pretrain-v1) |
|
- [Ziya-BLIP2-14B-Visual-v1](https://huggingface.co/IDEA-CCNL/Ziya-BLIP2-14B-Visual-v1) |
|
|
|
## 简介 Brief Introduction |
|
|
|
Ziya2-13B-Base 是基于LLaMa2的130亿参数大规模预训练模型,针对中文分词优化,并完成了中英文 650B tokens 的增量预训练,进一步提升了中文生成和理解能力。 |
|
|
|
Ziya2-13B-Base is a large-scale pre-trained model based on LLaMA2 with 13 billion parameters. We optimized the LLaMA tokenizer for Chinese and incrementally pre-trained the LLaMA2-13B model on 650 billion tokens of Chinese and English data, which significantly improves its Chinese understanding and generation abilities.
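
To see the effect of the Chinese tokenizer optimization, one can compare how many tokens the same Chinese sentence produces under the Ziya2 tokenizer and an original LLaMA2 tokenizer. A minimal sketch, not part of the official card; the `meta-llama/Llama-2-13b-hf` baseline is an assumption here and requires gated access:

```python
from transformers import AutoTokenizer

# Ziya2 tokenizer (Chinese-optimized) vs. the original LLaMA2 tokenizer.
# Note: meta-llama/Llama-2-13b-hf is gated; substitute any LLaMA2 tokenizer you can access.
ziya_tokenizer = AutoTokenizer.from_pretrained("IDEA-CCNL/Ziya2-13B-Base")
llama_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")

text = "问题:我国的三皇五帝分别指的是谁?"
print("Ziya2 tokens :", len(ziya_tokenizer.tokenize(text)))
print("LLaMA2 tokens:", len(llama_tokenizer.tokenize(text)))
```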
|
|
|
## 模型分类 Model Taxonomy |
|
|
|
| 需求 Demand | 任务 Task | 系列 Series | 模型 Model | 参数 Parameter | 额外 Extra | |
|
|:----------:|:-------:|:---------:|:--------:|:------------:|:---------------:| |
|
| 通用 General | AGI模型 AGI Model | 姜子牙 Ziya | LLaMA2 | 13B | English&Chinese |
|
|
|
## 模型信息 Model Information |
|
|
|
### 继续预训练 Continual Pretraining |
|
|
|
Meta在2023年7月份发布了Llama2系列大模型,相比于LLaMA1的1.4万亿Token数据,Llama2预训练的Token达到了2万亿,并在各个榜单中明显超过LLaMA1。 |
|
|
|
Meta released the Llama2 series of large models in July 2023. Compared to LLaMA1's 1.4 trillion pre-training tokens, Llama2 was pre-trained on 2 trillion tokens and clearly outperforms LLaMA1 on various leaderboards.
|
|
|
Ziya2-13B-Base沿用了Ziya-LLaMA-13B高效的中文编解码方式,但采取了更优化的初始化算法使得初始训练loss更低。同时,我们对Fengshen-PT继续训练框架进行了优化,效率方面,整合了FlashAttention2、Apex RMS norm等技术来帮助提升效率,对比Ziya-LLaMA-13B训练速度提升38%(163 TFLOPS/per gpu/per sec)。稳定性方面,我们采取BF16进行训练,修复了底层分布式框架的bug,确保模型能够持续稳定训练,解决了Ziya-LLaMA-13B遇到的训练后期不稳定的问题,并在7.25号进行了直播,最终完成了全部数据的继续训练。我们也发现,模型效果还有进一步提升的趋势,后续也会对Ziya2-13B-Base进行继续优化。 |
|
|
|
Ziya2-13B-Base retains the efficient Chinese encoding and decoding of Ziya-LLaMA-13B, but uses a better initialization algorithm that yields a lower initial training loss. We also optimized the Fengshen-PT continual pretraining framework. For efficiency, we integrated technologies such as FlashAttention2 and Apex RMSNorm, resulting in a 38% increase in training speed over Ziya-LLaMA-13B (163 TFLOPS per GPU). For stability, we trained in BF16 and fixed bugs in the underlying distributed framework so that training could proceed continuously and stably, resolving the late-stage instability encountered while training Ziya-LLaMA-13B. We also held a livestream on July 25th and completed continual training on all of the data. We observe that model quality still has room to improve, and we plan to continue optimizing Ziya2-13B-Base.
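
FlashAttention2 and BF16 were used during training; a similar configuration can also be requested at inference time. A minimal sketch, assuming a recent transformers release and an installed flash-attn package (drop `attn_implementation` to fall back to the default attention):

```python
import torch
from transformers import LlamaForCausalLM

# Load Ziya2-13B-Base in BF16 with the FlashAttention-2 kernel.
# Requires a recent transformers version and the flash-attn package;
# otherwise remove attn_implementation and the default attention is used.
model = LlamaForCausalLM.from_pretrained(
    "IDEA-CCNL/Ziya2-13B-Base",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
).eval()
```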
|
|
|
![Loss curve](./img2.png)
|
|
|
### 效果评估 Performance |
|
|
|
Ziya2-13B-Base在Llama2-13B的基础上进行了约650B自建高质量中英文数据集的继续训练,在中文、英文、数学、代码等下游理解任务上相对于Llama2-13B取得了明显的提升,相对Ziya-LLaMA-13B也有明显的提升。 |
|
|
|
Building on Llama2-13B, Ziya2-13B-Base was further trained on approximately 650 billion tokens of self-collected, high-quality Chinese and English data. It achieves clear improvements over Llama2-13B on downstream understanding tasks in Chinese, English, mathematics, and code, and also improves noticeably over Ziya-LLaMA-13B.
|
|
|
![Performance evaluation](./img3.png)
|
|
|
## 使用 Usage |
|
|
|
加载模型,进行续写:
|
|
|
Load the model and generate a continuation:
|
|
|
```python
import torch
from transformers import AutoTokenizer, LlamaForCausalLM

ckpt = 'IDEA-CCNL/Ziya2-13B-Base'
query = "问题:我国的三皇五帝分别指的是谁?答案:"

# Load the model in FP16 and let transformers place it on available GPUs.
model = LlamaForCausalLM.from_pretrained(ckpt, torch_dtype=torch.float16, device_map="auto").eval()
tokenizer = AutoTokenizer.from_pretrained(ckpt)

# Encode the prompt and move it to the first GPU.
input_ids = tokenizer(query, return_tensors="pt").input_ids.to('cuda:0')

# Sample a continuation with nucleus sampling.
generate_ids = model.generate(
    input_ids,
    max_new_tokens=512,
    do_sample=True,
    top_p=0.9)

output = tokenizer.batch_decode(generate_ids)[0]
print(output)
```
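
For a more reproducible continuation, sampling can be disabled in the call above; one possible variant, reusing `model`, `tokenizer`, and `input_ids` from the snippet:

```python
# Greedy decoding: deterministic output at the cost of diversity.
generate_ids = model.generate(input_ids, max_new_tokens=512, do_sample=False)
print(tokenizer.batch_decode(generate_ids)[0])
```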
|
|
|
上面是简单的续写示例,其他更多prompt和玩法,感兴趣的朋友可以下载下来自行发掘。 |
|
|
|
The above is a simple continuation example. For more prompts and ways to use the model, feel free to download it and explore on your own.
|
|
|
## 引用 Citation |
|
|
|
如果您在您的工作中使用了我们的模型,可以引用我们的[论文](https://arxiv.org/abs/2210.08590): |
|
|
|
If you use this resource in your work, please cite our [paper](https://arxiv.org/abs/2210.08590):
|
|
|
```text |
|
@article{Ziya2, |
|
author = {Ruyi Gan and Ziwei Wu and Renliang Sun and Junyu Lu and Xiaojun Wu and Dixiang Zhang and Kunhao Pan and Ping Yang and Qi Yang and Jiaxing Zhang and Yan Song},
|
title = {Ziya2: Data-centric Learning is All LLMs Need}, |
|
year = {2023} |
|
} |
|
``` |
|
|
|
欢迎引用我们的[网站](https://github.com/IDEA-CCNL/Fengshenbang-LM/):

You can also cite our [website](https://github.com/IDEA-CCNL/Fengshenbang-LM/):
|
|
|
```text |
|
@misc{Fengshenbang-LM, |
|
title={Fengshenbang-LM}, |
|
author={IDEA-CCNL}, |
|
year={2021}, |
|
howpublished={\url{https://github.com/IDEA-CCNL/Fengshenbang-LM}}, |
|
} |
|
``` |
|
|