---
tasks:

- text-generation

model_type:

- gpt
- llama

domain:

- nlp

language:

- en
- zh
- cn

tags:
- transformer
- 封神榜
---
# Ziya2-13B-Base

- Main Page: [Fengshenbang](https://fengshenbang-lm.com/)
- Github: [Fengshenbang-LM](https://github.com/IDEA-CCNL/Fengshenbang-LM)


# Ziya Series Models

- [Ziya-LLaMA-13B-v1](https://huggingface.co/IDEA-CCNL/Ziya-LLaMA-13B-v1)
- [Ziya-LLaMA-7B-Reward](https://huggingface.co/IDEA-CCNL/Ziya-LLaMA-7B-Reward)
- [Ziya-LLaMA-13B-Pretrain-v1](https://huggingface.co/IDEA-CCNL/Ziya-LLaMA-13B-Pretrain-v1)
- [Ziya-BLIP2-14B-Visual-v1](https://huggingface.co/IDEA-CCNL/Ziya-BLIP2-14B-Visual-v1)

## Brief Introduction

Ziya2-13B-Base is a large-scale pre-trained model with 13 billion parameters, based on Llama2. We optimized the LLaMA tokenizer for Chinese and incrementally pre-trained the Llama2-13B model on 650 billion tokens of Chinese and English data, which further improved its Chinese generation and understanding abilities.

## Model Taxonomy

| Demand  | Task      | Series | Model  | Parameter | Extra           |
|:-------:|:---------:|:------:|:------:|:---------:|:---------------:|
| General | AGI model | Ziya   | LLaMA2 | 13B       | English&Chinese |

## Model Information

### Continual Pretraining

Meta released the Llama2 series of large models in July 2023. Llama2 was pre-trained on 2 trillion tokens, up from 1.4 trillion tokens for LLaMA1, and clearly outperforms LLaMA1 on various leaderboards.

Ziya2-13B-Base retains the efficient Chinese encoding and decoding scheme of Ziya-LLaMA-13B, but uses a better initialization algorithm that lowers the initial training loss. We also optimized the Fengshen-PT continual pre-training framework. For efficiency, we integrated techniques such as FlashAttention2 and the Apex RMS norm, which increased training speed by 38% over Ziya-LLaMA-13B (163 TFLOPS per GPU). For stability, we trained in BF16 and fixed bugs in the underlying distributed framework so that training could run continuously and stably, which resolved the late-stage instability encountered when training Ziya-LLaMA-13B. We held a live broadcast on July 25th and eventually completed continual training on all of the data. Model performance was still trending upward at that point, so we plan to keep optimizing Ziya2-13B-Base.

![loss curve](./img2.png)
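The throughput figures above can be turned into a rough compute budget. The sketch below is a back-of-the-envelope estimate, not a number reported by the authors. It assumes that 163 TFLOPS per GPU is the improved throughput, and it uses the common rule of thumb of about 6 FLOPs per parameter per token for training a dense transformer. The function names are made up for illustration.

```python
# Figures quoted in this model card.
PARAMS = 13e9            # 13 billion parameters
TOKENS = 650e9           # 650 billion continual pre-training tokens
TFLOPS_PER_GPU = 163.0   # assumption: this is the throughput AFTER the 38% speedup
SPEEDUP = 0.38           # 38% faster than Ziya-LLaMA-13B


def implied_baseline_tflops(new_tflops, speedup):
    """Throughput before the speedup, if new = old * (1 + speedup)."""
    return new_tflops / (1.0 + speedup)


def estimated_gpu_hours(params, tokens, tflops_per_gpu):
    """GPU-hours under the ~6 * params * tokens FLOPs rule of thumb (an assumption)."""
    total_flops = 6.0 * params * tokens
    seconds = total_flops / (tflops_per_gpu * 1e12)
    return seconds / 3600.0


baseline = implied_baseline_tflops(TFLOPS_PER_GPU, SPEEDUP)
gpu_hours = estimated_gpu_hours(PARAMS, TOKENS, TFLOPS_PER_GPU)
print(f"implied Ziya-LLaMA-13B throughput: {baseline:.1f} TFLOPS per GPU")
print(f"estimated compute for 650B tokens: {gpu_hours:,.0f} GPU-hours")
```

Under these assumptions the previous throughput comes out at roughly 118 TFLOPS per GPU, and the 650B-token run at on the order of 86,000 GPU-hours. The real cost depends on details this estimate ignores, such as activation recomputation and restarts.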

### Performance

Ziya2-13B-Base was further trained from Llama2-13B on approximately 650 billion tokens of self-built, high-quality Chinese and English data. On downstream understanding tasks covering Chinese, English, mathematics, and code, it clearly improves over both Llama2-13B and Ziya-LLaMA-13B.

![performance evaluation](./img3.png)

## Usage

Load the model and generate a continuation:

```python3
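import torch
from transformers import AutoTokenizer, LlamaForCausalLM

ckpt = 'IDEA-CCNL/Ziya2-13B-Base'
# Prompt: "Question: Who are the Three Sovereigns and Five Emperors of China? Answer:"
query = "问题：我国的三皇五帝分别指的是谁？答案："

# Load the weights in half precision; device_map="auto" places them on the available devices.
model = LlamaForCausalLM.from_pretrained(ckpt, torch_dtype=torch.float16, device_map="auto").eval()
tokenizer = AutoTokenizer.from_pretrained(ckpt)

# Tokenize the prompt and move it to the model's device.
input_ids = tokenizer(query, return_tensors="pt").input_ids.to(model.device)

# Sample a continuation of up to 512 new tokens with nucleus (top-p) sampling.
generate_ids = model.generate(
    input_ids,
    max_new_tokens=512,
    do_sample=True,
    top_p=0.9)
output = tokenizer.batch_decode(generate_ids)[0]
print(output)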
```
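The `top_p = 0.9` argument in the usage snippet turns on nucleus sampling: at each step only the most probable tokens, whose probabilities together reach 0.9, remain candidates. The pure-Python sketch below illustrates that idea on a toy distribution. It is not the `transformers` implementation, and the `top_p_filter` name and the probabilities are made up for illustration.

```python
def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of most-probable tokens whose total probability reaches top_p."""
    # Rank tokens from most to least probable.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize so the surviving probabilities sum to 1.
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# Toy next-token distribution (made-up numbers, for illustration only).
toy_probs = {"A": 0.5, "B": 0.3, "C": 0.15, "D": 0.05}
print(top_p_filter(toy_probs, top_p=0.9))
```

Here the least likely token `D` is dropped and the remaining three are renormalized before sampling. Lowering `top_p` narrows the candidate set and makes generations more conservative.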

The usage snippet in this section is a simple continuation example. For more prompts and creative uses, feel free to download the model and explore on your own.
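One way to explore other prompts is to reuse the question/answer format from the usage snippet and place a few worked examples in front of the new question (few-shot prompting). The helper below is a hypothetical sketch, not part of the model's API. `build_qa_prompt` and the newline separator are assumptions, and whether few-shot examples help on a given task has to be checked empirically.

```python
def build_qa_prompt(question, examples=()):
    """Build a prompt in the '问题：…答案：' ("Question: … Answer:") format.

    `examples` is an optional sequence of (question, answer) pairs that are
    placed before the final, unanswered question.
    """
    parts = [f"问题：{q}答案：{a}" for q, a in examples]
    # The final question is left unanswered so the model continues from "答案：".
    parts.append(f"问题：{question}答案：")
    return "\n".join(parts)


# Zero-shot: reproduces the query string used in the usage snippet.
print(build_qa_prompt("我国的三皇五帝分别指的是谁？"))

# One-shot: a worked example precedes the new question.
print(build_qa_prompt("2+3等于几？", examples=[("1+1等于几？", "2")]))
```

The returned string can be passed as `query` in the usage snippet.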

## Citation

If you use our model in your work, please cite our [paper](https://arxiv.org/abs/2210.08590):

```text
@article{Ziya2,
  author    = {Ruyi Gan and Ziwei Wu and Renliang Sun and Junyu Lu and Xiaojun Wu and Dixiang Zhang and Kunhao Pan and Ping Yang and Qi Yang and Jiaxing Zhang and Yan Song},
  title     = {Ziya2: Data-centric Learning is All LLMs Need},
  year      = {2023}
}
```

You can also cite our [website](https://github.com/IDEA-CCNL/Fengshenbang-LM/):

```text
@misc{Fengshenbang-LM,
  title={Fengshenbang-LM},
  author={IDEA-CCNL},
  year={2021},
  howpublished={\url{https://github.com/IDEA-CCNL/Fengshenbang-LM}},
}
```