|
--- |
|
tasks: |
|
|
|
- text-generation |
|
|
|
model_type: |
|
|
|
- gpt |
|
- llama |
|
|
|
domain: |
|
|
|
- nlp |
|
|
|
language: |
|
|
|
- en |
|
- zh |
|
|
|
|
tags: |
|
- transformer |
|
- 封神榜 |
|
--- |
|
# Ziya2-13B-Base |
|
|
|
- Main Page: [Fengshenbang](https://fengshenbang-lm.com/)
|
- Github: [Fengshenbang-LM](https://github.com/IDEA-CCNL/Fengshenbang-LM) |
|
|
|
|
|
# 姜子牙系列模型 Ziya Model Series
|
|
|
- [Ziya-LLaMA-13B-v1](https://huggingface.co/IDEA-CCNL/Ziya-LLaMA-13B-v1) |
|
- [Ziya-LLaMA-7B-Reward](https://huggingface.co/IDEA-CCNL/Ziya-LLaMA-7B-Reward) |
|
- [Ziya-LLaMA-13B-Pretrain-v1](https://huggingface.co/IDEA-CCNL/Ziya-LLaMA-13B-Pretrain-v1) |
|
- [Ziya-BLIP2-14B-Visual-v1](https://huggingface.co/IDEA-CCNL/Ziya-BLIP2-14B-Visual-v1) |
|
|
|
## 简介 Brief Introduction |
|
|
|
Ziya2-13B-Base 是基于LLaMa2的130亿参数大规模预训练模型,针对中文分词优化,并完成了中英文 650B tokens 的增量预训练,进一步提升了中文生成和理解能力。 |
|
|
|
Ziya2-13B-Base is a large-scale pre-trained model based on LLaMA2 with 13 billion parameters. We optimized the LLaMA tokenizer for Chinese and incrementally pre-trained the LLaMA2-13B model on 650 billion tokens of Chinese and English data, which significantly improves its Chinese understanding and generation abilities.
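
To see the effect of the Chinese tokenizer optimization, one can compare how many tokens the same Chinese sentence produces under the Ziya2 tokenizer and an original LLaMA2 tokenizer. A minimal sketch, not part of the official card; the `meta-llama/Llama-2-13b-hf` baseline is an assumption here and requires gated access:

```python
from transformers import AutoTokenizer

# Ziya2 tokenizer (Chinese-optimized) vs. the original LLaMA2 tokenizer.
# Note: meta-llama/Llama-2-13b-hf is gated; substitute any LLaMA2 tokenizer you can access.
ziya_tokenizer = AutoTokenizer.from_pretrained("IDEA-CCNL/Ziya2-13B-Base")
llama_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")

text = "问题:我国的三皇五帝分别指的是谁?"
print("Ziya2 tokens :", len(ziya_tokenizer.tokenize(text)))
print("LLaMA2 tokens:", len(llama_tokenizer.tokenize(text)))
```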
|
|
|
## 模型分类 Model Taxonomy |
|
|
|
| 需求 Demand | 任务 Task | 系列 Series | 模型 Model | 参数 Parameter | 额外 Extra | |
|
|:----------:|:-------:|:---------:|:--------:|:------------:|:---------------:| |
|
| 通用 General | AGI模型 AGI Model | 姜子牙 Ziya | LLaMA2 | 13B | English&Chinese |
|
|
|
## 模型信息 Model Information |
|
|
|
### 继续预训练 Continual Pretraining |
|
|
|
Meta在2023年7月份发布了Llama2系列大模型,相比于LLaMA1的1.4万亿Token数据,Llama2预训练的Token达到了2万亿,并在各个榜单中明显超过LLaMA1。 |
|
|
|
Meta released the Llama2 series of large models in July 2023. Compared to LLaMA1's 1.4 trillion pre-training tokens, Llama2 was pre-trained on 2 trillion tokens and clearly outperforms LLaMA1 on various leaderboards.
|
|
|
Ziya2-13B-Base沿用了Ziya-LLaMA-13B高效的中文编解码方式,但采取了更优化的初始化算法使得初始训练loss更低。同时,我们对Fengshen-PT继续训练框架进行了优化,效率方面,整合了FlashAttention2、Apex RMS norm等技术来帮助提升效率,对比Ziya-LLaMA-13B训练速度提升38%(163 TFLOPS/per gpu/per sec)。稳定性方面,我们采取BF16进行训练,修复了底层分布式框架的bug,确保模型能够持续稳定训练,解决了Ziya-LLaMA-13B遇到的训练后期不稳定的问题,并在7.25号进行了直播,最终完成了全部数据的继续训练。我们也发现,模型效果还有进一步提升的趋势,后续也会对Ziya2-13B-Base进行继续优化。 |
|
|
|
Ziya2-13B-Base retains the efficient Chinese encoding and decoding of Ziya-LLaMA-13B, but uses a better initialization algorithm that yields a lower initial training loss. We also optimized the Fengshen-PT continual pretraining framework. For efficiency, we integrated technologies such as FlashAttention2 and Apex RMSNorm, resulting in a 38% increase in training speed over Ziya-LLaMA-13B (163 TFLOPS per GPU). For stability, we trained in BF16 and fixed bugs in the underlying distributed framework so that training could proceed continuously and stably, resolving the late-stage instability encountered while training Ziya-LLaMA-13B. We also held a livestream on July 25th and completed continual training on all of the data. We observe that model quality still has room to improve, and we plan to continue optimizing Ziya2-13B-Base.
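
FlashAttention2 and BF16 were used during training; a similar configuration can also be requested at inference time. A minimal sketch, assuming a recent transformers release and an installed flash-attn package (drop `attn_implementation` to fall back to the default attention):

```python
import torch
from transformers import LlamaForCausalLM

# Load Ziya2-13B-Base in BF16 with the FlashAttention-2 kernel.
# Requires a recent transformers version and the flash-attn package;
# otherwise remove attn_implementation and the default attention is used.
model = LlamaForCausalLM.from_pretrained(
    "IDEA-CCNL/Ziya2-13B-Base",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
).eval()
```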
|
|
|
![Loss curve](./img2.png)
|
|
|
### 效果评估 Performance |
|
|
|
Ziya2-13B-Base在Llama2-13B的基础上进行了约650B自建高质量中英文数据集的继续训练,在中文、英文、数学、代码等下游理解任务上相对于Llama2-13B取得了明显的提升,相对Ziya-LLaMA-13B也有明显的提升。 |
|
|
|
Building on Llama2-13B, Ziya2-13B-Base was further trained on approximately 650 billion tokens of self-collected, high-quality Chinese and English data. It achieves clear improvements over Llama2-13B on downstream understanding tasks in Chinese, English, mathematics, and code, and also improves noticeably over Ziya-LLaMA-13B.
|
|
|
![Performance evaluation](./img3.png)
|
|
|
## 使用 Usage |
|
|
|
加载模型,进行续写:
|
|
|
Load the model and generate a continuation:
|
|
|
```python
import torch
from transformers import AutoTokenizer, LlamaForCausalLM

ckpt = 'IDEA-CCNL/Ziya2-13B-Base'
query = "问题:我国的三皇五帝分别指的是谁?答案:"

# Load the model in FP16 and let transformers place it on available GPUs.
model = LlamaForCausalLM.from_pretrained(ckpt, torch_dtype=torch.float16, device_map="auto").eval()
tokenizer = AutoTokenizer.from_pretrained(ckpt)

# Encode the prompt and move it to the first GPU.
input_ids = tokenizer(query, return_tensors="pt").input_ids.to('cuda:0')

# Sample a continuation with nucleus sampling.
generate_ids = model.generate(
    input_ids,
    max_new_tokens=512,
    do_sample=True,
    top_p=0.9)

output = tokenizer.batch_decode(generate_ids)[0]
print(output)
```
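
For a more reproducible continuation, sampling can be disabled in the call above; one possible variant, reusing `model`, `tokenizer`, and `input_ids` from the snippet:

```python
# Greedy decoding: deterministic output at the cost of diversity.
generate_ids = model.generate(input_ids, max_new_tokens=512, do_sample=False)
print(tokenizer.batch_decode(generate_ids)[0])
```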
|
|
|
上面是简单的续写示例,其他更多prompt和玩法,感兴趣的朋友可以下载下来自行发掘。 |
|
|
|
The above is a simple continuation example. For more prompts and ways to use the model, feel free to download it and explore on your own.
|
|
|
## 引用 Citation |
|
|
|
如果您在您的工作中使用了我们的模型,可以引用我们的[论文](https://arxiv.org/abs/2210.08590): |
|
|
|
If you use this resource in your work, please cite our [paper](https://arxiv.org/abs/2210.08590):
|
|
|
```text |
|
@article{Ziya2, |
|
author = {Ruyi Gan and Ziwei Wu and Renliang Sun and Junyu Lu and Xiaojun Wu and Dixiang Zhang and Kunhao Pan and Ping Yang and Qi Yang and Jiaxing Zhang and Yan Song},
|
title = {Ziya2: Data-centric Learning is All LLMs Need}, |
|
year = {2023} |
|
} |
|
``` |
|
|
|
欢迎引用我们的[网站](https://github.com/IDEA-CCNL/Fengshenbang-LM/):

You can also cite our [website](https://github.com/IDEA-CCNL/Fengshenbang-LM/):
|
|
|
```text |
|
@misc{Fengshenbang-LM, |
|
title={Fengshenbang-LM}, |
|
author={IDEA-CCNL}, |
|
year={2021}, |
|
howpublished={\url{https://github.com/IDEA-CCNL/Fengshenbang-LM}}, |
|
} |
|
``` |
|
|