---
license: gpl-3.0
language:
- zh
- en
library_name: transformers
pipeline_tag: text-generation
---

# Ziya-Coding-15B-v1


# Ziya Model Series

- [Ziya-LLaMA-13B-v1.1](https://huggingface.co/IDEA-CCNL/Ziya-LLaMA-13B-v1.1)
- [Ziya-LLaMA-13B-v1](https://huggingface.co/IDEA-CCNL/Ziya-LLaMA-13B-v1)
- [Ziya-LLaMA-7B-Reward](https://huggingface.co/IDEA-CCNL/Ziya-LLaMA-7B-Reward)
- [Ziya-LLaMA-13B-Pretrain-v1](https://huggingface.co/IDEA-CCNL/Ziya-LLaMA-13B-Pretrain-v1)
- [Ziya-BLIP2-14B-Visual-v1](https://huggingface.co/IDEA-CCNL/Ziya-BLIP2-14B-Visual-v1)
- [Ziya-Writing-LLaMa-13B-v1](https://huggingface.co/IDEA-CCNL/Ziya-Writing-LLaMa-13B-v1)

## Brief Introduction

Ziya-Coding-15B-v1 is a pretrained code model with 15.5 billion parameters, based on StarCoderBase. Following instructions, it can handle a range of code-related tasks, including code generation and modification, code explanation, code completion, and NL2SQL. Ziya-Coding-15B-v1 has so far completed large-scale pre-training (PT) and supervised fine-tuning (SFT).

For more details, see our WeChat official account article:

[Ziya LLM Series | Code model Ziya-Coding released: low-cost fine-tuning is enough to learn to program in proprietary scenarios](https://mp.weixin.qq.com/s/tWaRF1wL3HM87ZDEawd2UA)

## Software Dependencies
```
pip install torch==1.12.1 tokenizers==0.13.3 git+https://github.com/huggingface/transformers
```

## Model Taxonomy

| Demand | Task | Series | Model | Parameters | Extra |
| :----: | :----: | :----: | :----: | :----: | :----: |
| Coding | AGI model | Ziya | StarCoderBase | 15.5B | English & Chinese |

## Model Information

### Continual pretraining

Because StarCoderBase was trained almost entirely on code, its language understanding and instruction-following abilities are relatively weak, and it is far from usable when code must be generated from Chinese prompts. To build on its strong code generation ability while improving its Chinese language understanding, we selected 100B tokens of high-quality Chinese, English, and code data from our self-built pre-training corpus and continued pre-training on it.

For this incremental training we used 144 A100 GPUs (40GB each) for 10 days, with a batch size of 2.6M. Techniques such as FlashAttention and Multi-Query Attention were used to accelerate training and reduce GPU memory usage, reaching a throughput of 139.8 TFLOPS.
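
The training figures above can be turned into a rough budget with a few lines of arithmetic. The sketch below is illustrative only: it assumes the 2.6M batch size is counted in tokens, which the text does not state, and the function name `training_budget` is made up for this example.

```python
# Back-of-the-envelope arithmetic from the figures quoted above.
# ASSUMPTION: the 2.6M batch size is measured in tokens (not stated in the source).
TOTAL_TOKENS = 100e9   # 100B tokens of continual pre-training data
BATCH_TOKENS = 2.6e6   # batch size, assumed to be tokens per optimizer step
NUM_GPUS = 144         # 40GB A100s
TRAIN_DAYS = 10


def training_budget(total_tokens, batch_tokens, num_gpus, days):
    """Return (optimizer steps, GPU-days, average tokens per second)."""
    steps = total_tokens / batch_tokens
    gpu_days = num_gpus * days
    tokens_per_second = total_tokens / (days * 24 * 3600)
    return steps, gpu_days, tokens_per_second


steps, gpu_days, tps = training_budget(TOTAL_TOKENS, BATCH_TOKENS, NUM_GPUS, TRAIN_DAYS)
print(f"~{steps:,.0f} steps, {gpu_days} GPU-days, ~{tps:,.0f} tokens/s")
```

Under that assumption the run corresponds to roughly 38,500 optimizer steps and 1,440 GPU-days.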

### Supervised finetuning

We collected and organized a large number of code task datasets and cleaned them strictly using rules and compilation feedback to build high-quality code instruction data. The data covers a wide variety of tasks, including competition problems, code translation, SQL, code explanation, code generation, and code knowledge Q&A, which ensures the diversity of the instructions.

We also generated additional high-quality general instruction data with the self-instruct and evol-instruct methods.

Fine-tuning was carried out in three stages. In the first stage, we used 450,000 general Chinese samples (drawn from our self-built instruction dataset) to align the model with human intent. In the second stage, we used Chinese and English code instruction data to elicit the model's coding ability. In the third stage, we used compilation feedback to construct strictly filtered, high-quality code generation data, which further improved generation accuracy.
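
The actual cleaning pipeline is not described here, so the following is only a minimal sketch of what filtering by compilation feedback could look like for Python samples. It uses the built-in `compile()` to check that a candidate parses without running it; the sample records and the `instruction`/`response` field names are hypothetical.

```python
def compiles(source: str) -> bool:
    """Return True if `source` parses as valid Python (the code is not executed)."""
    try:
        compile(source, "<candidate>", "exec")
        return True
    except (SyntaxError, ValueError):
        return False


def filter_by_compilation(samples):
    """Keep only the samples whose response compiles."""
    return [s for s in samples if compiles(s["response"])]


# Hypothetical instruction data: the second response is missing a colon.
samples = [
    {"instruction": "Add two numbers.", "response": "def add(a, b):\n    return a + b\n"},
    {"instruction": "Add two numbers.", "response": "def add(a, b) return a + b"},
]
kept = filter_by_compilation(samples)
print(len(kept), "of", len(samples), "samples kept")
```

A syntax check like this only catches code that fails to parse. Compiled languages would need a real compiler invocation, and checking functional correctness would require running tests.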

### Performance

| Model | HumanEval | MBPP |
| :----: | :----: | :----: |
| Ziya-Coding-15B-v1 | pass@1: 50.1, pass@10: 77.1, pass@100: 91.4 | pass@1: 50.2 |

We removed the evaluation tasks' datasets from the fine-tuning data to avoid data leakage. The HumanEval pass@1 score was obtained with greedy decoding, while pass@10 and pass@100 were obtained by sampling with temperature=0.9.
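
The text does not say how pass@k was computed. A common choice is the unbiased estimator introduced with HumanEval: draw n samples per problem, count the c that pass, and estimate pass@k as 1 - C(n-c, k) / C(n, k). The sketch below assumes that estimator and is not the authors' evaluation script.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of samples generated, c: number of samples that pass, k: budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example with made-up counts: 200 samples, 100 of them correct.
print(pass_at_k(n=200, c=100, k=1))
print(pass_at_k(n=200, c=100, k=10))
```

The benchmark score is the mean of this value over all problems. With greedy decoding there is one sample per problem, so pass@1 reduces to the fraction of problems solved.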

## <span id="jump"> Usage </span>
```python3
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the model in half precision; device_map="auto" places the weights on the available GPU(s).
model = AutoModelForCausalLM.from_pretrained("IDEA-CCNL/Ziya-Coding-15B-v1", torch_dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("IDEA-CCNL/Ziya-Coding-15B-v1", use_fast=False)

# Prompt format: a fixed preamble, then the <|Human|>: and <|Bot|>: turn markers.
prompt = "写一段快速排序"  # "Write a quicksort"
pre_prompt = "The following is a conversation between a human and an artificial intelligence assistant developed by IDEA."
input_text = pre_prompt + "<|Human|>:" + prompt + "<|Bot|>:"

# Move the inputs to the device that holds the model.
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(model.device)
generate_ids = model.generate(
    input_ids,
    max_new_tokens=512,
    do_sample=True,
    top_p=0.85,
    temperature=1.0,
    repetition_penalty=1.0,
    eos_token_id=tokenizer.encode("<|end|>"),  # stop at the end-of-turn marker
)
output = tokenizer.batch_decode(generate_ids)[0]
print(output)
```
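
The decoded output from the example above contains the full prompt as well as the reply. As a small convenience, the prompt format can be wrapped in helpers that build the input string and pull out only the model's answer. This is a sketch: `build_prompt` and `extract_reply` are hypothetical names, not part of the released code, and the marker strings are copied from the usage example.

```python
PRE_PROMPT = (
    "The following is a conversation between a human and an artificial "
    "intelligence assistant developed by IDEA."
)
HUMAN_TAG = "<|Human|>:"
BOT_TAG = "<|Bot|>:"
END_TAG = "<|end|>"


def build_prompt(query: str) -> str:
    """Wrap a user query in the single-turn prompt format used above."""
    return PRE_PROMPT + HUMAN_TAG + query + BOT_TAG


def extract_reply(decoded: str) -> str:
    """Return the text between the first <|Bot|>: marker and <|end|>."""
    _, found, reply = decoded.partition(BOT_TAG)
    if not found:
        # No marker present: fall back to the whole string.
        reply = decoded
    return reply.partition(END_TAG)[0].strip()


# Simulated decoder output, standing in for tokenizer.batch_decode(...)[0].
decoded = build_prompt("Write a quicksort") + " def quicksort(xs): ...<|end|>"
print(extract_reply(decoded))
```

Keeping the markers in one place avoids subtle mismatches between the string used for generation and the one used for parsing.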

## Citation

If you use our model in your work, please cite our [paper](https://arxiv.org/abs/2210.08590):

```text
@article{fengshenbang,
  author    = {Jiaxing Zhang and Ruyi Gan and Junjie Wang and Yuxiang Zhang and Lin Zhang and Ping Yang and Xinyu Gao and Ziwei Wu and Xiaoqun Dong and Junqing He and Jianheng Zhuo and Qi Yang and Yongfeng Huang and Xiayu Li and Yanghan Wu and Junyu Lu and Xinyu Zhu and Weifeng Chen and Ting Han and Kunhao Pan and Rui Wang and Hao Wang and Xiaojun Wu and Zhongshen Zeng and Chongpei Chen},
  title     = {Fengshenbang 1.0: Being the Foundation of Chinese Cognitive Intelligence},
  journal   = {CoRR},
  volume    = {abs/2209.02970},
  year      = {2022}
}
```

You can also cite our [website](https://github.com/IDEA-CCNL/Fengshenbang-LM/):
```text
@misc{Fengshenbang-LM,
  title={Fengshenbang-LM},
  author={IDEA-CCNL},
  year={2021},
  howpublished={\url{https://github.com/IDEA-CCNL/Fengshenbang-LM}},
}
```