metadata

license: gpl-3.0
language:
  - zh
  - en
library_name: transformers
pipeline_tag: text-generation

Ziya-Coding-15B-v1

Ziya Model Series

Brief Introduction

Ziya-Coding-15B-v1 is a large-scale pre-trained model with 15.5 billion parameters, based on StarCoderBase. Following instructions, it can handle a range of code-related tasks, including generating and modifying code, explaining code, continuing code, and NL2SQL. Ziya-Coding-15B-v1 has so far completed large-scale pre-training (PT) and supervised fine-tuning (SFT).

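For more details, see our WeChat official account article:

Ziya Large Model Series | Code model ziya-coding released! Low-cost fine-tuning is all it takes to learn to program in proprietary scenarios

Software Dependencies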

pip install torch==1.12.1 tokenizers==0.13.3 git+https://github.com/huggingface/transformers

Model Taxonomy

Demand	Task	Series	Model	Parameter	Extra
Coding	AGI model	Ziya	StarCoderBase	15.5B	English & Chinese

Model Information

Continual pretraining

The base model, StarCoderBase, was trained on the large-scale code dataset The Stack (v1.2), so its natural-language comprehension is relatively weak. To improve the model's understanding of Chinese, we sampled a mixed corpus of Chinese, English, and code totaling 100B tokens from our in-house pre-training corpus and continued pre-training on it.

For this continual pre-training we used 144 A100 GPUs (40GB each) for 10 days, with a batch size of 2.6M. FlashAttention and Multi-Query Attention were used to speed up training and reduce GPU memory usage, reaching a throughput of 139.8 TFLOPS per GPU.
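The figures above can be turned into a few back-of-the-envelope numbers. The sketch below is only illustrative: it assumes the batch size of 2.6M is counted in tokens (the card does not state the unit) and compares the reported throughput against 312 TFLOPS, the dense FP16/BF16 tensor-core peak NVIDIA quotes for the A100. The function names are hypothetical.

```python
# Figures quoted in the model card.
NUM_GPUS = 144
TRAINING_DAYS = 10
TOTAL_TOKENS = 100e9
REPORTED_FLOPS_PER_GPU = 139.8e12

# ASSUMPTION: the card's batch size of 2.6M is measured in tokens.
BATCH_TOKENS = 2.6e6
# ASSUMPTION: A100 dense FP16/BF16 tensor-core peak, used as the reference.
A100_PEAK_FLOPS = 312e12


def gpu_hours(num_gpus: int, days: float) -> float:
    """Total GPU-hours consumed by the run."""
    return num_gpus * days * 24


def optimizer_steps(total_tokens: float, batch_tokens: float) -> float:
    """Approximate number of optimizer steps needed to see all tokens once."""
    return total_tokens / batch_tokens


def utilization(achieved_flops: float, peak_flops: float) -> float:
    """Fraction of the hardware peak that the reported throughput reaches."""
    return achieved_flops / peak_flops


if __name__ == "__main__":
    print(f"GPU-hours: {gpu_hours(NUM_GPUS, TRAINING_DAYS):,.0f}")
    print(f"Optimizer steps (approx.): {optimizer_steps(TOTAL_TOKENS, BATCH_TOKENS):,.0f}")
    print(f"Utilization vs. peak: {utilization(REPORTED_FLOPS_PER_GPU, A100_PEAK_FLOPS):.1%}")
```

Whether the utilization figure corresponds to model FLOPs or hardware FLOPs depends on how the throughput was measured, which the card does not specify.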

Supervised finetuning

We collected all publicly available code task datasets from the internet and cleaned them rigorously to build a code instruction database. The data covers competition problems, code translation, SQL, code explanation, general code instructions, StackOverflow, and more, which keeps the instructions diverse.

We also generated additional high-quality general instruction data using the self-instruct and evol-instruct methods.

Fine-tuning has two stages. In the first stage, we used 450,000 general Chinese examples (sampled from our in-house instruction dataset) to align the model with human intent. In the second stage of supervised training, we used Chinese and English code instruction data to bring out the model's coding ability.

Performance

Model	HumanEval	MBPP
Ziya-Coding-15B-v1	pass@1:50.1 pass@10:77.1 pass@100:91.4	pass@1:50.2

We removed the evaluation tasks' datasets from the fine-tuning data to avoid data leakage. The HumanEval pass@1 score is computed from greedy decoding, while pass@10 and pass@100 are computed from samples generated at temperature 0.9.
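For readers unfamiliar with pass@k, the sketch below implements the unbiased estimator that is commonly used with HumanEval: for a problem with n generated samples of which c pass the tests, pass@k = 1 - C(n-c, k) / C(n, k). The card does not say which estimator was used for the scores above, so treat this as an illustration of the metric rather than the authors' evaluation code; the function name is hypothetical.

```python
from math import prod


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of samples generated for the problem
    c: number of those samples that pass all tests
    k: budget of samples the metric considers (k <= n)
    """
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    # 1 - C(n-c, k) / C(n, k), written as a product for numerical stability.
    return 1.0 - prod(1.0 - k / i for i in range(n - c + 1, n + 1))


if __name__ == "__main__":
    # Example: 200 samples for one problem, 37 of them correct.
    for k in (1, 10, 100):
        print(f"pass@{k}: {pass_at_k(200, 37, k):.3f}")
```

The benchmark score is the mean of this per-problem estimate over all problems. With greedy decoding there is a single sample per problem, so pass@1 reduces to the fraction of problems solved.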

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "IDEA-CCNL/Ziya-Coding-15B-v1"
# "Write a quicksort implementation" (the model accepts Chinese and English prompts)
prompt = "写一段快速排序"

# Load the weights in fp16; device_map="auto" places them on the available GPU(s).
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)

# Wrap the query in the conversation template used by the model.
pre_prompt = "The following is a conversation between a human and an artificial intelligence assistant developed by IDEA."
input_text = pre_prompt + "<|Human|>:" + prompt + "<|Bot|>:"

# Move the inputs to the device that holds the model.
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(model.device)
generate_ids = model.generate(
    input_ids,
    max_new_tokens=512,
    do_sample=True,
    top_p=0.85,
    temperature=1.0,
    repetition_penalty=1.0,
    eos_token_id=tokenizer.encode("<|end|>"),  # stop at the end-of-turn marker
)
# The decoded text contains the prompt followed by the model's reply.
output = tokenizer.batch_decode(generate_ids)[0]
print(output)
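Because the decoded output includes the prompt as well as the reply, it can be convenient to wrap the template in small helpers. The following is a minimal sketch that relies only on the markers shown in the usage code (`<|Human|>:`, `<|Bot|>:`, `<|end|>`); the helper names `build_prompt` and `extract_reply` are hypothetical and not part of any library.

```python
PRE_PROMPT = (
    "The following is a conversation between a human and an artificial "
    "intelligence assistant developed by IDEA."
)
HUMAN_TAG = "<|Human|>:"
BOT_TAG = "<|Bot|>:"
END_TAG = "<|end|>"


def build_prompt(query: str) -> str:
    """Wrap a user query in the conversation template from the usage code."""
    return PRE_PROMPT + HUMAN_TAG + query + BOT_TAG


def extract_reply(decoded: str) -> str:
    """Return only the model's answer from the full decoded text."""
    # Keep what follows the first bot tag (the whole text if the tag is absent).
    reply = decoded.split(BOT_TAG, 1)[-1]
    # Drop the end-of-turn marker and anything generated after it.
    reply = reply.split(END_TAG, 1)[0]
    return reply.strip()


if __name__ == "__main__":
    fake_output = build_prompt("Write a quicksort") + " def quicksort(xs): ...<|end|>"
    print(extract_reply(fake_output))
```

In the usage code, `build_prompt(prompt)` would replace the manual string concatenation, and `extract_reply(output)` would be applied to the decoded string before printing.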

Citation

If you use our model in your work, please cite our paper:

@article{fengshenbang,
  author    = {Jiaxing Zhang and Ruyi Gan and Junjie Wang and Yuxiang Zhang and Lin Zhang and Ping Yang and Xinyu Gao and Ziwei Wu and Xiaoqun Dong and Junqing He and Jianheng Zhuo and Qi Yang and Yongfeng Huang and Xiayu Li and Yanghan Wu and Junyu Lu and Xinyu Zhu and Weifeng Chen and Ting Han and Kunhao Pan and Rui Wang and Hao Wang and Xiaojun Wu and Zhongshen Zeng and Chongpei Chen},
  title     = {Fengshenbang 1.0: Being the Foundation of Chinese Cognitive Intelligence},
  journal   = {CoRR},
  volume    = {abs/2209.02970},
  year      = {2022}
}

You can also cite our website:

@misc{Fengshenbang-LM,
  title={Fengshenbang-LM},
  author={IDEA-CCNL},
  year={2021},
  howpublished={\url{https://github.com/IDEA-CCNL/Fengshenbang-LM}},
}