---
license: openrail++
language:
- en
tags:
- text-to-code
- multilingual-code-generation
---
<!-- <p align="center" width="70%">
<img src="assets/Logo.jpg" alt="HKUST CodeUp" style="width: 50%; min-width: 250px; display: block; margin: auto;">
</p> -->
![HKUST CodeUp](assets/Logo.jpg)
# CodeUp: A Multilingual Code Generation Llama2 Model with Parameter-Efficient Instruction-Tuning on a Single RTX 3090
## Description
In recent years, large language models (LLMs) have shown exceptional capabilities across a wide range of applications, owing to their emergent abilities. To align them with human preferences, instruction tuning and reinforcement learning from human feedback (RLHF) have been proposed for chat-based LLMs (e.g., ChatGPT, GPT-4). However, these LLMs (except for Codex) primarily target the general domain and are not specifically designed for code. Although Codex is an alternative, it is a closed-source model developed by OpenAI. Hence, open-source instruction-following LLMs for the code domain are needed.
However, the large number of LLM parameters ($\ge$7B) and the size of the training datasets demand vast computational resources, which significantly impedes training and inference on consumer hardware.
To address these challenges, we adopt the recent foundation model `Llama 2`, construct high-quality instruction-following data for code generation tasks, and propose an instruction-following multilingual code generation Llama2 model. To fit an academic budget and consumer hardware (e.g., a single RTX 3090), we build on `Alpaca-LoRA` and equip `CodeUp` with parameter-efficient fine-tuning (PEFT) methods (e.g., [LoRA](https://arxiv.org/abs/2106.09685)), which adapt pre-trained language models (PLMs, also known as foundation models) to various downstream applications without fine-tuning all of the model's parameters. The overall training recipe is as follows.
![Training Framework](assets/Framework.jpg)
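To make the parameter-efficiency argument concrete, here is a minimal pure-Python sketch of the low-rank update from the LoRA paper: the frozen weight `W` is left untouched and only two small factors `A` and `B` are trained. The function names, the toy matrices, and the 4096 x 4096 layer size with rank 8 are illustrative assumptions, not the configuration used to train CodeUp.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, scale=1.0):
    """Compute h = W x + scale * B (A x).

    W is the frozen pre-trained weight (d_out x d_in).
    A (rank x d_in) and B (d_out x rank) are the trainable low-rank factors.
    `scale` plays the role of alpha / rank in the LoRA paper.
    """
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]


def full_params(d_out, d_in):
    """Parameters updated when fine-tuning the whole weight matrix."""
    return d_out * d_in


def lora_params(d_out, d_in, rank):
    """Parameters updated when training only the factors A and B."""
    return rank * (d_out + d_in)


# Toy example: with B initialised to zeros, the adapted layer reproduces
# the frozen layer exactly, so training starts from the pre-trained model.
W = [[1, 2], [3, 4]]
A = [[1, 1]]        # rank 1
B_zero = [[0], [0]]
x = [1, 1]
assert lora_forward(W, A, B_zero, x) == matvec(W, x)

# Illustrative size only: one 4096 x 4096 projection with rank 8.
full = full_params(4096, 4096)
lora = lora_params(4096, 4096, 8)
print(f"full: {full}, LoRA: {lora}, ratio: {lora / full:.4%}")
```

For this illustrative layer, the trainable parameters shrink by a factor of 256, which is the kind of saving that makes a single consumer GPU plausible.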
## NL2Code Data Release
Recently, exploiting much larger and more powerful LLMs (e.g., ChatGPT, GPT-4) to self-generate instruction-following data through careful prompt design has attracted significant attention. However, many approaches primarily focus on the general domain and lack code-specific considerations. To this end, [Code Alpaca](https://github.com/sahil280114/codealpaca) follows the Self-Instruct paper [3] and the [Stanford Alpaca repo](https://github.com/tatsu-lab/stanford_alpaca) with some code-related modifications to construct 20K instruction-following examples, `data/code_alpaca_20k.json`, for code generation tasks. This `JSON` file follows the `alpaca_data.json` format: it is a list of dictionaries, each containing the following fields:
- `instruction`: `str`, describes the task the model should perform. Each of the 20K instructions is unique.
- `input`: `str`, optional context or input for the task. For example, when the instruction is "Amend the following SQL query to select distinct elements", the input is the SQL query. Around 40% of the examples have an input.
- `output`: `str`, the answer to the instruction as generated by `text-davinci-003`.
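As a rough illustration of the record layout described above, the sketch below parses a small JSON list and keeps only records that carry all three string fields. The sample records (apart from the SQL instruction quoted above) and the helper name `validate_records` are invented for illustration; in practice the list would be read from `data/code_alpaca_20k.json`.

```python
import json

REQUIRED_FIELDS = ("instruction", "input", "output")


def validate_records(records):
    """Keep only dictionaries whose three expected fields are all strings."""
    return [
        rec for rec in records
        if isinstance(rec, dict)
        and all(isinstance(rec.get(field), str) for field in REQUIRED_FIELDS)
    ]


# Invented sample standing in for the contents of data/code_alpaca_20k.json.
sample_json = """
[
  {
    "instruction": "Amend the following SQL query to select distinct elements",
    "input": "SELECT * FROM employees;",
    "output": "SELECT DISTINCT * FROM employees;"
  },
  {
    "instruction": "Write a Python function that adds two numbers",
    "input": "",
    "output": "def add(a, b):\\n    return a + b"
  },
  {"instruction": "Record with a missing field"}
]
"""

records = validate_records(json.loads(sample_json))

# `input` is optional context, so an empty string means "no input".
with_input = sum(1 for rec in records if rec["input"].strip())
print(f"{len(records)} valid records, {with_input} with an input")
```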
### High-quality Data Filter
However, after carefully checking the LLM-self-generated data, we observe three critical problems that introduce ambiguous and irrelevant noise and may hinder LLMs' instruction learning:
1. When `instruction` doesn't specify the programming language (PL) of implementation, the `output` appears in diverse languages, e.g., Python, C++, and JavaScript.
2. It is ambiguous which programming language the `output` is implemented in.
3. Both `instruction` and `output` are unrelated to the code domain.
Hence, we filter out the ambiguous and irrelevant data to obtain high-quality instruction data. Specifically, for 1) we set Python as the default PL of implementation and use the [Guesslang](https://guesslang.readthedocs.io/en/latest/) package to detect the PL of the source code in `output`. If Python is detected, the prompt is retained; otherwise, it is filtered out. For 2) and 3), we delete the prompts. This removes about 5K low-quality instruction examples. To supplement the high-quality instruction data, we further integrate the `data/new_codealpaca.json` data (about 4.5K) under the same filter rules.
This way, we obtain 19K high-quality instruction examples for code generation. The following radar chart shows the number of instructions per PL before and after filtering.
<!-- | Raw Data (20K + 4K)| Filtered Data (19K) |
| -- | -- |
| <center><img src="assets/PL_Raw.png" width="100%"></center> | <center><img src="assets/PL_Clean.png" width="92%"></center> | -->
![PL Data Filtering](assets/PL_Filter.jpg)
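The filter rules described in this section can be sketched roughly as follows. This is not the project's actual filtering script: the keyword list, the function names, and the `toy_detector` stand-in are all hypothetical, and two simplifying assumptions are made. First, a detector result of `None` is treated as covering both the ambiguous case 2) and the non-code case 3). Second, records whose instruction names a PL are kept whenever any language is detected, which the description above does not spell out. A real run would pass a Guesslang-based detector instead of the toy one.

```python
from typing import Callable, Dict, List, Optional

Record = Dict[str, str]
Detector = Callable[[str], Optional[str]]

# Hypothetical keyword list used to decide whether an instruction names a PL.
NAMED_LANGUAGES = ("python", "javascript", "java", "c++", "sql")


def names_language(instruction: str) -> bool:
    """Return True if the instruction mentions one of the listed languages."""
    text = instruction.lower()
    return any(lang in text for lang in NAMED_LANGUAGES)


def keep_record(record: Record, detect_language: Detector) -> bool:
    """Apply the three filter rules to a single record."""
    detected = detect_language(record["output"])
    if detected is None:
        # Assumption: undetectable output covers cases 2) and 3).
        return False
    if names_language(record["instruction"]):
        # Assumption: an explicitly requested PL is left as-is.
        return True
    # Case 1): no PL requested, so Python is the default.
    return detected == "Python"


def filter_records(records: List[Record], detect_language: Detector) -> List[Record]:
    """Return the records that pass the filter, preserving order."""
    return [rec for rec in records if keep_record(rec, detect_language)]


def toy_detector(code: str) -> Optional[str]:
    """Crude stand-in for a real language detector such as Guesslang."""
    if "def " in code or "import " in code:
        return "Python"
    if "console.log" in code or "function " in code:
        return "JavaScript"
    return None


examples = [
    {"instruction": "Write a function to add two numbers",
     "input": "", "output": "def add(a, b):\n    return a + b"},
    {"instruction": "Print hello world",
     "input": "", "output": "console.log('hello world');"},
    {"instruction": "Write a JavaScript snippet that prints hello",
     "input": "", "output": "console.log('hello');"},
    {"instruction": "What is the capital of France?",
     "input": "", "output": "Paris"},
]

kept = filter_records(examples, toy_detector)
for rec in kept:
    print(rec["instruction"])
```

Passing the detector in as a callable keeps the rules testable without installing a language-detection model.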
## Training & Inference
Detailed instructions can be found at [https://github.com/juyongjiang/CodeUp](https://github.com/juyongjiang/CodeUp).
## Citation
If you use the data or code in this repo, please cite the repo.
```
@misc{codeup,
author = {Juyong Jiang and Sunghun Kim},
title = {CodeUp: A Multilingual Code Generation Llama2 Model with Parameter-Efficient Instruction-Tuning},
year = {2023},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/juyongjiang/CodeUp}},
}
```
Naturally, you should also cite the original LLaMA V1 [1] and V2 [2] papers, the Self-Instruct paper [3], the LoRA paper [4], the [Stanford Alpaca repo](https://github.com/tatsu-lab/stanford_alpaca), the [Alpaca-LoRA repo](https://github.com/tloen/alpaca-lora), the [Code Alpaca repo](https://github.com/sahil280114/codealpaca), and [PEFT](https://github.com/huggingface/peft).