---
license: llama2
---
# NewHope: Harnessing 99% of GPT-4's Programming Capabilities

We introduce NewHope, a fine-tuned chat model based on llama-2-13b that aims to provide strong coding capability. NewHope handles multiple languages, including Python, C++, Java, JavaScript, Go, and more. Preliminary evaluation on HumanEval shows that **NewHope possesses 99% of GPT-4's programming capabilities**.

**Contact**: SLAM (<ins>S</ins>UFE <ins>L</ins>arge <ins>A</ins>I <ins>M</ins>odel) is a research group at Shanghai University of Finance and Economics. 
cui.wanyun@sufe.edu.cn 

**TODO**: We will release more evaluation results and training details later.

# Evaluation Results

We evaluated NewHope on [HumanEval](https://github.com/openai/human-eval) using the official evaluation script from OpenAI, comparing NewHope's Pass@1 against other models. Results for the other models are taken from PapersWithCode.

| Model | Pass@1 (%) |
| ----- | ------ |
| **GPT-4** | **67.0**   |
| **NewHope** | **66.5**  | 
| PanGu-Coder2 15B | 61.6   |
| WizardCoder 15B | 57.3  |
| phi-1 1.3B | 50.6 |
| GPT-3.5 | 48.1 |
| phi-1-small | 45.0 |
| PaLM-Coder | 36.0 |
| CodeGeeX2-6B | 35.9 |
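
For context, Pass@1 is the fraction of HumanEval problems for which a generated completion passes all of the problem's unit tests. With n samples generated per problem, of which c are correct, the unbiased Pass@k estimator from the HumanEval paper is

$$
\text{pass@}k = \mathbb{E}\left[ 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \right]
$$

which for k = 1 reduces to the expected fraction of correct samples, c/n.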

# Model Weights

We have open-sourced the model weights at [NewHope](https://huggingface.co/SLAM-group/NewHope). The upload is still in progress, so the weights will be available within a few hours.


# Usage

To load the NewHope model using Transformers, use the following code:
```python
import torch
from transformers import LlamaTokenizer, LlamaForCausalLM

base_model = "SLAM-group/NewHope"
tokenizer = LlamaTokenizer.from_pretrained(base_model)
model = LlamaForCausalLM.from_pretrained(base_model, torch_dtype=torch.float16, device_map="auto")
# model.config.use_cache defaults to False; set model.config.use_cache = True for inference.
```
**Note:** At least Hugging Face Transformers **4.31.0** is required to load this model!
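
As a quick sanity check, you can verify the installed version before loading (a minimal sketch using the `packaging` helper, not part of the released code):
```python
from packaging import version
import transformers

# Llama 2 support landed in Transformers 4.31.0; older versions cannot load NewHope.
assert version.parse(transformers.__version__) >= version.parse("4.31.0"), \
    "Please upgrade: pip install -U 'transformers>=4.31.0'"
```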

You can ask NewHope to generate code from instructions. Below is a simple example of how the NewHope model generates code for a specific prompt:
```python
# Suppose required tokenizer and model have already been loaded

instruction = "Write a Python function to tell me what the date is today."
prompt = f"<s> ### Instruction:\n{instruction}\n\n### Response:\n"
inputs = tokenizer(prompt, add_special_tokens=False, return_tensors="pt").to("cuda")
output = model.generate(**inputs, do_sample=True, top_p=0.9, max_new_tokens=2048)[0]
decoded_output = tokenizer.decode(output, skip_special_tokens=True).split("### Response:\n")[-1].strip()
print(decoded_output)
```

You can also interact with NewHope in a dialog manner using the following prompt format:
```
<s> ### Instruction:\nQ1\n\n### Response:\nA1</s><s> ### Instruction:\nQ2\n\n### Response:\nA2</s>
```
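
As an illustration, here is a minimal sketch of how this multi-turn format could be assembled programmatically. The `build_prompt` helper and `history` list are our own illustrative names, not part of the released code:
```python
# Hypothetical helper for assembling the multi-turn prompt format shown above.
def build_prompt(history: list[tuple[str, str]], new_instruction: str) -> str:
    """history is a list of (instruction, response) pairs from earlier turns."""
    prompt = ""
    for instruction, response in history:
        # Completed turns are closed with </s>, matching the template above.
        prompt += f"<s> ### Instruction:\n{instruction}\n\n### Response:\n{response}</s>"
    # The current turn is left open so the model generates the response.
    prompt += f"<s> ### Instruction:\n{new_instruction}\n\n### Response:\n"
    return prompt

history = [("Write a Python function to tell me what the date is today.", "...")]
prompt = build_prompt(history, "Now make it return an ISO-8601 string instead.")
```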


# Evaluation

### Local setup
1. Install HumanEval for evaluation. [Details](https://github.com/openai/human-eval)
2. Install dependencies

   ```bash
   pip install -r requirements.txt
   ```

---
For HumanEval, we use the following prompt:
```python
example_input = 'def is_odd(number: int) -> bool:\n    """ Check whether the given number is odd\n    >>> is_odd(3)\n    True\n    >>> is_odd(6)\n    False\n    """\n'
example_output = 'def is_odd(number: int) -> bool:\n    """ Check whether the given number is odd\n    >>> is_odd(3)\n    True\n    >>> is_odd(6)\n    False\n    """\n    return number % 2 == 1'

task_in_humaneval = "REPLACE `task_in_humaneval` WITH THE SPECIFIC TASK IN HUMANEVAL DATA"

prompt = f"<s> ### Instruction:\nComplete the given function below:\n\n{example_input}\n\n### Response:\n{example_output}</s><s> ### Instruction:\nComplete the given function below:\n\n{task_in_humaneval}\n\n### Response:\n"
```
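
To make this concrete, here is a minimal sketch of what the generation loop over HumanEval could look like, assuming `tokenizer` and `model` are loaded as in the Usage section and `example_input`/`example_output` are defined as above. The repository's `complete.py` is the authoritative script; this only illustrates the idea:
```python
from human_eval.data import read_problems, write_jsonl

problems = read_problems()  # dict: task_id -> problem, each with a "prompt" field
samples = []
for task_id, problem in problems.items():
    prompt = (
        f"<s> ### Instruction:\nComplete the given function below:\n\n{example_input}"
        f"\n\n### Response:\n{example_output}</s>"
        f"<s> ### Instruction:\nComplete the given function below:\n\n{problem['prompt']}"
        f"\n\n### Response:\n"
    )
    inputs = tokenizer(prompt, add_special_tokens=False, return_tensors="pt").to("cuda")
    output = model.generate(**inputs, do_sample=True, top_p=0.9, max_new_tokens=2048)[0]
    # Keep only the final response, i.e. the completion for the current task.
    completion = tokenizer.decode(output, skip_special_tokens=True).split("### Response:\n")[-1].strip()
    samples.append(dict(task_id=task_id, completion=completion))

write_jsonl("output/samples.jsonl", samples)  # the output directory must already exist
```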

To reproduce the results on HumanEval, use the following script:
```bash
python complete.py --base_model SLAM-group/NewHope --output_dir output --n_gpu 8
```
The script above generates `samples.jsonl` in `output_dir`, which can be evaluated directly with HumanEval ([evaluation procedure](https://github.com/openai/human-eval)). We conducted the experiment in `fp16` on 8x A800 80GB GPUs, reaching `66.5%` Pass@1 (vs. GPT-4's `67.0%`).
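
Concretely, once the `human-eval` package is installed, scoring reduces to its command-line entry point (the path below assumes the `output` directory from the script above):
```bash
evaluate_functional_correctness output/samples.jsonl
```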

# Citation

```
@misc{2023newhope,
    title={NewHope: Harnessing 99% of GPT-4's Programming Capabilities},
    author={Wanyun Cui and Qianle Wang},
    howpublished = {\url{https://github.com/SLAM-group/newhope}},
    year={2023}
}
```