---
library_name: transformers
tags:
- code
datasets:
- Leon-Leee/wizardlm_evol_instruct_v2_196K_backuped
- m-a-p/Code-Feedback
- openbmb/UltraInteract_sft
- ise-uiuc/Magicoder-Evol-Instruct-110K
- flytech/python-codes-25k
metrics:
- code_eval
pipeline_tag: text-generation
license: other
license_name: deepseek
---
## AIGCodeGeek-DS-6.7B

### Introduction
AIGCodeGeek-DS-6.7B is the first released member of our Code-LLM family, with competitive performance on both public and private benchmarks.

### Model Details
#### Model Description
- Developed by: [Leon Li](https://huggingface.co/Leon-Leee)
- License: [DeepSeek](https://github.com/deepseek-ai/DeepSeek-Coder/blob/main/LICENSE-MODEL)
- Fine-tuned from [deepseek-ai/deepseek-coder-6.7b-base](https://huggingface.co/deepseek-ai/deepseek-coder-6.7b-base) with full-parameter training

### Training data
A mixture of samples from high-quality open-source datasets (see *Acknowledgements*) and our private datasets.
We performed contamination detection following the approach of Magicoder/BigCode ([find_substrings.py](https://github.com/ise-uiuc/magicoder/blob/main/src/magicoder/decontamination/find_substrings.py)).
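For illustration, the core idea behind that script is to drop any training sample that shares a long verbatim substring with a benchmark solution. Below is a minimal sketch of that idea; the function names and window size are ours, not the actual Magicoder implementation:

```python
# Illustrative sketch of substring-based decontamination (not the actual
# Magicoder script): drop training samples that contain a long verbatim
# substring of any benchmark solution.
def contains_benchmark_substring(sample: str, benchmark_solutions: list[str],
                                 window: int = 30) -> bool:
    """Return True if any `window`-character slice of a benchmark solution
    appears verbatim in the training sample."""
    for solution in benchmark_solutions:
        for start in range(0, max(len(solution) - window, 0) + 1, window):
            if solution[start:start + window] in sample:
                return True
    return False


def decontaminate(train_samples: list[str], benchmark_solutions: list[str]) -> list[str]:
    """Keep only samples with no detected overlap against the benchmark."""
    return [s for s in train_samples
            if not contains_benchmark_substring(s, benchmark_solutions)]
```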

### Evaluation
Results to be added.
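For reference, pass@k on HumanEval-style problems can be computed with the `code_eval` metric from the `evaluate` library; the snippet below is a generic toy example, not our evaluation harness:

```python
import os
import evaluate

# code_eval executes model-generated code, so the library requires explicit opt-in.
os.environ["HF_ALLOW_CODE_EVAL"] = "1"

code_eval = evaluate.load("code_eval")

# Toy example: one problem, one candidate completion.
test_cases = ["assert add(2, 3) == 5"]
candidates = [["def add(a, b):\n    return a + b"]]

pass_at_k, results = code_eval.compute(references=test_cases, predictions=candidates, k=[1])
print(pass_at_k)  # {'pass@1': 1.0}
```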

### Requirements
It should work with the same environment as DeepSeek-Coder-6.7B, or with the following packages:

```
torch>=2.0
tokenizers>=0.14.0
transformers>=4.35.0
accelerate
sympy>=1.12
pebble
timeout-decorator
attrdict
```
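If in doubt, a quick way to check your environment is the standard-library sketch below, covering only the version-pinned packages from the list above:

```python
# Quick sanity check (illustrative): confirm the pinned packages are installed
# and print their versions against the minimums listed above.
from importlib.metadata import PackageNotFoundError, version

minimums = {"torch": "2.0", "tokenizers": "0.14.0", "transformers": "4.35.0", "sympy": "1.12"}
for pkg, minimum in minimums.items():
    try:
        print(f"{pkg}: installed {version(pkg)}, need >= {minimum}")
    except PackageNotFoundError:
        print(f"{pkg}: NOT INSTALLED")
```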


### QuickStart

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("aigcode/AIGCodeGeek-DS-6.7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "aigcode/AIGCodeGeek-DS-6.7B", trust_remote_code=True, torch_dtype=torch.bfloat16
).cuda()

messages = [
    {'role': 'user', 'content': "write a merge sort algorithm in python."}
]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

# tokenizer.eos_token_id is the id of the <|EOT|> token.
# do_sample=False means greedy decoding; top_k/top_p only take effect when sampling.
outputs = model.generate(inputs, max_new_tokens=512, do_sample=False, top_k=50, top_p=0.95,
                         num_return_sequences=1, eos_token_id=tokenizer.eos_token_id)
# Decode only the newly generated tokens (skip the prompt).
print(tokenizer.decode(outputs[0][len(inputs[0]):], skip_special_tokens=True))
```
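For interactive use, the same setup can stream tokens as they are generated. The snippet below is a sketch that reuses the tokenizer and model loaded above with `transformers.TextStreamer`; the sampling settings are illustrative, not recommended defaults:

```python
from transformers import TextStreamer

# Reuse the tokenizer/model loaded above; stream tokens to stdout as they are generated.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
messages = [{'role': 'user', 'content': "write a merge sort algorithm in python."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=0.2, top_p=0.95,
               eos_token_id=tokenizer.eos_token_id, streamer=streamer)
```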


### Acknowledgements
We have gained a lot of knowledge and resources from the open-source community:
- [DeepSeekCoder](https://huggingface.co/deepseek-ai): impressive model series and insightful tech reports
- [WizardCoder](https://github.com/nlpxucan/WizardLM/tree/main/WizardCoder): Evol Instruct and public datasets
  - We use a backup copy ([Leon-Leee/wizardlm_evol_instruct_v2_196K_backuped](https://huggingface.co/datasets/Leon-Leee/wizardlm_evol_instruct_v2_196K_backuped)) since the original dataset has been deleted.
- [Magicoder](https://github.com/ise-uiuc/magicoder/): OSS-Instruct, and [Magicoder-Evol-Instruct-110K](https://huggingface.co/datasets/ise-uiuc/Magicoder-Evol-Instruct-110K), derived from [theblackcat102/evol-codealpaca-v1](https://huggingface.co/datasets/theblackcat102/evol-codealpaca-v1)
- [Eurus](https://github.com/OpenBMB/Eurus): creative datasets for reasoning, [openbmb/UltraInteract_sft](https://huggingface.co/datasets/openbmb/UltraInteract_sft)
- [OpenCodeInterpreter](https://opencodeinterpreter.github.io/): a well-designed system and the [m-a-p/Code-Feedback](https://huggingface.co/datasets/m-a-p/Code-Feedback) dataset
- [flytech/python-codes-25k](https://huggingface.co/datasets/flytech/python-codes-25k): adds diversity to the training mix
- [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory): an easy-to-use framework for fine-tuning base models