MatCaviar committed
Commit b554c5a • 1 Parent(s): beb8cb7

Update README.md

Files changed (1)
  1. README.md +54 -3
README.md CHANGED
@@ -1,3 +1,54 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ ---
+
+ # AlchemistCoder: Harmonizing and Eliciting Code Capability by Hindsight Tuning on Multi-source Data
+
+ [[🤗 HuggingFace](https://huggingface.co/internlm/AlchemistCoder-DS-6.7B)]
+ [[📃 Paper](https://arxiv.org/abs/xxxxx)]
+ [[🌐 Project Page](https://internlm.github.io/AlchemistCoder/)]
+
+
+ ## ✨ Highlights
+ > **Abstract:** *Open-source Large Language Models (LLMs) and their specialized variants, particularly Code LLMs, have recently delivered impressive performance. However, previous Code LLMs are typically fine-tuned on single-source data with limited quality and diversity, which may insufficiently elicit the potential of pre-trained Code LLMs. In this paper, we present AlchemistCoder, a series of Code LLMs with enhanced code generation and generalization capabilities fine-tuned on multi-source data. To achieve this, we pioneer to unveil inherent conflicts among the various styles and qualities in multi-source code corpora and introduce data-specific prompts with hindsight relabeling, termed AlchemistPrompts, to harmonize different data sources and instruction-response pairs. Additionally, we propose incorporating the data construction process into the fine-tuning data as code comprehension tasks, including instruction evolution, data filtering, and code review. Extensive experiments demonstrate that AlchemistCoder holds a clear lead among all models of the same size (6.7B/7B) and rivals or even surpasses larger models (15B/33B/70B), showcasing the efficacy of our method in refining instruction-following capabilities and advancing the boundaries of code intelligence.*
+
+ - **AlchemistPrompts**: Designed as data-specific prompts for harmonizing inherent conflicts in multi-source data and mitigating instruction/response misalignment at a fine-grained level (see the illustrative sketch after this list).
+ - **Code Comprehension Tasks**: Sourced from the data construction process itself, consisting of instruction evolution, data filtering, and code review.
+ - **Harmonized Multi-source Data**: Instruction-tuned on 200M tokens, covering 6 types of high-quality data.
+ - **Superior Model Performance**: Surpasses all open-source models of the same size (6.7B/7B) and rivals or even beats larger models (15B/33B/70B/ChatGPT) on 6 code benchmarks.
+ - **Advanced Generic Capabilities**: Demonstrated by significant improvements on MMLU, BBH, and GSM8K.
+
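+ The exact format of AlchemistPrompts is described in the paper rather than on this card. The snippet below is only a minimal sketch of the idea: a hindsight prompt, written after the paired response is already known, is prepended to each instruction so that the pair matches the style of its data source. The source names, field names, and prompt wording here are illustrative assumptions, not the released data schema.
+
+ ```python
+ # Minimal sketch (not the official pipeline): hindsight-relabel an
+ # instruction-response pair with a prompt tailored to its data source.
+ # Source names, field names, and wording are assumptions for illustration only.
+ STYLE_BY_SOURCE = {
+     "evolved-instructions": "a detailed, step-by-step solution with explanations",
+     "open-source-snippets": "a concise answer in the style of real repository code",
+ }
+
+ def harmonize(sample: dict) -> dict:
+     """Prepend a data-specific prompt describing how the paired response answers."""
+     hindsight_prompt = f"Answer with {STYLE_BY_SOURCE[sample['source']]}."
+     return {
+         "instruction": f"{hindsight_prompt}\n{sample['instruction']}",
+         "response": sample["response"],
+     }
+
+ print(harmonize({
+     "source": "open-source-snippets",
+     "instruction": "Reverse a singly linked list in Python.",
+     "response": "def reverse(head): ...",
+ })["instruction"])
+ ```
+ In spirit, this relabeling adjusts each instruction to fit the answer it is already paired with, which is how heterogeneous sources can coexist in one fine-tuning mix without pulling the model toward conflicting response styles.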
+
+ ## 🚀 Quick Start
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ # Load the tokenizer and the model weights in bfloat16 on the GPU
+ tokenizer = AutoTokenizer.from_pretrained("internlm/AlchemistCoder-CL-7B", trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained("internlm/AlchemistCoder-CL-7B", trust_remote_code=True, torch_dtype=torch.bfloat16).cuda()
+ model = model.eval()
+
+ # Generate a completion for a coding instruction and decode it back to text
+ input_text = "Implement the Dijkstra algorithm in Python"
+ inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
+ outputs = model.generate(**inputs, max_length=128)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```
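+ Note that `max_length` counts the prompt tokens as well as the generated ones, so for longer programs you may want to increase it or pass `max_new_tokens` instead.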
+
+
+ ## 🧪 Evaluation and Fine-tuning
+ Please refer to [**AlchemistCoder**](https://github.com/InternLM/AlchemistCoder) and [**InternLM**](https://github.com/InternLM/InternLM/tree/main).
+
+ ## 😃 Acknowledgments
+ *AlchemistCoder* is built with [**InternLM**](https://github.com/InternLM) and [**OpenCompass**](https://github.com/open-compass). Thanks for their awesome work!
+
+ ## 📧 Contact
+ If you have any questions, please create an issue on this repository or contact us at:
+ - sugger@tongji.edu.cn
+ - zhangwenwei@pjlab.org.cn
+
+ ## 🌟 Citation
+ If you find our work useful, please consider citing:
+
+ ```bibtex
+
+ ```