Text Generation
Transformers
GGUF
English
Chinese
autoquant
Inference Endpoints
conversational
Volko76 committed on
Commit
2c3d5ce
·
verified ·
1 Parent(s): 6029374

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+opencoder-1.5b-instruct.Q2_K.gguf filter=lfs diff=lfs merge=lfs -text
+opencoder-1.5b-instruct.bf16.gguf filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,141 @@
---
license: other
license_name: inf
license_link: https://huggingface.co/infly/OpenCoder-1.5B-Instruct/blob/main/LICENSE
language:
- en
- zh
base_model:
- infly/OpenCoder-1.5B-Base
pipeline_tag: text-generation
library_name: transformers
datasets:
- OpenCoder-LLM/opencoder-sft-stage1
- OpenCoder-LLM/opencoder-sft-stage2
tags:
- autoquant
- gguf
---

<div align="center">
  <img src="https://github.com/OpenCoder-llm/opencoder-llm.github.io/blob/main/static/images/opencoder_icon.jpg?raw=true" width="50%" alt="OpenCoder-Icon" />
</div>

<p align="center">
  <!-- <a href="https://arxiv.org/pdf/2411.04905"><b>Paper Link</b>👁️</a> -->
  🏠 <a href="https://opencoder-llm.github.io/">Home Page</a>&nbsp;&nbsp;|
  &nbsp;&nbsp;🤗 <a href="https://huggingface.co/collections/infly/opencoder-672cec44bbb86c39910fb55e">Model</a>&nbsp;&nbsp;|
  &nbsp;&nbsp;📊 <a href="https://huggingface.co/collections/OpenCoder-LLM/opencoder-datasets-672e6db6a0fed24bd69ef1c2">Dataset</a>&nbsp;&nbsp;|
  &nbsp;&nbsp;📄 <a href="https://arxiv.org/abs/2411.04905">Paper</a>&nbsp;&nbsp;|
  &nbsp;&nbsp;🚀 <a href="https://huggingface.co/spaces/OpenCoder-LLM/OpenCoder-1.5B-Instruct">Demo</a>
</p>

## 1. Introduction

**OpenCoder** is an open and reproducible code LLM family that includes 1.5B and 8B base and chat models, supporting both English and Chinese. OpenCoder is pretrained from scratch on 2.5 trillion tokens (90% raw code, 10% code-related web data) and supervised fine-tuned on over 4.5 million high-quality SFT examples, reaching the performance of top-tier code LLMs. We provide not only the model weights and inference code, but also the reproducible training data, the complete data-processing pipeline, rigorous ablation results, and detailed training protocols, giving researchers an open foundation for advancing code AI.

- **Complete Open Source**: OpenCoder ensures full transparency by releasing not only the model weights and forthcoming inference code but also the complete data-cleaning code used for training. The release includes high-quality synthetic data, an extensive set of checkpoints, and over 4.5 million supervised fine-tuning (SFT) entries, making OpenCoder one of the most comprehensively open-sourced models available.
- **Comprehensive Experimental Analysis**: OpenCoder is rigorously validated through extensive ablation studies of data-cleaning strategies and training processes, including file-level and repository-level deduplication experiments.
- **High-Quality Synthetic Data**: OpenCoder provides a fully developed synthetic data generation pipeline and over 4.5 million SFT entries, establishing a robust data foundation for model training and evaluation.
- **Exceptional Performance**: OpenCoder achieves strong results across multiple code benchmarks, positioning it among the leading open-source code models.

## 2. Models

| Model                   | Sequence Length | Download                                                                |
|:-----------------------:|:---------------:|:-----------------------------------------------------------------------:|
| OpenCoder-1.5B-Base     | 4K              | 🤗 [HuggingFace](https://huggingface.co/infly/OpenCoder-1.5B-Base)      |
| OpenCoder-8B-Base       | 8K              | 🤗 [HuggingFace](https://huggingface.co/infly/OpenCoder-8B-Base)        |
| OpenCoder-1.5B-Instruct | 4K              | 🤗 [HuggingFace](https://huggingface.co/infly/OpenCoder-1.5B-Instruct)  |
| OpenCoder-8B-Instruct   | 8K              | 🤗 [HuggingFace](https://huggingface.co/infly/OpenCoder-8B-Instruct)    |

## 3. Datasets

### Pre-training

| Dataset             | Size   | Download                                                                             |
|:-------------------:|:------:|:-------------------------------------------------------------------------------------:|
| fineweb-code-corpus | 148 GB | 🤗 [HuggingFace](https://huggingface.co/datasets/OpenCoder-LLM/fineweb-code-corpus)  |
| fineweb-math-corpus | 10 GB  | 🤗 [HuggingFace](https://huggingface.co/datasets/OpenCoder-LLM/fineweb-math-corpus)  |

### Post-training

| Dataset              | Examples | Download                                                                               |
|:--------------------:|:--------:|:----------------------------------------------------------------------------------------:|
| opencoder-sft-stage1 | 4.21 M   | 🤗 [HuggingFace](https://huggingface.co/datasets/OpenCoder-LLM/opencoder-sft-stage1)   |
| opencoder-sft-stage2 | 375 K    | 🤗 [HuggingFace](https://huggingface.co/datasets/OpenCoder-LLM/opencoder-sft-stage2)   |

**This is not the end: we are still organizing the remaining data and will upload it progressively.**

## 4. Benchmarks

**Note:** For detailed evaluation results, please refer to [our paper](https://arxiv.org/pdf/2411.04905). Scores in parentheses are on the EvalPlus (+) variants of HumanEval and MBPP.

<!-- ### Base Model -->
<!-- | Model             | OpenCoder-1.5B-Base | OpenCoder-8B-Base |
|:-----------------:|:-------------------:|:-----------------:|
| HumanEval (+)     | 54.3 (49.4)         | 66.5 (63.4)       |
| MBPP (+)          | 70.6 (58.7)         | 79.9 (70.4)       |
| BigCodeBench      | 24.5                | 40.5              |
| BigCodeBench-Hard | 5.4                 | 9.5               | -->

<!-- ### Chat Model -->
| Model             | OpenCoder-1.5B-Instruct | OpenCoder-8B-Instruct |
|:-----------------:|:-----------------------:|:---------------------:|
| HumanEval (+)     | 72.5 (67.7)             | 83.5 (78.7)           |
| MBPP (+)          | 72.7 (61.9)             | 79.1 (69.0)           |
| BigCodeBench      | 33.3                    | 40.3                  |
| BigCodeBench-Hard | 11.5                    | 16.9                  |
| LiveCodeBench     | 12.8                    | 23.2                  |
| MultiPL-E (AVG)   | 57.5                    | 71.0                  |

## 5. Inference

### Inference with Hugging Face Transformers

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "infly/OpenCoder-1.5B-Instruct"

# Load the model in bfloat16 and let Accelerate place it on the available device(s).
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

messages = [
    {"role": "user", "content": "write a quick sort algorithm in python."}
]

# Apply the chat template and move the prompt tokens to the model's device.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Greedy decoding; raise max_new_tokens for longer completions.
outputs = model.generate(inputs, max_new_tokens=512, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
result = tokenizer.decode(outputs[0][len(inputs[0]):], skip_special_tokens=True)
print(result)
```
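
### Inference with llama.cpp (GGUF)

This repository also ships GGUF quantizations of the model (Q2_K and bf16). The sketch below is not part of the upstream model card: it shows one way to run the Q2_K file with the `llama-cpp-python` bindings, assuming the GGUF from this commit has been downloaded locally; parameters such as `n_gpu_layers` are illustrative.

```python
from llama_cpp import Llama

# Assumption: the Q2_K GGUF from this commit sits in the current directory.
llm = Llama(
    model_path="opencoder-1.5b-instruct.Q2_K.gguf",
    n_ctx=4096,       # OpenCoder-1.5B uses a 4K sequence length
    n_gpu_layers=-1,  # offload all layers to the GPU if available; 0 = CPU only
)

# create_chat_completion applies the chat template stored in the GGUF metadata.
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "write a quick sort algorithm in python."}],
    max_tokens=512,
    temperature=0.0,  # mirror the greedy decoding used above
)
print(response["choices"][0]["message"]["content"])
```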

<!-- ### Inference with vLLM (recommended) -->

## 6. License

The OpenCoder series (including Base and Chat models) supports commercial applications under a permissive [License](https://huggingface.co/infly/OpenCoder-1.5B-Instruct/blob/main/LICENSE).

## 7. Citation

```
@inproceedings{Huang2024OpenCoderTO,
  title={OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models},
  author={Siming Huang and Tianhao Cheng and Jason Klein Liu and Jiaran Hao and Liuyihan Song and Yang Xu and J. Yang and J. H. Liu and Chenchen Zhang and Linzheng Chai and Ruifeng Yuan and Zhaoxiang Zhang and Jie Fu and Qian Liu and Ge Zhang and Zili Wang and Yuan Qi and Yinghui Xu and Wei Chu},
  year={2024},
  url={https://arxiv.org/pdf/2411.04905}
}
```
opencoder-1.5b-instruct.Q2_K.gguf ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:d13a225c9c69585efdb815462c9676aec6d4f2b6e6d23b69efe2e437280a34ca
size 1138904512
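
The LFS pointer stores only the blob's sha256 digest and byte size; the actual ~1.1 GB file lives in LFS storage. A minimal sketch (not part of the commit) for verifying a downloaded copy against the digest recorded above:

```python
import hashlib

def sha256sum(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in 1 MiB chunks so multi-GB GGUF files never sit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Expected value copied from the LFS pointer above.
expected = "d13a225c9c69585efdb815462c9676aec6d4f2b6e6d23b69efe2e437280a34ca"
assert sha256sum("opencoder-1.5b-instruct.Q2_K.gguf") == expected
```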
opencoder-1.5b-instruct.bf16.gguf ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:60d3c2f1514f87b46270ee80128b2095e9084c60247a26176ae379d0346af4fb
size 3813751232
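
Per the commit message, the folder was uploaded with `huggingface_hub`, and the same library can fetch the quantized files. A sketch, with the caveat that the `repo_id` below is a hypothetical placeholder (the actual repository id is not shown on this page):

```python
from huggingface_hub import hf_hub_download

# repo_id is hypothetical -- replace it with the repository this commit belongs to.
local_path = hf_hub_download(
    repo_id="Volko76/opencoder-1.5b-instruct-gguf",
    filename="opencoder-1.5b-instruct.Q2_K.gguf",
)
print(local_path)  # cached GGUF path, ready to pass to llama.cpp
```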