teknium committed on
Commit fa60a9a
1 Parent(s): 5e393c2

Update README.md

Files changed (1)
  1. README.md +34 -128
README.md CHANGED
@@ -2,8 +2,14 @@
  license: cc-by-sa-4.0
  datasets:
  - bigcode/the-stack-dedup
  tags:
  - code
  language:
  - code
  programming_language:
@@ -27,152 +33,52 @@ programming_language:
  - Jupyter Notebook
  - R
  - Shell
- model-index:
- - name: replit-code-v1-3b
-   results:
-   - task:
-       name: Code Generation
-       type: code-generation
-     dataset:
-       name: "HumanEval"
-       type: openai_humaneval
-     metrics:
-     - name: pass@1
-       type: pass@1
-       value: 0.219
-       verified: false
  ---

- # replit-code-v1-3b
- Developed by: Replit, Inc.
-
- [**🧑‍💻 Test it on our Demo Space! 🧑‍💻**](https://huggingface.co/spaces/replit/replit-code-v1-3b-demo)
-
- ## Model Description
- `replit-code-v1-3b` is a 2.7B Causal Language Model focused on **Code Completion**. The model has been trained on a subset of the [Stack Dedup v1.2 dataset](https://arxiv.org/abs/2211.15533).
-
- The training mixture includes **20 different languages**, listed here in descending order of number of tokens:
- <br/>
- `Markdown`, `Java`, `JavaScript`, `Python`, `TypeScript`, `PHP`, `SQL`, `JSX`, `reStructuredText`, `Rust`, `C`, `CSS`, `Go`, `C++`, `HTML`, `Vue`, `Ruby`, `Jupyter Notebook`, `R`, `Shell`
- <br/>
- In total, the training dataset contains 175B tokens, which were repeated over 3 epochs -- in total, `replit-code-v1-3b` has been trained on **525B** tokens (~195 tokens per parameter).
-
- The model has been trained on the [MosaicML](https://www.mosaicml.com/) platform with 256 x A100-40GB GPUs, leveraging their latest [LLM examples repo](https://github.com/mosaicml/examples/tree/release/v0.0.4/examples/llm).
- <br/>
- `replit-code-v1-3b` is powered by state-of-the-art LLM techniques, such as:
- [Flash Attention](https://arxiv.org/abs/2205.14135) for fast training and inference,
- [AliBi positional embeddings](https://arxiv.org/abs/2108.12409) to support variable context length at inference time,
- [LionW optimizer](https://arxiv.org/abs/2302.06675),
- etc.
-
- ## Intended Use
- Replit intends this model be used by anyone as a foundational model for application-specific fine-tuning without strict limitations on commercial use.
-
- ## Limitations
- The pre-training dataset may have contained offensive or inappropriate content even after applying data cleansing filters, and such content may be reflected in model generated text. We recommend that users exercise reasonable caution when using in production systems. Do not use for any applications that may cause harm or distress to individuals or groups.
-
- ## License
- The model checkpoint and vocabulary file are licensed under the Creative Commons license (CC BY-SA-4.0). Under the license, you must give credit to Replit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests that Replit endorses you or your use.
-
- The source code files (`*.py`) are licensed under the Apache 2.0 license.
-
- ## Contact
- For questions and comments about the model, please post in the community section.
-
- ## How to Use
- First of all, you need to install the latest versions of the following dependencies:
- ```
- einops
- sentencepiece
- torch
- transformers
  ```
 
 
- You can then load the model as follows:
- ```python
- from transformers import AutoModelForCausalLM
-
- # load model
- model = AutoModelForCausalLM.from_pretrained('replit/replit-code-v1-3b', trust_remote_code=True)
  ```

- To use the optimized Triton implementation of FlashAttention on GPUs with BF16 precision, first install the following dependencies:
- ```
- flash-attn==0.2.8
- triton==2.0.0.dev20221202
- ```
-
- Then, move the model to `bfloat16` and use it as follows:
- ```python
- import torch
- from transformers import AutoModelForCausalLM
-
- # load model
- model = AutoModelForCausalLM.from_pretrained('replit/replit-code-v1-3b', trust_remote_code=True, attn_impl='triton')
- model.to(device='cuda:0', dtype=torch.bfloat16)
-
- # forward pass
- x = torch.tensor([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
- x = x.to(device='cuda:0')
- y = model(x)
  ```

- Note that `trust_remote_code=True` is passed to the `from_pretrained` method because ReplitLM is not a class in the
- [Transformers](https://huggingface.co/docs/transformers/index) library.
-
- ### Tokenizer
-
- We have trained a custom SentencePiece Unigram tokenizer with a 32768-token vocabulary optimized specifically for code.
-
- Note that using this requires the `sentencepiece` library to be installed.
-
- The tokenizer can be used as follows:
-
- ```python
- from transformers import AutoTokenizer
-
- # load tokenizer
- tokenizer = AutoTokenizer.from_pretrained('replit/replit-code-v1-3b', trust_remote_code=True)
-
- # single input encoding + generation
- x = tokenizer.encode('def hello():\n print("hello world")\n', return_tensors='pt')
- y = model.generate(x)
-
- # decoding, clean_up_tokenization_spaces=False to ensure syntactical correctness
- generated_code = tokenizer.decode(y[0], skip_special_tokens=True, clean_up_tokenization_spaces=False)
- print(generated_code)
  ```

- Note that:
- - `trust_remote_code=True` is passed to the `from_pretrained` method because ReplitLM is not a class in the [Transformers](https://huggingface.co/docs/transformers/index) library.
- - `clean_up_tokenization_spaces=False` is meant to avoid removing spaces in the output, because that would affect the syntactical correctness of the generated code.
-
- ### Generation
-
- You can generate code using the `transformers` library as follows:
-
- ```python
- from transformers import AutoModelForCausalLM, AutoTokenizer
-
- tokenizer = AutoTokenizer.from_pretrained('replit/replit-code-v1-3b', trust_remote_code=True)
- model = AutoModelForCausalLM.from_pretrained('replit/replit-code-v1-3b', trust_remote_code=True)
-
- x = tokenizer.encode('def fibonacci(n): ', return_tensors='pt')
- y = model.generate(x, max_length=100, do_sample=True, top_p=0.95, top_k=4, temperature=0.2, num_return_sequences=1, eos_token_id=tokenizer.eos_token_id)
-
- # decoding, clean_up_tokenization_spaces=False to ensure syntactical correctness
- generated_code = tokenizer.decode(y[0], skip_special_tokens=True, clean_up_tokenization_spaces=False)
- print(generated_code)
  ```
-
- Experiment with different decoding methods and parameters to get the best results for your use case.
-
- ### Post Processing
-
- Note that as with all code generation models, post-processing of the generated code is important. In particular, the following post-processing steps are recommended:
- - stop generation when the EOS token is encountered
- - remove trailing whitespaces
- - set `max_tokens` to a reasonable value based on your completion use case
- - truncate generation to stop words such as `return`, `def`, "```", "`\n\n\n`" to avoid generating incomplete code when `max_tokens` is larger than the length of the expected generated code.
 
  license: cc-by-sa-4.0
  datasets:
  - bigcode/the-stack-dedup
+ - sahil2801/CodeAlpaca-20k
+ - teknium/GPTeacher-CodeInstruct
+ model-base:
+ - replit/replit-code-v1-3b
  tags:
  - code
+ - instruct
+ - self instruct
  language:
  - code
  programming_language:
 
  - Jupyter Notebook
  - R
  - Shell
  ---

+ Base Model: replit/replit-code-v1-3b

+ This model is fine-tuned on both Sahil2801's CodeAlpaca and Teknium's GPTeacher Code-Instruct datasets to give Replit's code model instruct capabilities.

+ Dataset links:
+ CodeAlpaca: https://huggingface.co/datasets/sahil2801/CodeAlpaca-20k
+ GPTeacher subset - Code Instruct: https://github.com/teknium1/GPTeacher

+ This model was trained on 2x A100 80GB GPUs for 1 hour on ~25,000 code instruction/response pairs in Alpaca format.

+ Refer to the base model's Hugging Face model card for the basic requirements to run it: https://huggingface.co/replit/replit-code-v1-3b
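
+ As a minimal loading sketch (the checkpoint path below is a placeholder for wherever you keep this fine-tune; per the base model card you need `einops`, `sentencepiece`, `torch` and `transformers` installed, and `trust_remote_code=True` because the Replit model class is not part of the Transformers library):

+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer

+ MODEL_PATH = "path/to/this-finetune"  # placeholder -- point this at the actual checkpoint

+ # trust_remote_code=True pulls in the custom ReplitLM model/tokenizer code
+ tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, trust_remote_code=True)

+ # optional: run on GPU in bfloat16, as suggested in the base model card
+ model.to(device="cuda:0", dtype=torch.bfloat16)
+ ```
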
+ This fine-tune can be prompted like any Alpaca fine-tune:

  ```
+ ### Instruction:
+ <prompt>

+ ### Input:
+ <additional context>

+ ### Response:
  ```

+ or

  ```
+ ### Instruction:
+ <prompt>

+ ### Response:
  ```
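
+ For example, a small helper for filling these templates (a hypothetical convenience function, not part of the released code) could look like:

+ ```python
+ # hypothetical helper that fills the Alpaca-style templates shown above
+ def build_prompt(instruction, context=None):
+     if context:
+         return f"### Instruction:\n{instruction}\n\n### Input:\n{context}\n\n### Response:\n"
+     return f"### Instruction:\n{instruction}\n\n### Response:\n"

+ prompt = build_prompt("Write a Python function that reverses a string.")
+ ```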

+ For me, this model produced coherent outputs with the following sampler settings, but feel free to experiment:
+ ```
+ max_new_tokens=128, do_sample=True, use_cache=True, temperature=0.2, top_p=0.9, eos_token_id=tokenizer.eos_token_id
+ ```

+ The tokenizer decode call also needs these settings:
+ ```
+ skip_special_tokens=True, clean_up_tokenization_spaces=False
+ ```
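
+ Putting those settings together, a minimal end-to-end sketch (checkpoint path and instruction are placeholders) might look like:

+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer

+ MODEL_PATH = "path/to/this-finetune"  # placeholder -- point this at the actual checkpoint
+ tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, trust_remote_code=True).to("cuda:0")

+ # Alpaca-style prompt, as described above
+ prompt = "### Instruction:\nWrite a Python function that reverses a string.\n\n### Response:\n"
+ x = tokenizer.encode(prompt, return_tensors="pt").to("cuda:0")

+ # sampler settings from above
+ y = model.generate(x, max_new_tokens=128, do_sample=True, use_cache=True,
+                    temperature=0.2, top_p=0.9, eos_token_id=tokenizer.eos_token_id)

+ # decode settings from above; clean_up_tokenization_spaces=False keeps code formatting intact
+ print(tokenizer.decode(y[0], skip_special_tokens=True, clean_up_tokenization_spaces=False))
+ ```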

+ The following parameters were used with the Hugging Face trainer to train the model:

  ```
+ --model_name_or_path replit/replit-code-v1-3b --data_path /root/stanford_alpaca/train.json --bf16 True --output_dir /root/stanford_alpaca/model_ckpts --num_train_epochs 3 --per_device_train_batch_size 4 --per_device_eval_batch_size 1 --gradient_accumulation_steps 8 --save_strategy steps --save_steps 200 --save_total_limit 3 --learning_rate 1e-5 --weight_decay 0. --warmup_ratio 0.03 --tf32 True --run_name Replit1
+ ```
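
+ For reference, the trainer-related flags above map roughly onto the following `transformers.TrainingArguments` (an approximate equivalent only; judging by the paths, the flags were passed to the Stanford Alpaca training script, where `--model_name_or_path` and `--data_path` are script-level arguments rather than trainer arguments):

+ ```python
+ from transformers import TrainingArguments

+ # approximate TrainingArguments equivalent of the CLI flags above
+ training_args = TrainingArguments(
+     output_dir="/root/stanford_alpaca/model_ckpts",
+     bf16=True,
+     tf32=True,
+     num_train_epochs=3,
+     per_device_train_batch_size=4,
+     per_device_eval_batch_size=1,
+     gradient_accumulation_steps=8,
+     save_strategy="steps",
+     save_steps=200,
+     save_total_limit=3,
+     learning_rate=1e-5,
+     weight_decay=0.0,
+     warmup_ratio=0.03,
+     run_name="Replit1",
+ )
+ ```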