Add better infilling documentation
README.md CHANGED
@@ -40,7 +40,7 @@ CarperAI will be releasing larger LMs better tuned for code in the near future,
| \\(n_{heads}\\) | 16 |
| \\(d_{head}\\) | 128 |
| \\(n_{ctx}\\) | 2048 |
- | \\(n_{vocab}\\) |
+ | \\(n_{vocab}\\) | 50280 |
| Positional Encoding | [Rotary Position Embedding (RoPE)](https://arxiv.org/abs/2104.09864)
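The corrected vocabulary size can be sanity-checked directly against the released checkpoint; a minimal sketch, assuming the `CarperAI/FIM-NeoX-1.3B` checkpoint used in the examples below:

```
from transformers import AutoTokenizer, AutoConfig

tok = AutoTokenizer.from_pretrained("CarperAI/FIM-NeoX-1.3B")
cfg = AutoConfig.from_pretrained("CarperAI/FIM-NeoX-1.3B")

print(len(tok))        # tokenizer vocabulary size, including the three sentinel tokens
print(cfg.vocab_size)  # vocabulary size reported by the model config (may be padded)
```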
@@ -105,27 +105,59 @@ language model output is generated after \<MID\> token!

As a concrete example, here is a code snippet that should allow a model to perform infilling:

There was an issue where the sentinel `<|SUF|>`, `<|PRE|>`, and `<|MID|>` tokens did not map to the correct IDs in the uploaded tokenizer and model card! Please clear the Hugging Face cache and re-download the model.
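If clearing the cache by hand is inconvenient, a fresh download can also be forced with the standard `force_download` argument of `from_pretrained`; a minimal sketch, using the checkpoint name from the examples below:

```
from transformers import AutoTokenizer, AutoModelForCausalLM

# force a re-download so the corrected sentinel-token mapping is picked up,
# bypassing any stale files in the local Hugging Face cache
model = AutoModelForCausalLM.from_pretrained("CarperAI/FIM-NeoX-1.3B", force_download=True)
tok = AutoTokenizer.from_pretrained("CarperAI/FIM-NeoX-1.3B", force_download=True)
```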

Here is a minimal example of performing open-ended generation with this model, on a simple function `score(x, y)`:

```
def score(x,y) -> int:
"""

```

and also infilling, with the rest of the function and the end of the docstring already in place:

```
def score(x,y) -> int:
"""
<|MID|> (infill here)
"""

    score = x + y
    return score
```

```
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model = AutoModelForCausalLM.from_pretrained("CarperAI/FIM-NeoX-1.3B")
tok = AutoTokenizer.from_pretrained("CarperAI/FIM-NeoX-1.3B")

# infilling demo
prefix = 'def score(x, y) -> int:\n"""\n'
suffix = '"""\n\n score = x + y\n return score'

# FIM prompt layout: <|SUF|> (50277), suffix tokens, <|PRE|> (50278), prefix tokens, <|MID|> (50279);
# the infilled middle is generated after the <|MID|> token
model_input = [50277, *tok(suffix)["input_ids"], 50278, *tok(prefix)["input_ids"], 50279]
output = tok.decode(model.generate(torch.IntTensor(model_input).unsqueeze(0), max_length=40)[0])

print(output)
```

outputs: `'<|SUF|>"""\n\n score = x + y\n return score<|PRE|>def score(x, y) -> int:\n"""\n<|MID|> score(x, y) -> int\n<|endoftext|>'`
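The decoded output keeps the suffix-prefix-middle ordering of the prompt, so to recover the document in reading order you can take whatever the model produced after the `<|MID|>` sentinel and splice it between the original prefix and suffix. A minimal post-processing sketch, assuming the `prefix`, `suffix`, and `output` variables from the snippet above:

```
# take everything generated after <|MID|>, stopping at <|endoftext|> if present
middle = output.split("<|MID|>", 1)[-1].split("<|endoftext|>", 1)[0]

# reassemble the document in reading order: prefix, infilled middle, suffix
infilled = prefix + middle + suffix
print(infilled)
```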

```
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model = AutoModelForCausalLM.from_pretrained("CarperAI/FIM-NeoX-1.3B")
tok = AutoTokenizer.from_pretrained("CarperAI/FIM-NeoX-1.3B")

# non-infilling demo: plain left-to-right generation from the prefix alone
prefix = 'def score(x, y) -> int:\n"""\n'
model_input = [*tok(prefix)["input_ids"]]
output = tok.decode(model.generate(torch.IntTensor(model_input).unsqueeze(0), max_length=100)[0])
print(output)
```

outputs: `'def score(x, y) -> int:\n"""\n Return the score of the given point.\n """\n return sum(x * y for x, y in zip(x_list, y_list))\n\ndef get_point_score(x, y) -> int:\n """\n Return the score of the given point.\n """\n return sum(x * y for x, y in zip(x_list, y'`

The sentinel tokens are now accessible via `tokenizer.decode(50277) = "<|SUF|>"`, `tokenizer.decode(50278) = "<|PRE|>"`, and `tokenizer.decode(50279) = "<|MID|>"`.
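Rather than hard-coding the IDs, the sentinel IDs can also be resolved from their string forms with the standard `convert_tokens_to_ids` helper; a small sketch, assuming the `tok`, `prefix`, and `suffix` variables from the infilling demo above:

```
# resolve the sentinel IDs from their string forms instead of hard-coding 50277-50279
suf_id = tok.convert_tokens_to_ids("<|SUF|>")
pre_id = tok.convert_tokens_to_ids("<|PRE|>")
mid_id = tok.convert_tokens_to_ids("<|MID|>")

model_input = [suf_id, *tok(suffix)["input_ids"], pre_id, *tok(prefix)["input_ids"], mid_id]
```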
## Intended Uses and Limitations