Commit bf8ac45 by lvwerra (parent: 26708d1)

Update README.md

Files changed (1)
  1. README.md +12 -6
README.md CHANGED
@@ -195,15 +195,18 @@ model-index:
  2. [Use](#use)
  3. [Limitations](#limitations)
  4. [Training](#training)
- 5. [Citation](#citation)
+ 5. [License](#license)
+ 6. [Citation](#citation)
 
  # Model Summary
 
- The SantaCoder models are a series of 1B parameter models trained on Python, Java, and JavaScript. They were trained on datasets with different filter parameters and with architecture and objective variations. The main model uses multi-query attention, was trained using near-deduplication and commnent-to-code ratio as filtering criteria and using the Fill-in-the-Middle objective.
+ The SantaCoder models are a series of 1B parameter models trained on the Python, Java, and JavaScript subset of [The Stack (v1.1)](https://huggingface.co/datasets/bigcode/the-stack) (which excluded opt-out requests).
+ The main model uses multi-query attention, was trained using near-deduplication and comment-to-code ratio as filtering criteria and using the Fill-in-the-Middle objective.
+ In addition there are several models that were trained on datasets with different filter parameters and with architecture and objective variations.
 
  - **Repository:** [bigcode/Megatron-LM](https://github.com/bigcode-project/Megatron-LM)
- - **Project Website:** [bigcode-project.org]www.bigcode-project.org)
- - **Paper:** [Coming soon]()
+ - **Project Website:** [bigcode-project.org](https://www.bigcode-project.org)
+ - **Paper:** [🎅SantaCoder: Don't reach for the stars!🌟]()
  - **Point of Contact:** [contact@bigcode-project.org](mailto:contact@bigcode-project.org)
  - **Languages:** Python, Java, and JavaScript
 
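The Fill-in-the-Middle objective mentioned in the updated summary lets the model generate the code that belongs between a given prefix and suffix. Below is a minimal sketch of such a prompt with `transformers`; the checkpoint name and the sentinel tokens (`<fim-prefix>`, `<fim-suffix>`, `<fim-middle>`) are assumptions here and should be verified against the model's tokenizer.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint name; adjust to the actual SantaCoder repository.
checkpoint = "bigcode/santacoder"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# trust_remote_code=True because the multi-query attention variant ships custom modeling code.
model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True)

# Fill-in-the-Middle prompt: the model generates the span between the prefix and the suffix.
# The sentinel tokens below are assumptions; check the tokenizer's special tokens.
prompt = (
    "<fim-prefix>def print_one_two_three():\n"
    "    print('one')\n"
    "    <fim-suffix>\n"
    "    print('three')<fim-middle>"
)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(inputs["input_ids"], max_new_tokens=16)
print(tokenizer.decode(outputs[0]))
```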
 
@@ -224,7 +227,8 @@ The `dedup-alt-comments` model is the best performing model and was trained twice
 
  ## Intended use
 
-
+ The model was trained on GitHub code. As such it is _not_ an instruction model and commands like "Write a function that computes the square root." do not work well.
+ You should phrase commands like they occur in source code such as comments (e.g. `# the following function computes the sqrt`) or write a function signature and docstring and let the model complete the function body.
 
  **Feel free to share your generations in the Community tab!**
 
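The added "Intended use" guidance, phrasing the request as source code rather than as an instruction, can be tried with a short completion script. This is a sketch under the assumption that the model loads through `transformers` as in the card's usage example; the checkpoint name and generation settings are illustrative, not part of the commit.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/santacoder"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True)

# Rather than an instruction ("Write a function that computes the square root."),
# give the model a signature and docstring and let it complete the body.
prompt = (
    "def newton_sqrt(x: float) -> float:\n"
    '    """Compute the square root of x with Newton\'s method."""\n'
)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(inputs["input_ids"], max_new_tokens=48)
print(tokenizer.decode(outputs[0]))
```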
 
@@ -269,7 +273,7 @@ model = AutoModelForCausalLM.from_pretrained(
 
  ### Attribution
 
- The pretraining dataset of the model was filtered for permissive licenses only. Nevertheless, the model can generate source code verbatim from the dataset which requires attribution. We provide a [search index](TODO) that let's you search through the pretraining data to identify where generated code came from and apply the proper attribution to your code.
+ The pretraining dataset of the model was filtered for permissive licenses only. Nevertheless, the model can generate source code verbatim from the dataset which requires attribution. We provide a [search index](https://huggingface.co/spaces/bigcode/santacoder-search) that lets you search through the pretraining data to identify where generated code came from and apply the proper attribution to your code.
 
  # Limitations
 
 
@@ -296,6 +300,8 @@ The model has been trained on source code in Python, Java, and JavaScript. The p
  - **Neural networks:** [PyTorch](https://github.com/pytorch/pytorch)
  - **FP16 if applicable:** [apex](https://github.com/NVIDIA/apex)
 
+ # License
+ The model is licensed under the CodeML Open RAIL-M v0.1 license. You can find the full license [here](https://huggingface.co/spaces/bigcode/license).
 
  # Citation
  **TODO**
 