---
language:
- code
license: apache-2.0
tags:
- code
- gpt2
- generation
datasets:
- "codeparrot/github-code-clean"
- "openai_humaneval"
metrics:
- "evaluate-metric/code_eval"
---

# CodeParrot-Multi 🦜 (small)

CodeParrot-Multi 🦜 is a GPT-2 model (110M parameters) trained to generate code in 32 programming languages (Python, Java, C, JavaScript, ...).

## Usage

You can load the CodeParrot-Multi model and tokenizer directly in `transformers`:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("codeparrot/codeparrot-small-multi")
model = AutoModelForCausalLM.from_pretrained("codeparrot/codeparrot-small-multi")

inputs = tokenizer("def hello_world():", return_tensors="pt")
outputs = model(**inputs)  # forward pass; returns logits rather than generated text
```
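
To actually produce code from the prompt, you would typically call `generate` on the model and decode the result. A minimal sketch continuing from the snippet above (the sampling settings are illustrative, not values recommended by this card):

```python
generated = model.generate(
    inputs["input_ids"],
    max_new_tokens=64,                    # length of the sampled completion
    do_sample=True,
    temperature=0.2,
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token; reuse EOS
)
print(tokenizer.decode(generated[0]))
```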

or with a `pipeline`:

```python
from transformers import pipeline

pipe = pipeline("text-generation", model="codeparrot/codeparrot-small-multi")
outputs = pipe("def hello_world():")
```
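
The pipeline forwards generation keyword arguments to `generate`, so sampling can be controlled in the same call (the values below are only an example):

```python
outputs = pipe("def hello_world():", max_new_tokens=64, do_sample=True, temperature=0.2)
print(outputs[0]["generated_text"])
```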

## Training

The model was trained on the cleaned [GitHub code dataset](https://huggingface.co/datasets/codeparrot/github-code-clean) with the following settings:

|Config|Value|
|-------|-----|
|Batch size| 192 |
|Context size| 1024 |
|Training steps| 300,000 |
|Gradient accumulation| 2 |
|Gradient checkpointing| False |
|Learning rate| 5e-4 |
|Weight decay| 0.1 |
|Warmup steps| 2000 |
|Schedule| Cosine |

The training was executed on 16 x A100 (40GB) GPUs. This setting amounts to roughly 58 billion tokens.
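
Assuming the batch size of 192 sequences already accounts for gradient accumulation (an assumption, not stated explicitly above), the token count follows directly from the table:

```python
batch_size = 192        # sequences per optimizer step (assumed to include accumulation)
context_size = 1024     # tokens per sequence
training_steps = 300_000

total_tokens = batch_size * context_size * training_steps
print(f"{total_tokens / 1e9:.1f}B tokens")  # ~59B, consistent with the figure above
```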

## Performance

We evaluated the model on OpenAI's [HumanEval](https://huggingface.co/datasets/openai_humaneval) benchmark, which consists of programming challenges:

| Metric | Value |
|-------|-----|
|pass@1 | --% |
|pass@10 | --% |
|pass@100 | --% |

The [pass@k metric](https://huggingface.co/metrics/code_eval) is the probability that at least one out of k generations passes the tests.
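
The metadata above points to the `evaluate-metric/code_eval` implementation of this metric. A minimal sketch of how it is typically computed with the `evaluate` library (the candidate solutions and test below are made up for illustration):

```python
import os

# code_eval executes model-generated code, so it must be enabled explicitly
os.environ["HF_ALLOW_CODE_EVAL"] = "1"

from evaluate import load

code_eval = load("code_eval")

# one problem with two candidate completions and one unit test
candidates = [["def add(a, b):\n    return a + b", "def add(a, b):\n    return a - b"]]
tests = ["assert add(2, 3) == 5"]

pass_at_k, results = code_eval.compute(references=tests, predictions=candidates, k=[1, 2])
print(pass_at_k)  # e.g. {'pass@1': 0.5, 'pass@2': 1.0}
```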

## Resources

- Code: [repository](https://github.com/huggingface/transformers/tree/master/examples/research_projects/codeparrot)