---
datasets:
- code_search_net
widget:
- text: "def <mask> ( a, b ) : if a > b : return a else return b</s>return the maximum value"
- text: "def <mask> ( a, b ) : if a > b : return a else return b"
---

# Model Architecture

This model follows the distilroberta-base architecture and was initialized from the distilroberta-base checkpoint.

# Pre-training phase

This model was pre-trained with the MLM objective (`mlm_probability=0.15`).

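As a point of reference only, masking at this probability is typically configured through a data collator; the following is a minimal sketch assuming the standard 🤗 Transformers `DataCollatorForLanguageModeling`, not the exact training script used for this model:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Minimal sketch: random token masking at the probability stated above
tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
```
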
During this phase, the inputs had two formats. One is the following:
$$\left[[CLS], t_1, \dots, t_n, [SEP], w_1, \dots, w_m, [EOS]\right]$$
where $t_1, \dots, t_n$ are the code tokens and $w_1, \dots, w_m$ are the natural language description tokens. More concretely, this is the snippet that tokenizes the input:
```python
def tokenize_function_bimodal(examples, tokenizer, max_len):
    # Join the pre-tokenized code and docstring tokens back into plain strings
    codes = [' '.join(example) for example in examples['func_code_tokens']]
    nls = [' '.join(example) for example in examples['func_documentation_tokens']]
    # Encode each (code, natural language) pair as a single sequence pair
    pairs = [[c, nl] for c, nl in zip(codes, nls)]
    return tokenizer(pairs, max_length=max_len, padding="max_length", truncation=True)
```
The other format is:
$$\left[[CLS], t_1, \dots, t_n, [EOS]\right]$$
where $t_1, \dots, t_n$ are the code tokens. More concretely, this is the snippet that tokenizes the input:

```python
def tokenize_function_unimodal(examples, tokenizer, max_len, tokens_column):
    # Join the pre-tokenized code tokens back into plain strings and encode them on their own
    codes = [' '.join(example) for example in examples[tokens_column]]
    return tokenizer(codes, max_length=max_len, padding="max_length", truncation=True)
```

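As a rough, hypothetical sketch of how these two functions could be applied to the CodeSearchNet Python split with `datasets.map` (the column names and `max_len` mirror the snippets above; everything else is an assumption, not the original preprocessing script):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Hypothetical sketch: apply the two tokenization functions above to CodeSearchNet (Python)
tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
dataset = load_dataset("code_search_net", "python", split="train")

# Bimodal view: code paired with its natural language description
bimodal = dataset.map(
    lambda examples: tokenize_function_bimodal(examples, tokenizer, max_len=512),
    batched=True,
    remove_columns=dataset.column_names,
)

# Unimodal view: code tokens only
unimodal = dataset.map(
    lambda examples: tokenize_function_unimodal(
        examples, tokenizer, max_len=512, tokens_column="func_code_tokens"
    ),
    batched=True,
    remove_columns=dataset.column_names,
)
```
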
# Training details

- Max length: 512
- Effective batch size: 64
- Total steps: 140000
- Learning rate: 5e-4

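For orientation only, these hyperparameters could be expressed with 🤗 Transformers `TrainingArguments` roughly as in the sketch below; the split of the effective batch size into per-device batch size and gradient accumulation, and every setting not listed above, are assumptions rather than the original configuration:

```python
from transformers import TrainingArguments

# Hypothetical sketch matching the values listed above; anything not listed
# (batch-size split, scheduler, warmup, saving/logging cadence) is an assumption.
training_args = TrainingArguments(
    output_dir="distilroberta-base-csn-python-unimodal-bimodal",
    max_steps=140_000,              # Total steps: 140000
    learning_rate=5e-4,             # Learning rate: 5e-4
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,  # 16 * 4 = effective batch size of 64
    save_steps=10_000,
    logging_steps=500,
)
```
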
# Usage

```python
from pprint import pprint
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

model = AutoModelForMaskedLM.from_pretrained('antolin/distilroberta-base-csn-python-unimodal-bimodal')
tokenizer = AutoTokenizer.from_pretrained('antolin/distilroberta-base-csn-python-unimodal-bimodal')
mask_filler = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# Bimodal input: code tokens, then the separator token, then the natural language description
code_tokens = ["def", "<mask>", "(", "a", ",", "b", ")", ":", "if", "a", ">", "b", ":", "return", "a", "else", "return", "b"]
nl_tokens = ["return", "the", "maximum", "value"]
input_text = ' '.join(code_tokens) + tokenizer.sep_token + ' '.join(nl_tokens)
pprint(mask_filler(input_text, top_k=5))
```
```shell
[{'score': 0.7177600860595703,
  'sequence': 'def maximum ( a, b ) : if a > b : return a else return breturn '
              'the maximum value',
  'token': 4532,
  'token_str': ' maximum'},
 {'score': 0.22075247764587402,
  'sequence': 'def max ( a, b ) : if a > b : return a else return breturn the '
              'maximum value',
  'token': 19220,
  'token_str': ' max'},
 {'score': 0.015111264772713184,
  'sequence': 'def minimum ( a, b ) : if a > b : return a else return breturn '
              'the maximum value',
  'token': 3527,
  'token_str': ' minimum'},
 {'score': 0.007394665852189064,
  'sequence': 'def min ( a, b ) : if a > b : return a else return breturn the '
              'maximum value',
  'token': 5251,
  'token_str': ' min'},
 {'score': 0.004020793363451958,
  'sequence': 'def length ( a, b ) : if a > b : return a else return breturn '
              'the maximum value',
  'token': 5933,
  'token_str': ' length'}]
```
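
The unimodal format (code only, with no natural language description) can be queried the same way; the snippet below is an illustrative variation of the example above, not additional documented output:

```python
# Unimodal query: the same code tokens without the natural language description
pprint(mask_filler(' '.join(code_tokens), top_k=5))
```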