
Model Architecture

This model follows the distilroberta-base architecture and was initialized from the distilroberta-base checkpoint.

Pre-training phase

This model was pre-trained with the masked language modeling (MLM) objective (mlm_probability=0.15).
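
As a minimal sketch, this is how such masking is typically configured with the transformers DataCollatorForLanguageModeling; the exact training script is not part of this card, so the snippet is illustrative:

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")

# Randomly masks 15% of the input tokens and sets the labels so that the
# model is trained to reconstruct the original tokens.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
)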

During this phase, the inputs had the following format: $\left[[CLS], t_1, \dots, t_n, [SEP], w_1, \dots, w_m, [EOS]\right]$, where $t_1, \dots, t_n$ are the code tokens and $w_1, \dots, w_m$ are the natural language description tokens. More concretely, this is the snippet that tokenizes the input:

def tokenize_function_bimodal(examples, tokenizer, max_len):
    # Join the pre-tokenized code and docstring tokens back into strings.
    codes = [' '.join(example) for example in examples['func_code_tokens']]
    nls = [' '.join(example) for example in examples['func_documentation_tokens']]
    # Pair each code snippet with its description so the tokenizer inserts
    # the separator token between the two segments.
    pairs = [[c, nl] for c, nl in zip(codes, nls)]
    return tokenizer(pairs, max_length=max_len, padding="max_length", truncation=True)
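
Presumably, this function was applied in batched mode over the CodeSearchNet Python split. The exact preprocessing pipeline is not included in this card, so the following is only a sketch:

from functools import partial
from datasets import load_dataset

dataset = load_dataset("code_search_net", "python", split="train")

# tokenize_function_bimodal operates on batches of examples, hence batched=True.
tokenized = dataset.map(
    partial(tokenize_function_bimodal, tokenizer=tokenizer, max_len=512),
    batched=True,
    remove_columns=dataset.column_names,
)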

Training details

  • Max length: 512
  • Effective batch size: 64
  • Total steps: 60000
  • Learning rate: 5e-4
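
For illustration, these hyperparameters map onto a Trainer configuration roughly as follows. This is a sketch, not the actual training script; in particular, the split of the effective batch size of 64 into per-device batch size and gradient accumulation steps is an assumption.

from transformers import AutoModelForMaskedLM, Trainer, TrainingArguments

# The card states the model was initialized from the distilroberta-base checkpoint.
model = AutoModelForMaskedLM.from_pretrained("distilroberta-base")

training_args = TrainingArguments(
    output_dir="distilroberta-base-csn-python-bimodal",
    max_steps=60_000,
    learning_rate=5e-4,
    per_device_train_batch_size=32,
    gradient_accumulation_steps=2,  # 32 * 2 = effective batch size of 64
)

# tokenized and data_collator are defined in the snippets above.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    data_collator=data_collator,
)
trainer.train()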

Usage

from pprint import pprint
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

model = AutoModelForMaskedLM.from_pretrained('antolin/distilroberta-base-csn-python-bimodal')
tokenizer = AutoTokenizer.from_pretrained('antolin/distilroberta-base-csn-python-bimodal')
mask_filler = pipeline("fill-mask", model=model, tokenizer=tokenizer)

code_tokens = ["def", "<mask>", "(", "a", ",", "b", ")", ":", "if", "a", ">", "b", ":", "return", "a", "else", "return", "b"]
nl_tokens = ["return", "the", "maximum", "value"]

# Join code and description with the separator token, mirroring the
# bimodal pre-training input format.
input_text = ' '.join(code_tokens) + tokenizer.sep_token + ' '.join(nl_tokens)
pprint(mask_filler(input_text, top_k=5))
[{'score': 0.4645618796348572,
  'sequence': 'def max ( a, b ) : if a > b : return a else return b  return '
              'the maximum value',
  'token': 19220,
  'token_str': ' max'},
 {'score': 0.40963634848594666,
  'sequence': 'def maximum ( a, b ) : if a > b : return a else return b  '
              'return the maximum value',
  'token': 4532,
  'token_str': ' maximum'},
 {'score': 0.02103462442755699,
  'sequence': 'def min ( a, b ) : if a > b : return a else return b  return '
              'the maximum value',
  'token': 5251,
  'token_str': ' min'},
 {'score': 0.014217409305274487,
  'sequence': 'def value ( a, b ) : if a > b : return a else return b  return '
              'the maximum value',
  'token': 923,
  'token_str': ' value'},
 {'score': 0.010762304067611694,
  'sequence': 'def minimum ( a, b ) : if a > b : return a else return b  '
              'return the maximum value',
  'token': 3527,
  'token_str': ' minimum'}]