
CodeBERTaPy

CodeBERTaPy is a RoBERTa-like model trained on the CodeSearchNet dataset from GitHub, for Python, by Manuel Romero.

The tokenizer is a Byte-level BPE tokenizer trained on the corpus using Hugging Face tokenizers.

Because it is trained on a corpus of code (vs. natural language), it encodes that corpus efficiently: the tokenized sequences are 33% to 50% shorter than the same corpus tokenized by gpt2/roberta.
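
As a rough sketch, a Byte-level BPE tokenizer of this kind can be trained with the Hugging Face tokenizers library roughly as follows. The file path, vocabulary size, and minimum frequency below are illustrative assumptions, not the exact values used for this model:

from tokenizers import ByteLevelBPETokenizer

# Hypothetical local dump of the Python split of CodeSearchNet,
# one code snippet per line
paths = ["codesearchnet_python_train.txt"]

tokenizer = ByteLevelBPETokenizer()

# vocab_size and min_frequency are assumed values for illustration
tokenizer.train(
    files=paths,
    vocab_size=52_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

tokenizer.save_model(".")  # writes vocab.json and merges.txt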

The (small) model is a 6-layer, 84M-parameter, RoBERTa-like Transformer model (the same number of layers and heads as DistilBERT), initialized with the default initialization settings and trained from scratch on the full Python corpus for 4 epochs.
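
A minimal sketch of such a configuration in transformers, assuming a ~52k BPE vocabulary; the exact hyperparameters of this model are not documented here, so treat these values as assumptions:

from transformers import RobertaConfig, RobertaForMaskedLM

# 6 layers / 12 heads: the same depth and head count as DistilBERT
config = RobertaConfig(
    vocab_size=52_000,          # assumed, matches the tokenizer sketch above
    max_position_embeddings=514,
    num_hidden_layers=6,
    num_attention_heads=12,
    type_vocab_size=1,
)

# Fresh weights with the default initialization, then trained from scratch
model = RobertaForMaskedLM(config=config)
print(model.num_parameters())  # on the order of 84M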

Quick start: masked language modeling prediction

PYTHON_CODE = """
fruits = ['apples', 'bananas', 'oranges']
for idx, <mask> in enumerate(fruits):
  print("index is %d and value is %s" % (idx, val))
""".lstrip()

Does the model know how to complete simple Python code?

from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="mrm8488/CodeBERTaPy",
    tokenizer="mrm8488/CodeBERTaPy"
)

fill_mask(PYTHON_CODE)

## Top 5 predictions:

'val' # prob  0.980728805065155
'value'
'idx'
',val'
'_'
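
The fill-mask pipeline returns a list of dicts, each with the standard score, token, token_str, and sequence fields, so the probability shown above for 'val' can be read for every candidate:

for pred in fill_mask(PYTHON_CODE):
    print(f"{pred['token_str']!r:10} {pred['score']:.4f}")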

Yes! That was easy 🎉 Let's try a Flask-like example

PYTHON_CODE2 = """
@app.route('/<name>')
def hello_name(name):
    return "Hello {}!".format(<mask>)

if __name__ == '__main__':
    app.run()
""".lstrip()


fill_mask(PYTHON_CODE2)

## Top 5 predictions:

'name' # prob  0.9961813688278198
' name'
'url'
'description'
'self'

Yeah! It works 🎉 Let's try a TensorFlow/Keras-like example

PYTHON_CODE3 = """
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.<mask>(128, activation='relu'),
    keras.layers.Dense(10, activation='softmax')
])
""".lstrip()


fill_mask(PYTHON_CODE3)

## Top 5 predictions:

'Dense' # prob   0.4482928514480591
'relu'
'Flatten'
'Activation'
'Conv'

Great! 🎉

This work is heavily inspired by CodeBERTa from the Hugging Face team.


CodeSearchNet citation

@article{husain_codesearchnet_2019,
    title = {{CodeSearchNet} {Challenge}: {Evaluating} the {State} of {Semantic} {Code} {Search}},
    shorttitle = {{CodeSearchNet} {Challenge}},
    url = {http://arxiv.org/abs/1909.09436},
    urldate = {2020-03-12},
    journal = {arXiv:1909.09436 [cs, stat]},
    author = {Husain, Hamel and Wu, Ho-Hsiang and Gazit, Tiferet and Allamanis, Miltiadis and Brockschmidt, Marc},
    month = sep,
    year = {2019},
    note = {arXiv: 1909.09436},
}

Created by Manuel Romero/@mrm8488

Made with ♥ in Spain
