---
language: code
thumbnail: https://cdn-media.huggingface.co/CodeBERTa/CodeBERTa.png
datasets:
- code_search_net
---

This is an *unofficial* reupload of [huggingface/CodeBERTa-small-v1](https://huggingface.co/huggingface/CodeBERTa-small-v1) in the `SafeTensors` format using `transformers` `4.41.1`. The goal of this reupload is to prevent older models that are still relevant baselines from becoming stale as a result of changes in the Hugging Face ecosystem. Additionally, I may include minor corrections, such as the model's maximum length configuration.

Original model card below:

---

# CodeBERTa

CodeBERTa is a RoBERTa-like model trained on the [CodeSearchNet](https://github.blog/2019-09-26-introducing-the-codesearchnet-challenge/) dataset from GitHub.

Supported languages:

```shell
"go"
"java"
"javascript"
"php"
"python"
"ruby"
```

The **tokenizer** is a Byte-level BPE tokenizer trained on the corpus using Hugging Face `tokenizers`.

Because it is trained on a corpus of code (vs. natural language), it encodes the corpus efficiently: sequences are between 33% and 50% shorter than the same corpus tokenized by gpt2/roberta.
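One way to check the sequence-length claim on your own snippets is sketched below. It is an illustration, not the measurement used for this card: the helper names `relative_shortening` and `compare_tokenizers` are hypothetical, and `roberta-base` is assumed as the natural-language baseline.

```python
def relative_shortening(baseline_len: int, candidate_len: int) -> float:
    """Fraction by which the candidate sequence is shorter than the baseline."""
    if baseline_len <= 0:
        raise ValueError("baseline_len must be positive")
    return 1.0 - candidate_len / baseline_len


def compare_tokenizers(code: str, baseline: str = "roberta-base") -> float:
    """Compare token counts for `code` (downloads both tokenizers on first use)."""
    # Imported lazily so relative_shortening works without transformers installed.
    from transformers import AutoTokenizer

    code_tokenizer = AutoTokenizer.from_pretrained("huggingface/CodeBERTa-small-v1")
    base_tokenizer = AutoTokenizer.from_pretrained(baseline)  # assumed baseline
    code_len = len(code_tokenizer.tokenize(code))
    base_len = len(base_tokenizer.tokenize(code))
    return relative_shortening(base_len, code_len)
```

Calling `compare_tokenizers("def add(a, b):\n    return a + b\n")` returns the fraction saved on that snippet. Results on a single short function can differ noticeably from the corpus-level figure quoted above.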

The (small) **model** is a 6-layer, 84M-parameter, RoBERTa-like Transformer model (the same number of layers and heads as DistilBERT), initialized from the default initialization settings and trained from scratch on the full corpus (~2M functions) for 5 epochs.
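As a rough sanity check on the parameter count, the arithmetic can be sketched in plain Python. Only the layer count comes from this card. The vocabulary size, hidden size, feed-forward size, and position count below are assumptions (RoBERTa-base-style values) and should be verified against the model's `config.json`.

```python
# Assumed dimensions (NOT stated in this card); verify against config.json.
VOCAB_SIZE = 52_000
HIDDEN = 768
INTERMEDIATE = 3_072
MAX_POSITIONS = 514
TYPE_VOCAB = 1
NUM_LAYERS = 6  # from the card


def linear(n_in: int, n_out: int) -> int:
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out


def layer_norm(n: int) -> int:
    """Scale and shift vectors of a LayerNorm."""
    return 2 * n


def embedding_params() -> int:
    # Word, position, and token-type embedding tables, then a LayerNorm.
    tables = (VOCAB_SIZE + MAX_POSITIONS + TYPE_VOCAB) * HIDDEN
    return tables + layer_norm(HIDDEN)


def encoder_layer_params() -> int:
    # Self-attention: query, key, value, and output projections.
    attention = 4 * linear(HIDDEN, HIDDEN) + layer_norm(HIDDEN)
    # Feed-forward block: expand to INTERMEDIATE, project back to HIDDEN.
    feed_forward = (
        linear(HIDDEN, INTERMEDIATE)
        + linear(INTERMEDIATE, HIDDEN)
        + layer_norm(HIDDEN)
    )
    return attention + feed_forward


def lm_head_params() -> int:
    # Dense + LayerNorm + output bias; decoder weights are tied to the
    # word embeddings, so they are not counted a second time.
    return linear(HIDDEN, HIDDEN) + layer_norm(HIDDEN) + VOCAB_SIZE


total = embedding_params() + NUM_LAYERS * encoder_layer_params() + lm_head_params()
print(f"{total:,} parameters (~{total / 1e6:.0f}M)")
```

Under these assumptions the total rounds to the 84M quoted above, with roughly half of the parameters sitting in the embedding table rather than in the six Transformer layers.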

### Tensorboard for this training ⤵️

[![tb](https://cdn-media.huggingface.co/CodeBERTa/tensorboard.png)](https://tensorboard.dev/experiment/irRI7jXGQlqmlxXS0I07ew/#scalars)

## Quick start: masked language modeling prediction

```python
# Raw string so the PHP namespace backslash is not treated as a Python escape.
PHP_CODE = r"""
public static <mask> set(string $key, $value) {
	if (!in_array($key, self::$allowedKeys)) {
		throw new \InvalidArgumentException('Invalid key given');
	}
	self::$storedValues[$key] = $value;
}
""".lstrip()
```

### Does the model know how to complete simple PHP code?

```python
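from transformers import pipeline

# Build a fill-mask pipeline with the CodeBERTa model and its tokenizer.
fill_mask = pipeline(
    "fill-mask",
    model="huggingface/CodeBERTa-small-v1",
    tokenizer="huggingface/CodeBERTa-small-v1",
)

# Returns a list of candidate completions for the <mask> token, best first.
fill_mask(PHP_CODE)

# Top 5 predictions:
#
# ' function'  (prob 0.9999827146530151)
# 'function'
# ' void'
# ' def'
# ' final'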

### Yes! That was easy 🎉 What about some Python (warning: this is going to be meta)

```python
PYTHON_CODE = """
def pipeline(
    task: str,
    model: Optional = None,
    framework: Optional[<mask>] = None,
    **kwargs
) -> Pipeline:
	pass
""".lstrip()
```

Results:
```python
'framework', 'Framework', ' framework', 'None', 'str'
```

> This program can auto-complete itself! 😱

### Just for fun, let's try to mask natural language (not code):

```python
fill_mask("My name is <mask>.")

# {'sequence': '<s> My name is undefined.</s>', 'score': 0.2548016905784607, 'token': 3353}
# {'sequence': '<s> My name is required.</s>', 'score': 0.07290805131196976, 'token': 2371}
# {'sequence': '<s> My name is null.</s>', 'score': 0.06323737651109695, 'token': 469}
# {'sequence': '<s> My name is name.</s>', 'score': 0.021919190883636475, 'token': 652}
# {'sequence': '<s> My name is disabled.</s>', 'score': 0.019681859761476517, 'token': 7434}
```
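The raw dictionaries above can be condensed into a ranked summary. The sketch below is a minimal illustration: `strip_special_tokens` and `summarize` are hypothetical helper names, and the input reuses two of the predictions printed above rather than calling the model.

```python
from typing import List


def strip_special_tokens(sequence: str) -> str:
    """Remove the <s> ... </s> wrapper seen in the output above."""
    return sequence.replace("<s>", "").replace("</s>", "").strip()


def summarize(predictions: List[dict]) -> List[str]:
    """Format fill-mask predictions as 'score  text', highest score first."""
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [
        f"{p['score']:.1%}  {strip_special_tokens(p['sequence'])}" for p in ranked
    ]


# Two of the predictions shown above, copied verbatim.
predictions = [
    {"sequence": "<s> My name is null.</s>", "score": 0.06323737651109695, "token": 469},
    {"sequence": "<s> My name is undefined.</s>", "score": 0.2548016905784607, "token": 3353},
]

for line in summarize(predictions):
    print(line)
```

If your `transformers` version already omits `<s>` and `</s>` from `sequence`, `strip_special_tokens` leaves the text unchanged.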

This (kind of) works because code contains comments (which contain natural language).

Of course, the most frequent name for a computer scientist must be undefined 🤓.


## Downstream task: [programming language identification](https://huggingface.co/huggingface/CodeBERTa-language-id)

See the model card for **[`huggingface/CodeBERTa-language-id`](https://huggingface.co/huggingface/CodeBERTa-language-id)** 🤯.

<br>

## CodeSearchNet citation

<details>

```bibtex
@article{husain_codesearchnet_2019,
	title = {{CodeSearchNet} {Challenge}: {Evaluating} the {State} of {Semantic} {Code} {Search}},
	shorttitle = {{CodeSearchNet} {Challenge}},
	url = {http://arxiv.org/abs/1909.09436},
	urldate = {2020-03-12},
	journal = {arXiv:1909.09436 [cs, stat]},
	author = {Husain, Hamel and Wu, Ho-Hsiang and Gazit, Tiferet and Allamanis, Miltiadis and Brockschmidt, Marc},
	month = sep,
	year = {2019},
	note = {arXiv: 1909.09436},
}
```

</details>