# CodeColBERT

This model serves as the base for our semantic code retrieval system SELMA. It can be used for indexing and retrieval via the PyTerrier bindings for ColBERT.
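
A minimal sketch of such a pipeline, assuming the [pyterrier_colbert](https://github.com/terrierteam/pyterrier_colbert) package, is shown below; the checkpoint path, index location, and toy corpus are placeholders, and the exact class and argument names should be checked against the pyterrier_colbert documentation.

```python
import pyterrier as pt

pt.init()

from pyterrier_colbert.indexing import ColBERTIndexer
from pyterrier_colbert.ranking import ColBERTFactory

# Toy corpus: in practice this would be an iterator over code snippets.
code_snippets = [
    {"docno": "0", "text": "def add(a, b):\n    return a + b"},
    {"docno": "1", "text": "def read_json(path):\n    import json\n    return json.load(open(path))"},
]

# Build a ColBERT index over the code snippets using the CodeColBERT checkpoint.
indexer = ColBERTIndexer(
    "path/to/codecolbert/checkpoint.dnn",  # placeholder checkpoint path
    "/path/to/index_root",                 # placeholder index location
    "codesearchnet_index",                 # placeholder index name
    chunksize=3,
)
indexer.index(iter(code_snippets))

# Dense end-to-end retrieval: natural-language query -> ranked code snippets.
factory = ColBERTFactory(
    "path/to/codecolbert/checkpoint.dnn",
    "/path/to/index_root",
    "codesearchnet_index",
)
retriever = factory.end_to_end()
results = retriever.search("read a json file into a dictionary")
print(results[["docno", "score"]].head())
```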
## Training Details

This model was trained for code retrieval. It uses CodeBERT as its base model and was trained with the official ColBERTv2 code ([Github](https://github.com/stanford-futuredata/ColBERT)).
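
As an illustration of what this looks like with the ColBERTv2 codebase, a training run could be launched roughly as follows; the file names, batch size, and other hyperparameters are placeholders rather than the exact configuration used for this model, and the expected triples format should be verified against the ColBERT documentation.

```python
from colbert import Trainer
from colbert.infra import Run, RunConfig, ColBERTConfig

if __name__ == "__main__":
    # Single-GPU run; "codecolbert" is just an illustrative experiment name.
    with Run().context(RunConfig(nranks=1, experiment="codecolbert")):
        config = ColBERTConfig(
            bsize=32,          # placeholder batch size
            maxsteps=400_000,  # total number of training steps
            root="experiments",
        )
        trainer = Trainer(
            triples="triples.tsv",        # (query, positive, negative) training triples
            queries="queries.tsv",        # query id -> documentation string
            collection="collection.tsv",  # passage id -> code snippet
            config=config,
        )
        # Initialise from the CodeBERT checkpoint instead of the default BERT weights.
        trainer.train(checkpoint="microsoft/codebert-base")
```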

Our data source is the [CodeSearchNet Challenge](https://github.com/github/CodeSearchNet). Training ColBERT requires triples of queries, positive examples, and negative examples. As queries, we used the documentation provided for each sample in the CodeSearchNet data set, while its code snippet serves as the positive example. Negative examples were sampled randomly from the corpus. In total, we train for 400,000 steps.
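
A sketch of how such triples could be assembled from the CodeSearchNet jsonl files is given below; the `docstring` and `code` fields come from the CodeSearchNet release, while the shard file name, output format, and sampling details are assumptions to adapt.

```python
import gzip
import json
import random


def load_samples(path):
    """Read one CodeSearchNet .jsonl.gz shard into (documentation, code) pairs."""
    samples = []
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            samples.append((record["docstring"], record["code"]))
    return samples


def build_triples(samples, out_path, seed=42):
    """Write (query, positive, negative) triples: the docstring is the query,
    its own code snippet the positive, and a random other snippet the negative."""
    rng = random.Random(seed)
    with open(out_path, "w", encoding="utf-8") as out:
        for doc, code in samples:
            negative = rng.choice(samples)[1]
            while negative == code:
                negative = rng.choice(samples)[1]
            # Flatten whitespace so each field stays on a single tab-separated line.
            query = " ".join(doc.split())
            pos = " ".join(code.split())
            neg = " ".join(negative.split())
            out.write(f"{query}\t{pos}\t{neg}\n")


if __name__ == "__main__":
    samples = load_samples("python_train_0.jsonl.gz")  # placeholder shard name
    build_triples(samples, "triples.tsv")
```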