# CodeColBERT
This model serves as the base for our semantic code retrieval system SELMA. It can be used for indexing and retrieval via the PyTerrier bindings for ColBERT, as sketched below.
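
A minimal sketch of how indexing and retrieval could look with the `pyterrier_colbert` bindings; the checkpoint path, index location, and toy corpus are placeholders, not part of this repository.

```python
import pyterrier as pt
if not pt.started():
    pt.init()

from pyterrier_colbert.indexing import ColBERTIndexer
from pyterrier_colbert.ranking import ColBERTFactory

checkpoint = "path/to/codecolbert/checkpoint"   # placeholder: this model's checkpoint
index_root, index_name = "./indices", "codesearchnet"

# Toy corpus: an iterable of {"docno": ..., "text": ...} dicts over code snippets.
code_snippets = [
    "def read_csv(path):\n    import pandas as pd\n    return pd.read_csv(path)",
    "def parse_json(s):\n    import json\n    return json.loads(s)",
]
docs = ({"docno": str(i), "text": snippet} for i, snippet in enumerate(code_snippets))

# Build the dense ColBERT index.
indexer = ColBERTIndexer(checkpoint, index_root, index_name, chunksize=3)
indexer.index(docs)

# Retrieve with natural-language queries over the index.
factory = ColBERTFactory(checkpoint, index_root, index_name)
retriever = factory.end_to_end()
results = retriever.search("read a csv file into a dataframe")
```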
## Training Details
The model was trained for code retrieval, using CodeBERT as the base model and the official ColBERTv2 code
([GitHub](https://github.com/stanford-futuredata/ColBERT)).
Our data source is the [CodeSearchNet Challenge](https://github.com/github/CodeSearchNet).
Training ColBERT requires triples of a query, a positive example, and a negative example. As queries, we used the documentation
provided for each sample in the CodeSearchNet dataset, while its code snippet serves as the positive example. Negative examples were
sampled randomly from the corpus (see the sketch below). In total, we trained for 400,000 steps.
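
A minimal sketch of how such triples could be assembled from a CodeSearchNet JSONL file; the field names (`docstring`, `code`) and the tab-separated output format are assumptions, not the exact preprocessing used for this model.

```python
import json
import random

def build_triples(jsonl_path, out_path, seed=42):
    """Write (query, positive, negative) text triples, one per line, tab-separated."""
    random.seed(seed)
    with open(jsonl_path) as f:
        samples = [json.loads(line) for line in f]
    with open(out_path, "w") as out:
        for sample in samples:
            query = " ".join(sample["docstring"].split())    # documentation as the query
            positive = " ".join(sample["code"].split())      # its code snippet as the positive
            neg_sample = random.choice(samples)              # random corpus snippet as the negative
            while neg_sample is sample:
                neg_sample = random.choice(samples)
            negative = " ".join(neg_sample["code"].split())
            out.write(f"{query}\t{positive}\t{negative}\n")
```

Depending on the ColBERT version used, these raw-text triples may need to be converted into an ID-based form that references separate query and collection files before training.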