metadata
library_name: transformers
datasets:
- NamCyan/tesoro-code
base_model:
- bigcode/starcoder2-3b
Improving the detection of technical debt in Java source code with an enriched dataset
Model Details
Model Description
This model is part of the Tesoro project and is used to detect technical debt in source code. More information can be found on the Tesoro homepage.
- Developed by: Nam Hai Le
- Model type: Decoder-based PLMs
- Language(s): Java
- Finetuned from model: StarCoder2
Model Sources
- Repository: Tesoro
- Paper: [To be updated]
How to Get Started with the Model
Use the code below to get started with the model.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the tokenizer and the finetuned sequence-classification checkpoint from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("NamCyan/starcoder2-3b-technical-debt-code-tesoro")
model = AutoModelForSequenceClassification.from_pretrained("NamCyan/starcoder2-3b-technical-debt-code-tesoro")
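The snippet above only loads the tokenizer and the model. A minimal sketch of running a prediction on one Java snippet might look like the following. This card does not document the label set or whether the classification head is single-label or multi-label, so the sketch reads both from `model.config` (`id2label` and `problem_type`). The helper names `decode_logits` and `classify`, the 0.5 threshold, and `max_length=512` are assumptions made for illustration, not settings confirmed by the authors.

```python
import math


def decode_logits(logits, id2label, multi_label=False, threshold=0.5):
    """Turn raw logits into a list of (label, probability) pairs.

    Assumption: multi-label heads use a per-class sigmoid with a fixed
    threshold; single-label heads use softmax and keep the top class.
    """
    if multi_label:
        probs = [1.0 / (1.0 + math.exp(-x)) for x in logits]
        return [(id2label[i], p) for i, p in enumerate(probs) if p >= threshold]
    # Numerically stable softmax: subtract the maximum logit first.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    best = max(range(len(logits)), key=lambda i: exps[i])
    return [(id2label[best], exps[best] / total)]


def classify(code, model_name="NamCyan/starcoder2-3b-technical-debt-code-tesoro"):
    """Classify one source-code string with the finetuned checkpoint."""
    # Heavy imports are deferred so decode_logits can be used on its own.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()

    # A single input needs no padding; max_length=512 is an assumed limit.
    inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()

    multi_label = model.config.problem_type == "multi_label_classification"
    return decode_logits(logits, model.config.id2label, multi_label)
```

Calling `classify(java_source)` with a Java method as a string downloads the checkpoint on first use and returns the predicted label or labels with their probabilities. If the checkpoint's config does not define custom label names, the labels will appear as generic `LABEL_0`, `LABEL_1`, and so on.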
Training Details
Training Data: The model is finetuned on the tesoro-code dataset.
Infrastructure: Training was conducted on two NVIDIA A100 GPUs, each with 80 GB of VRAM.
Leaderboard
Model | Model size | EM | F1 |
---|---|---|---|
Encoder-based PLMs | | | |
CodeBERT | 125M | 38.28 | 43.47 |
UniXCoder | 125M | 38.12 | 42.58 |
GraphCodeBERT | 125M | 39.38 | 44.21 |
RoBERTa | 125M | 35.37 | 38.22 |
ALBERT | 11.8M | 39.32 | 41.99 |
Encoder-Decoder-based PLMs | | | |
PLBART | 140M | 36.85 | 39.90 |
CodeT5 | 220M | 32.66 | 35.41 |
CodeT5+ | 220M | 37.91 | 41.96 |
Decoder-based PLMs (LLMs) | | | |
TinyLlama | 1.03B | 37.05 | 40.05 |
DeepSeek-Coder | 1.28B | 42.52 | 46.19 |
OpenCodeInterpreter | 1.35B | 38.16 | 41.76 |
phi-2 | 2.78B | 37.92 | 41.57 |
starcoder2 | 3.03B | 35.37 | 41.77 |
CodeLlama | 6.74B | 34.14 | 38.16 |
Magicoder | 6.74B | 39.14 | 42.49 |
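
This card does not define the EM and F1 columns. One common reading for a classification task where a sample may carry several labels is exact match (the predicted label set equals the reference set) and micro-averaged F1. The sketch below illustrates that reading only. The multi-label setting, the micro averaging, the function names, and the example label names are all assumptions, and the authors' evaluation script may differ.

```python
from typing import List, Set


def exact_match(y_true: List[Set[str]], y_pred: List[Set[str]]) -> float:
    """Percentage of samples whose predicted label set equals the reference set."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        return 0.0
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return 100.0 * hits / len(y_true)


def micro_f1(y_true: List[Set[str]], y_pred: List[Set[str]]) -> float:
    """Micro-averaged F1 (in percent) pooled over all samples and labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = sum(len(t & p) for t, p in zip(y_true, y_pred))  # labels correctly predicted
    fp = sum(len(p - t) for t, p in zip(y_true, y_pred))  # predicted but not in reference
    fn = sum(len(t - p) for t, p in zip(y_true, y_pred))  # in reference but missed
    denom = 2 * tp + fp + fn
    return 100.0 * 2 * tp / denom if denom else 0.0


# Hypothetical label names, used only to exercise the functions.
reference = [{"DESIGN"}, {"DEFECT", "TEST"}, set()]
predicted = [{"DESIGN"}, {"DEFECT"}, set()]
em_score = exact_match(reference, predicted)
f1_score = micro_f1(reference, predicted)
```

In this toy example two of the three samples match exactly, while the second sample still earns partial credit under F1 because one of its two reference labels was predicted.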
Citing us
@article{nam2024tesoro,
title={Improving the detection of technical debt in Java source code with an enriched dataset},
author={Hai, Nam Le and Bui, Anh M. T. and Nguyen, Phuong T. and Di Ruscio, Davide and Kazman, Rick},
journal={},
year={2024}
}