
🇹🇷 Turkish BERT Model for Software Engineering

This repository was created as part of a computer engineering undergraduate graduation project.

This research is an exploratory case study on determining the functional size of user requirements and use cases for software projects. To perform this task, we created two models: SE-BERT and SE-BERTurk.

You can find a detailed description of the project at the link.

SE-BERTurk

SE-BERTurk is a BERT model trained for domain adaptation in a software engineering context.

We applied Masked Language Modeling (MLM), an unsupervised learning technique, for domain adaptation. MLM enhances the model's understanding of domain-specific language by masking portions of the input text and training the model to predict the masked words from the surrounding context.
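
As a rough illustration of how this masking works with the transformers library (the base checkpoint name and the example sentence below are assumptions for illustration, not the project's actual training code):

# Illustrative MLM masking sketch; the checkpoint and sentence are assumptions.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")  # assumed Turkish base model

# The collator randomly selects ~15% of the tokens, replaces most of them with
# [MASK], and writes the original ids into "labels" so the model is trained to
# recover them from the surrounding context.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

batch = collator([tokenizer("Sistem, kullanıcı girişini doğrulamalıdır.")])
print(tokenizer.decode(batch["input_ids"][0]))  # masked positions (if any) show as [MASK]
print(batch["labels"][0])                       # -100 everywhere except the masked positions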

Stats

Created a bilingual SE corpus (166 MB) ➡️ Descriptive stats of the corpus

  • 166K entries = 886K sentences = 10M words
  • 156K training entries + 10K test entries
  • Each entry has a maximum length of 512 tokens

The final training corpus has a size of 166 MB and contains 10,554,750 words.
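
A sketch of how documents can be capped at the 512-token entry length (the helper name, checkpoint, and the 510-token budget that leaves room for [CLS] and [SEP] are our own illustrative choices):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")  # assumed base tokenizer

def split_into_entries(text, max_tokens=510):
    """Split a document into chunks that stay within the 512-token entry limit
    once [CLS] and [SEP] are added (hypothetical helper, for illustration)."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    chunks = [ids[i:i + max_tokens] for i in range(0, len(ids), max_tokens)]
    return [tokenizer.decode(chunk) for chunk in chunks]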

MLM Training (Domain Adaptation)

Used the AdamW optimizer with num_epochs = 1, lr = 2e-5, and eps = 1e-8 (see the configuration sketch after the list below)

  • For a T4 GPU ➡️ Set batch_size = 6 (13.5 GB memory)
  • For an A100 GPU ➡️ Set batch_size = 50 (37 GB memory) and fp16 = True
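
A configuration sketch matching these settings (the checkpoint id, corpus file path, and output directory are placeholders; the project's actual training script may differ):

from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "dbmdz/bert-base-turkish-cased"  # assumed Turkish base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForMaskedLM.from_pretrained(base)

dataset = load_dataset("text", data_files={"train": "se_corpus_train.txt"})  # hypothetical file
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True)

args = TrainingArguments(
    output_dir="se-berturk-mlm",
    num_train_epochs=1,
    learning_rate=2e-5,
    adam_epsilon=1e-8,               # Trainer uses AdamW by default
    per_device_train_batch_size=50,  # 50 on an A100; 6 fits a 16 GB T4
    fp16=True,                       # mixed precision, as on the A100 run
)

Trainer(model=model, args=args,
        train_dataset=tokenized["train"],
        data_collator=collator).train()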

Perplexity

  • 3.665 PPL for SE-BERTurk

Evaluation Steps:

  1. Calculate PPL (perplexity) on the test corpus (10K entries with a maximum length of 512 tokens each); a sketch follows this list
  2. Calculate PPL (perplexity) on the requirement datasets
  3. Evaluate performance on downstream tasks:
  • For size measurement ➡️ MAE, MSE, MMRE, PRED(30), ACC
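
A sketch of the PPL calculation in step 1, taking perplexity as the exponential of the average masked-LM loss over the held-out entries (the test-file path is a placeholder, and we assume the published checkpoint includes the MLM head):

import math
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

name = "burakkececi/bert-turkish-software-engineering"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForMaskedLM.from_pretrained(name)

test = load_dataset("text", data_files={"test": "se_corpus_test.txt"})["test"]  # hypothetical file
tokenized = test.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ppl-eval", per_device_eval_batch_size=8),
    eval_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=True),
)
print(f"PPL: {math.exp(trainer.evaluate()['eval_loss']):.3f}")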

Usage

With Transformers >= 2.11, our SE-BERTurk model can be loaded like this:

from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("burakkececi/bert-turkish-software-engineering")
model = AutoModel.from_pretrained("burakkececi/bert-turkish-software-engineering")
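
For a quick sanity check, the model can also be queried through the fill-mask pipeline (the Turkish example sentence is illustrative, and this assumes the checkpoint ships with its MLM head):

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="burakkececi/bert-turkish-software-engineering")
print(fill_mask("Sistem, kullanıcı [MASK] doğrulamalıdır."))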

Hugging Face model hub

All models are available on the Hugging Face model hub.
