---
license: mit
language:
- tr
library_name: transformers
---

# 🇹🇷 Turkish BERT Model for Software Engineering

This repository was created as part of a computer engineering undergraduate graduation project. The research performs an exploratory case study to determine the functional size of user requirements or use cases for software projects. To perform this task, we created two models, [SE-BERT](https://huggingface.co/burakkececi/bert-software-engineering) and SE-BERTurk.

You can find a detailed description of the project at this [link](https://github.com/burakkececi/software-size-estimation-nlp).

# SE-BERTurk

SE-BERTurk is a BERT model trained for domain adaptation in a software engineering context. We applied Masked Language Modeling (MLM), an unsupervised learning technique, for domain adaptation. MLM enhances the model's understanding of domain-specific language by masking portions of the input text and training the model to predict the masked words from the surrounding context.

## Stats

Created a bilingual [SE corpus](https://drive.google.com/file/d/1IgnJTaR2-pe889TdQZtYF8SKOH92mi1l/view?usp=drive_link) (166 MB) ➡️ [Descriptive stats of the corpus](https://docs.google.com/spreadsheets/d/1Xnn_xfu4tdCtWg-nQ8ce_LHe9F-g0BSmUxzTdi5g1r4/edit?usp=sharing)

* 166K entries = 886K sentences = 10M words
* 156K training entries + 10K test entries
* Each entry has a maximum length of 512 tokens

The final training corpus has a size of 166 MB and 10,554,750 words.

## MLM Training (Domain Adaptation)

Used the ``AdamW`` optimizer and set ``num_epochs = 1``, ``lr = 2e-5``, ``eps = 1e-8``.

* For a T4 GPU ➡️ set ``batch_size = 6`` (13.5 GB memory)
* For an A100 GPU ➡️ set ``batch_size = 50`` (37 GB memory) and ``fp16 = True``

**Perplexity**

* ``3.665`` PPL for SE-BERTurk

### Evaluation Steps

1) Calculate ``PPL`` (perplexity) on the test corpus (10K contexts with a maximum length of 512 tokens)
2) Calculate ``PPL`` (perplexity) on the requirement datasets
3) Evaluate performance on downstream tasks:
   * For size measurement ➡️ ``MAE``, ``MSE``, ``MMRE``, ``PRED(30)``, ``ACC``

## Usage

With Transformers >= 2.11, our SE-BERTurk uncased model can be loaded like this:

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("burakkececi/bert-turkish-software-engineering/tokenizer")
model = AutoModel.from_pretrained("burakkececi/bert-turkish-software-engineering/model")
```

# Huggingface model hub

All models are available on the [Huggingface model hub](https://huggingface.co/burakkececi).
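
## Example: MLM Domain Adaptation (Sketch)

The snippet below is a minimal sketch of the MLM domain-adaptation setup described above (``AdamW``, ``num_epochs = 1``, ``lr = 2e-5``, ``eps = 1e-8``, A100 batch size with ``fp16``). The base checkpoint (``dbmdz/bert-base-turkish-uncased``), the corpus file names, and the 15% masking rate are assumptions for illustration, not the exact training script.

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Hypothetical file names; the SE corpus is linked in the Stats section.
raw = load_dataset("text", data_files={"train": "se_corpus_train.txt",
                                       "test": "se_corpus_test.txt"})

base = "dbmdz/bert-base-turkish-uncased"  # assumed BERTurk base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForMaskedLM.from_pretrained(base)

def tokenize(batch):
    # Each entry is truncated to the 512-token maximum used in the corpus.
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

# Randomly masks tokens on the fly; 15% is the standard BERT rate (assumed here).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="se-berturk-mlm",
    num_train_epochs=1,              # num_epochs = 1
    learning_rate=2e-5,              # lr = 2e-5
    adam_epsilon=1e-8,               # eps = 1e-8 (AdamW)
    per_device_train_batch_size=50,  # A100 setting; use 6 on a T4
    fp16=True,                       # A100 setting
)

trainer = Trainer(model=model, args=args,
                  train_dataset=tokenized["train"],
                  eval_dataset=tokenized["test"],
                  data_collator=collator)
trainer.train()
```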
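
## Example: Perplexity Evaluation (Sketch)

A sketch of evaluation step 1: computing MLM (pseudo-)perplexity as the exponential of the mean masked-LM loss over the test corpus. The bare repo id and the test-file name are assumptions; adjust them to the actual repository layout.

```python
import math
import torch
from datasets import load_dataset
from torch.utils.data import DataLoader
from transformers import AutoModelForMaskedLM, AutoTokenizer, DataCollatorForLanguageModeling

device = "cuda" if torch.cuda.is_available() else "cpu"

repo = "burakkececi/bert-turkish-software-engineering"  # assumed repo id/layout
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForMaskedLM.from_pretrained(repo).to(device).eval()

# Hypothetical test-file name (10K entries, max 512 tokens each).
test = load_dataset("text", data_files={"test": "se_corpus_test.txt"})["test"]
test = test.map(lambda b: tokenizer(b["text"], truncation=True, max_length=512),
                batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
loader = DataLoader(test, batch_size=8, collate_fn=collator)

losses = []
with torch.no_grad():
    for batch in loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        out = model(**batch)          # loss = mean cross-entropy over masked tokens
        losses.append(out.loss.item())

ppl = math.exp(sum(losses) / len(losses))  # pseudo-perplexity under random masking
print(f"PPL: {ppl:.3f}")
```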
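
## Example: Size-Measurement Metrics (Sketch)

For evaluation step 3, the downstream size-measurement metrics named above can be computed as below. The definitions follow common size/effort-estimation conventions (MMRE as mean magnitude of relative error, PRED(30) as the share of estimates within 30% of the actual value) and are assumptions, not taken from the project code.

```python
import numpy as np

def size_metrics(y_true, y_pred):
    """Compute MAE, MSE, MMRE, and PRED(30) for size estimates (assumed definitions)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mae = np.mean(np.abs(y_true - y_pred))
    mse = np.mean((y_true - y_pred) ** 2)
    mre = np.abs(y_true - y_pred) / y_true   # magnitude of relative error per item
    mmre = mre.mean()
    pred30 = np.mean(mre <= 0.30)            # PRED(30): fraction with MRE <= 30%
    return {"MAE": mae, "MSE": mse, "MMRE": mmre, "PRED(30)": pred30}

# Toy example with made-up sizes.
print(size_metrics(y_true=[10, 20, 30], y_pred=[12, 18, 33]))
```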