--- language: tr --- # Turkish Text Classification This model is a fine-tune model of https://github.com/stefan-it/turkish-bert by using text classification data where there are 7 categories as follows ``` code_to_label={ 'LABEL_0': 'dunya ', 'LABEL_1': 'ekonomi ', 'LABEL_2': 'kultur ', 'LABEL_3': 'saglik ', 'LABEL_4': 'siyaset ', 'LABEL_5': 'spor ', 'LABEL_6': 'teknoloji '} ``` ## Citation ``` @misc{yildirim2024finetuning, title={Fine-tuning Transformer-based Encoder for Turkish Language Understanding Tasks}, author={Savas Yildirim}, year={2024}, eprint={2401.17396}, archivePrefix={arXiv}, primaryClass={cs.CL} } Please cite the following book if needed @book{yildirim2021mastering, title={Mastering Transformers: Build state-of-the-art models from scratch with advanced natural language processing techniques}, author={Yildirim, Savas and Asgari-Chenaghlu, Meysam}, year={2021}, publisher={Packt Publishing Ltd} } ``` ## Data The following Turkish benchmark dataset is used for fine-tuning https://www.kaggle.com/savasy/ttc4900 ## Quick Start Bewgin with installing transformers as follows > pip install transformers ``` # Code: # import libraries from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer, AutoModelForSequenceClassification tokenizer= AutoTokenizer.from_pretrained("savasy/bert-turkish-text-classification") # build and load model, it take time depending on your internet connection model= AutoModelForSequenceClassification.from_pretrained("savasy/bert-turkish-text-classification") # make pipeline nlp=pipeline("sentiment-analysis", model=model, tokenizer=tokenizer) # apply model nlp("bla bla") # [{'label': 'LABEL_2', 'score': 0.4753005802631378}] code_to_label={ 'LABEL_0': 'dunya ', 'LABEL_1': 'ekonomi ', 'LABEL_2': 'kultur ', 'LABEL_3': 'saglik ', 'LABEL_4': 'siyaset ', 'LABEL_5': 'spor ', 'LABEL_6': 'teknoloji '} code_to_label[nlp("bla bla")[0]['label']] # > 'kultur ' ``` ## How the model was trained ``` ## loading data for Turkish text classification import pandas as pd # https://www.kaggle.com/savasy/ttc4900 df=pd.read_csv("7allV03.csv") df.columns=["labels","text"] df.labels=pd.Categorical(df.labels) traind_df=... eval_df=... # model from simpletransformers.classification import ClassificationModel import torch,sklearn model_args = { "use_early_stopping": True, "early_stopping_delta": 0.01, "early_stopping_metric": "mcc", "early_stopping_metric_minimize": False, "early_stopping_patience": 5, "evaluate_during_training_steps": 1000, "fp16": False, "num_train_epochs":3 } model = ClassificationModel( "bert", "dbmdz/bert-base-turkish-cased", use_cuda=cuda_available, args=model_args, num_labels=7 ) model.train_model(train_df, acc=sklearn.metrics.accuracy_score) ``` For other training models please check https://simpletransformers.ai/ For the detailed usage of Turkish Text Classification please check [python notebook](https://github.com/savasy/TurkishTextClassification/blob/master/Bert_base_Text_Classification_for_Turkish.ipynb)