	---
	language: tr
	---

	# Turkish Text Classification

	This model is a fine-tuned version of https://github.com/stefan-it/turkish-bert, trained on text classification data with the following 7 categories:

	```
	code_to_label={
	'LABEL_0': 'dunya ',
	'LABEL_1': 'ekonomi ',
	'LABEL_2': 'kultur ',
	'LABEL_3': 'saglik ',
	'LABEL_4': 'siyaset ',
	'LABEL_5': 'spor ',
	'LABEL_6': 'teknoloji '}

	```
	## Citation
	Please cite the following works if you use this model:
	```
	@misc{yildirim2024finetuning,
	title={Fine-tuning Transformer-based Encoder for Turkish Language Understanding Tasks},
	author={Savas Yildirim},
	year={2024},
	eprint={2401.17396},
	archivePrefix={arXiv},
	primaryClass={cs.CL}
	}

	@book{yildirim2021mastering,
	title={Mastering Transformers: Build state-of-the-art models from scratch with advanced natural language processing techniques},
	author={Yildirim, Savas and Asgari-Chenaghlu, Meysam},
	year={2021},
	publisher={Packt Publishing Ltd}
	}

	```

	## Data
	The following Turkish benchmark dataset was used for fine-tuning:

	https://www.kaggle.com/savasy/ttc4900

	## Quick Start

	Begin by installing transformers:
	> pip install transformers

	```
	# Import libraries
	from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification

	# Load the tokenizer
	tokenizer = AutoTokenizer.from_pretrained("savasy/bert-turkish-text-classification")

	# Download and load the model; this may take a while depending on your internet connection
	model = AutoModelForSequenceClassification.from_pretrained("savasy/bert-turkish-text-classification")

	# Build the pipeline ("sentiment-analysis" is an alias of the "text-classification" task)
	nlp = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

	# Apply the model
	nlp("bla bla")
	# [{'label': 'LABEL_2', 'score': 0.4753005802631378}]

	code_to_label={
	'LABEL_0': 'dunya ',
	'LABEL_1': 'ekonomi ',
	'LABEL_2': 'kultur ',
	'LABEL_3': 'saglik ',
	'LABEL_4': 'siyaset ',
	'LABEL_5': 'spor ',
	'LABEL_6': 'teknoloji '}

	code_to_label[nlp("bla bla")[0]['label']]
	# > 'kultur '
	```
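
The label lookup at the end of the Quick Start can be wrapped in a small helper. The sketch below is only an illustration: `decode_predictions` is a hypothetical name, not part of transformers, and it assumes predictions have the list-of-dicts shape shown in the Quick Start output. It also strips the trailing space carried by the label strings.

```python
# Hypothetical helper (not part of transformers): map raw pipeline labels
# such as 'LABEL_2' to readable category names.
code_to_label = {
    "LABEL_0": "dunya ",
    "LABEL_1": "ekonomi ",
    "LABEL_2": "kultur ",
    "LABEL_3": "saglik ",
    "LABEL_4": "siyaset ",
    "LABEL_5": "spor ",
    "LABEL_6": "teknoloji ",
}


def decode_predictions(predictions, mapping=code_to_label):
    """Return (category, score) pairs for a list of pipeline predictions."""
    return [(mapping[p["label"]].strip(), p["score"]) for p in predictions]


# Example input copied from the Quick Start output.
sample = [{"label": "LABEL_2", "score": 0.4753005802631378}]
print(decode_predictions(sample))
# [('kultur', 0.4753005802631378)]
```

In practice, `decode_predictions(nlp("bla bla"))` would take the place of the manual dictionary lookup.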

	## How the model was trained

	```

	# Load data for Turkish text classification
	import pandas as pd
	# https://www.kaggle.com/savasy/ttc4900
	df = pd.read_csv("7allV03.csv")
	df.columns = ["labels", "text"]
	df.labels = pd.Categorical(df.labels)

	# Train/evaluation split (elided in the original)
	train_df = ...
	eval_df = ...

	# Model
	from simpletransformers.classification import ClassificationModel
	import torch
	import sklearn.metrics

	# Use the GPU when one is available
	cuda_available = torch.cuda.is_available()

	model_args = {
	"use_early_stopping": True,
	"early_stopping_delta": 0.01,
	"early_stopping_metric": "mcc",
	"early_stopping_metric_minimize": False,
	"early_stopping_patience": 5,
	"evaluate_during_training_steps": 1000,
	"fp16": False,
	"num_train_epochs":3
	}

	model = ClassificationModel(
	"bert",
	"dbmdz/bert-base-turkish-cased",
	use_cuda=cuda_available,
	args=model_args,
	num_labels=7
	)
	model.train_model(train_df, acc=sklearn.metrics.accuracy_score)
	```
	For other model types and training options, see https://simpletransformers.ai/
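
The training snippet does not show how category names become the integer class ids that simpletransformers generally expects. A minimal sketch follows, assuming the ids follow the sorted order of the category names. That assumption matches the `LABEL_0` to `LABEL_6` table but is not confirmed by the source, and `build_label_ids` is a hypothetical helper.

```python
def build_label_ids(categories):
    """Assign ids 0..n-1 to the distinct category names in sorted order."""
    names = sorted({c.strip() for c in categories})
    return {name: idx for idx, name in enumerate(names)}


# Toy label column; the real one would come from the dataset's label field.
categories = ["spor", "dunya", "teknoloji", "ekonomi",
              "kultur", "saglik", "siyaset", "spor"]

label_ids = build_label_ids(categories)
encoded = [label_ids[c] for c in categories]

print(label_ids["kultur"])  # 2
print(encoded)              # [5, 0, 6, 1, 2, 3, 4, 5]
```

With pandas, `pd.Categorical(df.labels).codes` produces the same sorted-order codes.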


	For detailed usage of Turkish text classification, see the [Python notebook](https://github.com/savasy/TurkishTextClassification/blob/master/Bert_base_Text_Classification_for_Turkish.ipynb)