Kyrgyz Named Entity Recognition

Fine-tuning bert-base-multilingual-cased on the WikiAnn dataset to perform NER for the Kyrgyz language.

WARNING: this model is not usable (see the metrics below) and was built only as a proof of concept. I'll update the model after cleaning up the WikiAnn dataset (its Kyrgyz (ky) part contains only 100 train/validation/test items) or after putting together a completely new dataset.
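
For reference, a minimal sketch of inspecting that split with the datasets library; the dataset and config names ("wikiann", "ky") are the standard Hub identifiers and are an assumption about how this model was trained:

```python
from datasets import load_dataset

# Load the Kyrgyz portion of WikiAnn; each split holds only 100 examples.
dataset = load_dataset("wikiann", "ky")

print(dataset)              # DatasetDict with train/validation/test splits
print(dataset["train"][0])  # tokens and ner_tags for one example
```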

Label ID and its corresponding label name

| Label ID | Label Name |
|----------|------------|
| 0        | O          |
| 1        | B-PER      |
| 2        | I-PER      |
| 3        | B-ORG      |
| 4        | I-ORG      |
| 5        | B-LOC      |
| 6        | I-LOC      |
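
The same mapping written out as the id2label/label2id dictionaries a token-classification config typically carries (a sketch mirroring the table above; whether the uploaded config stores exactly these entries is an assumption):

```python
# Label mapping used by the token-classification head (mirrors the table above).
id2label = {
    0: "O",
    1: "B-PER",
    2: "I-PER",
    3: "B-ORG",
    4: "I-ORG",
    5: "B-LOC",
    6: "I-LOC",
}
label2id = {label: idx for idx, label in id2label.items()}
```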

Results

| Name           | Overall F1 | LOC F1   | ORG F1   | PER F1   |
|----------------|------------|----------|----------|----------|
| Train set      | 0.595683   | 0.570312 | 0.687179 | 0.549180 |
| Validation set | 0.461333   | 0.551181 | 0.401913 | 0.425087 |
| Test set       | 0.442622   | 0.456852 | 0.469565 | 0.413114 |
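
The card does not state which tool produced these numbers; as a sketch, assuming entity-level F1 computed with the seqeval library (the usual choice for CoNLL-style NER evaluation), the overall and per-entity scores can be obtained like this:

```python
from seqeval.metrics import classification_report, f1_score

# Gold and predicted tag sequences, one list of BIO tags per sentence (toy data).
y_true = [["B-PER", "I-PER", "O", "B-LOC"]]
y_pred = [["B-PER", "I-PER", "O", "O"]]

print(f1_score(y_true, y_pred))               # overall entity-level F1
print(classification_report(y_true, y_pred))  # per-entity (PER/ORG/LOC) breakdown
```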

Example

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load the fine-tuned Kyrgyz NER model and its tokenizer from the Hub.
tokenizer = AutoTokenizer.from_pretrained("murat/kyrgyz_language_NER")
model = AutoModelForTokenClassification.from_pretrained("murat/kyrgyz_language_NER")

# Build a token-classification pipeline and run it on a sample Kyrgyz name.
nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "Жусуп Мамай"

ner_results = nlp(example)
print(ner_results)
```
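
Because the tokenizer splits words into subword pieces, the raw pipeline output lists one entry per piece. A possible variant that groups pieces back into whole entity spans (aggregation_strategy is a standard argument of the transformers NER pipeline in recent versions):

```python
from transformers import pipeline

# Same checkpoint, but merge subword pieces into whole entity spans.
nlp_grouped = pipeline(
    "ner",
    model="murat/kyrgyz_language_NER",
    aggregation_strategy="simple",
)
print(nlp_grouped("Жусуп Мамай"))
```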