nikitast/multilang-classifier-roberta

Text Classification

language classification


RoBERTa for Multilabel Language Classification

Training

RoBERTa was fine-tuned on small subsets of the OpenSubtitles, OSCAR, and Tatoeba datasets (~9k samples per language).

Multilingual training samples were created with a heuristic algorithm, available at https://github.com/n1kstep/lang-classifier
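The card does not describe the heuristic itself, so the following is only an illustrative sketch of one way multilingual samples could be built: sentences from a few randomly chosen languages are joined and the sample receives a multi-hot label. The function name `make_multilingual_sample`, the `max_langs` parameter, and the toy sentence pools are all hypothetical; the linked repository is the reference for the actual method.

```python
import random


def make_multilingual_sample(pools, rng, max_langs=2):
    """Join one sentence from each of 1..max_langs randomly chosen languages.

    pools: dict mapping language code -> list of sentences (hypothetical input).
    Returns (text, labels), where labels is a multi-hot list ordered by the
    sorted language codes of `pools`.
    """
    langs = sorted(pools)
    k = rng.randint(1, max_langs)      # how many languages to mix in this sample
    chosen = rng.sample(langs, k)      # distinct languages, random order
    parts = [rng.choice(pools[lang]) for lang in chosen]
    text = " ".join(parts)
    labels = [1 if lang in chosen else 0 for lang in langs]
    return text, labels


# Toy pools standing in for the real corpora (assumed data, not from the card).
pools = {
    "en": ["Good morning.", "Where is the station?"],
    "de": ["Guten Morgen.", "Wo ist der Bahnhof?"],
    "ru": ["Доброе утро.", "Где находится вокзал?"],
}
rng = random.Random(0)
samples = [make_multilingual_sample(pools, rng) for _ in range(5)]
for text, labels in samples:
    print(labels, text)
```

Each generated sample carries one label bit per language, which is the target shape a multilabel classifier is trained against.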

data source	languages
open_subtitles	ka, he, en, de
oscar	be, kk, az, hu
tatoeba	ru, uk
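Because the model is multilabel over the ten language codes in the table above, its raw outputs are usually decoded by applying a sigmoid to each logit independently and keeping every label above a threshold. The sketch below shows that decoding step in plain Python. The label order in `LABELS`, the 0.5 threshold, and the example logits are assumptions for illustration; the real order comes from the model's own configuration.

```python
import math

# Language codes from the table above. The ORDER is an assumption; the real
# mapping should be read from the model's configuration.
LABELS = ["ka", "he", "en", "de", "be", "kk", "az", "hu", "ru", "uk"]


def decode_multilabel(logits, labels=LABELS, threshold=0.5):
    """Return (label, probability) pairs whose sigmoid score meets the threshold."""
    if len(logits) != len(labels):
        raise ValueError("expected one logit per label")
    # Independent sigmoid per label: several languages can be active at once.
    probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    return [(lab, p) for lab, p in zip(labels, probs) if p >= threshold]


# Made-up logits for a text mixing English and Russian.
example_logits = [-4.0, -3.5, 2.2, -1.0, -5.0, -4.2, -3.9, -4.4, 1.3, -0.2]
detected = decode_multilabel(example_logits)
print([lab for lab, _ in detected])
```

Unlike a softmax, the per-label sigmoid does not force the scores to sum to one, so a mixed-language input can report more than one language.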

Validation

The metrics below were obtained by validating on a separate part of the dataset (~1k samples per language).

Training Loss	Validation Loss	F1-Score	ROC AUC	Accuracy	Support
0.161500	0.110949	0.947844	0.953939	0.762063	26858
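The card does not state how F1-Score and Accuracy were computed. One possible reading, offered here only as an assumption, is micro-averaged F1 over all label decisions and exact-match (subset) accuracy over whole samples. Under that reading accuracy can sit well below F1, because a single wrong label makes the entire sample count as incorrect. The sketch below uses made-up labels to show the effect; `micro_f1` and `subset_accuracy` are hypothetical helper names.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP/FP/FN over every (sample, label) cell."""
    tp = fp = fn = 0
    for t_row, p_row in zip(y_true, y_pred):
        for t, p in zip(t_row, p_row):
            tp += int(t == 1 and p == 1)
            fp += int(t == 0 and p == 1)
            fn += int(t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose full label vector is predicted exactly."""
    exact = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))
    return exact / len(y_true)


# Made-up multi-hot labels for four samples and three languages.
y_true = [[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1]]
y_pred = [[1, 0, 0], [1, 0, 0], [0, 0, 1], [0, 1, 1]]  # one missed label in sample 2

f1 = micro_f1(y_true, y_pred)
acc = subset_accuracy(y_true, y_pred)
print(f"micro F1 = {f1:.3f}, subset accuracy = {acc:.3f}")
```

In this toy case one missed label out of six costs a quarter of the subset accuracy while the micro F1 stays above 0.9.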
