papluca/xlm-roberta-base-language-detection

	---
	license: mit
	tags:
	- generated_from_trainer
	metrics:
	- accuracy
	- f1
	model-index:
	- name: xlm-roberta-base-language-detection
	  results: []
	---

	# xlm-roberta-base-language-detection

	This model is a fine-tuned version of [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) on the [Language Identification](https://huggingface.co/datasets/papluca/language-identification#additional-information) dataset.

	## Intended uses & limitations

	You can directly use this model as a language detector, i.e. for sequence classification tasks. Currently, it supports the following 20 languages:

	`arabic (ar), bulgarian (bg), german (de), modern greek (el), english (en), spanish (es), french (fr), hindi (hi), italian (it), japanese (ja), dutch (nl), polish (pl), portuguese (pt), russian (ru), swahili (sw), thai (th), turkish (tr), urdu (ur), vietnamese (vi), and chinese (zh)`
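
As a rough illustration of the usage described above, the following sketch wraps the model in a `transformers` text-classification pipeline. The Hub identifier `papluca/xlm-roberta-base-language-detection` is inferred from the page header, and the helper names `load_classifier` and `detect_languages` are hypothetical. It also assumes that the predicted labels are the two-letter codes listed above.

```python
from typing import Any, Callable, Dict, List

# Hub identifier assumed from the page header; adjust if the model lives elsewhere.
MODEL_ID = "papluca/xlm-roberta-base-language-detection"

# The 20 language codes listed in this section.
SUPPORTED_LANGUAGES = frozenset(
    "ar bg de el en es fr hi it ja nl pl pt ru sw th tr ur vi zh".split()
)


def load_classifier() -> Callable[..., List[Dict[str, Any]]]:
    """Build a text-classification pipeline (downloads weights on first use)."""
    # Imported lazily so detect_languages can be used without transformers installed.
    from transformers import pipeline

    return pipeline("text-classification", model=MODEL_ID)


def detect_languages(texts: List[str], classifier: Callable[..., Any]) -> List[str]:
    """Return the top predicted label for each input text."""
    # Assumption: the classifier returns one {"label": ..., "score": ...} dict
    # per input text, and each label is one of the codes listed above.
    predictions = classifier(texts, truncation=True)
    return [prediction["label"] for prediction in predictions]
```

Calling `detect_languages(["Bonjour tout le monde"], load_classifier())` should then return a one-element list holding the predicted code. Inputs in languages outside the 20 listed are still assigned one of those labels, so results for them are not meaningful.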

	## Training and evaluation data

	The model achieves the following results on the evaluation set:
	- Loss: 0.0103
	- Accuracy: 0.9977
	- F1: 0.9977
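
For readers unfamiliar with the two metrics, here is a minimal sketch of how accuracy and F1 can be computed from true and predicted labels. The card does not state which F1 averaging was used. The macro average below is an assumption made for illustration, and the function names are hypothetical.

```python
from typing import Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-label F1 scores (assumed averaging mode)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denominator = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the label never occurs.
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)
```

When classes are roughly balanced and errors are rare, the two metrics tend to be close, which is consistent with the identical rounded values reported above.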

	## Training procedure

	### Training hyperparameters

	The following hyperparameters were used during training:
	- learning_rate: 2e-05
	- train_batch_size: 64
	- eval_batch_size: 128
	- seed: 42
	- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
	- lr_scheduler_type: linear
	- num_epochs: 2
	- mixed_precision_training: Native AMP
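
The hyperparameters above map fairly directly onto `transformers.TrainingArguments`. The sketch below is one plausible reconstruction, not the author's actual script. The `output_dir` value is hypothetical, single-device training is assumed so that the batch sizes are per-device values, and `linear_lr` assumes zero warmup steps because the card does not mention warmup.

```python
from typing import Any, Dict

# Keyword arguments mirroring the hyperparameter list above.
TRAINING_KWARGS: Dict[str, Any] = {
    "output_dir": "xlm-roberta-base-language-detection",  # hypothetical path
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 64,  # assumes a single device
    "per_device_eval_batch_size": 128,
    "seed": 42,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "lr_scheduler_type": "linear",
    "num_train_epochs": 2,
    "fp16": True,  # Native AMP mixed precision; requires a CUDA device
}


def build_training_arguments():
    """Instantiate TrainingArguments from the values above."""
    # Imported lazily so the schedule helper below has no hard dependency.
    from transformers import TrainingArguments

    return TrainingArguments(**TRAINING_KWARGS)


def linear_lr(step: int, total_steps: int, base_lr: float = 2e-5,
              warmup_steps: int = 0) -> float:
    """Learning rate under a linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)
```

With the 2188 total optimizer steps reported in the training results, `linear_lr(1094, 2188)` gives roughly half the initial rate at the end of the first epoch, under the no-warmup assumption.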

	### Training results

	| Training Loss | Epoch | Step | Validation Loss | Accuracy | F1     |
	|:-------------:|:-----:|:----:|:---------------:|:--------:|:------:|
	| 0.2492        | 1.0   | 1094 | 0.0149          | 0.9969   | 0.9969 |
	| 0.0101        | 2.0   | 2188 | 0.0103          | 0.9977   | 0.9977 |
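
The step counts in the table can be related to the training batch size with a little arithmetic. The sketch below assumes a single device, no gradient accumulation, and that the last partial batch is kept. Under those assumptions, 1094 steps per epoch at batch size 64 bounds the number of training examples. The function names are hypothetical.

```python
import math
from typing import Tuple

TRAIN_BATCH_SIZE = 64   # from the hyperparameter list
STEPS_PER_EPOCH = 1094  # from the first row of the table


def steps_per_epoch(num_examples: int, batch_size: int) -> int:
    """Optimizer steps per epoch when the last partial batch is kept."""
    return math.ceil(num_examples / batch_size)


def example_count_bounds(steps: int, batch_size: int) -> Tuple[int, int]:
    """Smallest and largest dataset sizes that yield exactly `steps` steps."""
    return (steps - 1) * batch_size + 1, steps * batch_size


low, high = example_count_bounds(STEPS_PER_EPOCH, TRAIN_BATCH_SIZE)
print(low, high)  # 69953 70016
```

So the training split would hold between 69,953 and 70,016 examples if those assumptions are correct. A multi-device run or gradient accumulation would change the estimate.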


	### Framework versions

	- Transformers 4.12.5
	- PyTorch 1.10.0+cu111
	- Datasets 1.15.1
	- Tokenizers 0.10.3