mms-common_voice_13_0-eo-1, an Esperanto speech recognizer

This model is a fine-tuned version of patrickvonplaten/mms-300m on the mozilla-foundation/common_voice_13_0 Esperanto dataset. It achieves the following results on the evaluation set:

Loss: 0.2257
Cer: 0.0209
Wer: 0.0678

While the training loss is lower, this model does not perform significantly better than xekri/wav2vec2-common_voice_13_0-eo-3.

The first 10 samples in the test set:

Actual	Predicted	CER
`la orienta parto apud benino kaj niĝerio estis nomita sklavmarbordo`	`la orienta parto apud benino kaj niĝerio estis nomita sklavmarbordo`	0.0
`en la sekva jaro li ricevis premion`	`en la sekva jaro li ricevis premion`	0.0
`ŝi studis historion ĉe la universitato de brita kolumbio`	`ŝi studis historion ĉe la universitato de brita kolumbio`	0.0
`larĝaj ŝtupoj kuras al la fasado`	`larĝaj ŝtupoj kuras al la fasado`	0.0
`la municipo ĝuas duan epokon de etendo kaj disvolviĝo`	`la municipo ĝuas duan epokon de etendo kaj disvolviĝo`	0.0
`li estis ankaŭ katedrestro kaj dekano`	`li estis ankaŭ katedresto kaj dekano`	0.02702702702702703
`librovendejo apartenas al la muzeo`	`librovendejo apartenas al la muzeo`	0.0
`ĝi estas kutime malfacile videbla kaj troviĝas en subkreskaĵaro de arbaroj`	`ĝi estas kutime malfacile videbla kaj troviĝas en subkreskaĵo de arbaroj`	0.02702702702702703
`unue ili estas ruĝaj poste brunaj`	`unue ili estas ruĝaj poste brunaj`	0.0
`la loĝantaro laboras en la proksima ĉefurbo`	`la loĝantaro laboras en la proksima ĉefurbo`	0.0
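
The per-sample CER values above are consistent with the usual definition: character-level edit distance divided by the length of the reference. The following is a minimal sketch under that assumption, not necessarily the metric implementation used to produce this card. The function names `edit_distance`, `cer`, and `wer` are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference item
                curr[j - 1] + 1,     # insert a hypothesis item
                prev[j - 1] + cost,  # substitute or match
            ))
        prev = curr
    return prev[-1]


def cer(reference, prediction):
    """Character error rate: character edits per reference character."""
    return edit_distance(reference, prediction) / len(reference)


def wer(reference, prediction):
    """Word error rate: word edits per reference word."""
    ref_words = reference.split()
    return edit_distance(ref_words, prediction.split()) / len(ref_words)


# Sixth sample from the table: one deleted character in a 37-character reference.
sample_cer = cer("li estis ankaŭ katedrestro kaj dekano",
                 "li estis ankaŭ katedresto kaj dekano")
print(sample_cer)  # 1/37, about 0.027
```

The same sample has a WER of 1/6, because one of its six words is wrong. This is why WER is much higher than CER for the same predictions.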

Model description

See patrickvonplaten/mms-300m or, equivalently, facebook/wav2vec2-large-xlsr-53. As far as I can tell, the only difference is that the speech front-end of the mms-300m checkpoint was trained on more languages and more data.

Intended uses & limitations

Speech recognition for Esperanto. The base model was pretrained and fine-tuned on speech audio sampled at 16 kHz. When using the model, make sure that your speech input is also sampled at 16 kHz.
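
One way to satisfy the 16 kHz requirement is to resample while loading the audio. The sketch below assumes that the checkpoint is published on the Hugging Face Hub as `xekri/mms-common_voice_13_0-eo-1` together with its processor files, and that `transformers`, `torch`, and `librosa` are installed. The helper name `transcribe` is hypothetical.

```python
MODEL_ID = "xekri/mms-common_voice_13_0-eo-1"  # assumed Hub id of this checkpoint
TARGET_SAMPLE_RATE = 16_000                    # the model expects 16 kHz input


def transcribe(path, model_id=MODEL_ID):
    """Load an audio file, resample it to 16 kHz, and return the transcript."""
    # Imported lazily so the constants above are usable without these packages.
    import librosa
    from transformers import pipeline

    # librosa.load converts to mono and resamples to the requested rate.
    audio, _ = librosa.load(path, sr=TARGET_SAMPLE_RATE)

    asr = pipeline("automatic-speech-recognition", model=model_id)
    # Passing the sampling rate explicitly lets the pipeline verify it.
    result = asr({"raw": audio, "sampling_rate": TARGET_SAMPLE_RATE})
    return result["text"]
```

Calling `transcribe("sample.wav")`, where `sample.wav` is a placeholder for your own recording, downloads the model on first use and returns the recognized text as a string.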

Training and evaluation data

The training split was set to train[:15000] while the eval split was set to validation[:1500].

Training procedure

The same as xekri/wav2vec2-common_voice_13_0-eo-3.

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 3e-05
train_batch_size: 8
eval_batch_size: 8
seed: 42
gradient_accumulation_steps: 4
total_train_batch_size: 32
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
layerdrop: 0.1
lr_scheduler_type: linear
lr_scheduler_warmup_steps: 500
num_epochs: 100
mixed_precision_training: Native AMP
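
To make the relationship between these values concrete, the sketch below derives the total batch size and evaluates a linear schedule with warmup. It assumes a single training device, which is consistent with 8 × 4 = 32, and the common definition of a linear schedule: a linear ramp to the peak rate during warmup, then a linear decay to zero. `ASSUMED_TOTAL_STEPS` is a rough estimate for illustration only, not a value reported in this card.

```python
PEAK_LR = 3e-5
WARMUP_STEPS = 500
TRAIN_BATCH_SIZE = 8
GRAD_ACCUM_STEPS = 4
TRAIN_EXAMPLES = 15_000       # size of the train[:15000] split
ASSUMED_TOTAL_STEPS = 46_900  # assumption: about 100 epochs at ~469 steps each


def effective_batch_size(per_device, accum_steps, devices=1):
    """Examples consumed per optimizer step."""
    return per_device * accum_steps * devices


def linear_lr(step, total_steps, peak_lr=PEAK_LR, warmup_steps=WARMUP_STEPS):
    """Learning rate at an optimizer step under linear warmup and linear decay."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


batch = effective_batch_size(TRAIN_BATCH_SIZE, GRAD_ACCUM_STEPS)
epochs_per_1000_steps = 1000 * batch / TRAIN_EXAMPLES

print(batch)                                # 32
print(round(epochs_per_1000_steps, 2))      # 2.13
print(linear_lr(250, ASSUMED_TOTAL_STEPS))  # halfway through warmup
```

The figure of about 2.13 epochs per 1000 steps matches the first row of the training results. Later rows can drift slightly from this estimate.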

Training results

Training Loss	Epoch	Step	Cer	Validation Loss	Wer
2.3129	2.13	1000	0.0580	0.5042	0.2703
0.2251	4.27	2000	0.0295	0.1782	0.1198
0.1462	6.4	3000	0.0265	0.1635	0.1019
0.1162	8.53	4000	0.0248	0.1619	0.0931
0.0988	10.67	5000	0.0249	0.1654	0.0940
0.0904	12.8	6000	0.0242	0.1702	0.0845
0.0813	14.93	7000	0.0239	0.1658	0.0846
0.074	17.09	8000	0.0240	0.1763	0.0793
0.0692	19.22	9000	0.0243	0.1768	0.0835
0.0652	21.36	10000	0.0237	0.1812	0.0797
0.0593	23.5	11000	0.0221	0.1810	0.0750
0.0547	25.63	12000	0.0233	0.1835	0.0794
0.0514	27.76	13000	0.0224	0.1828	0.0761
0.0488	29.9	14000	0.0224	0.1844	0.0766
0.0478	32.03	15000	0.0226	0.1910	0.0769
0.0459	34.16	16000	0.0239	0.1965	0.0831
0.0429	36.3	17000	0.0220	0.2000	0.0760
0.0443	38.43	18000	0.0228	0.2039	0.0774
0.0398	40.56	19000	0.0219	0.1981	0.0755
0.0408	42.7	20000	0.0239	0.2053	0.0776
0.0406	44.83	21000	0.0221	0.2050	0.0740
0.0383	46.96	22000	0.0224	0.2128	0.0733
0.0379	49.1	23000	0.0220	0.2110	0.0731
0.0369	51.23	24000	0.0220	0.2145	0.0745
0.0341	53.36	25000	0.0222	0.2146	0.0725
0.0322	55.5	26000	0.0216	0.2130	0.0710
0.0316	57.63	27000	0.0222	0.2134	0.0716
0.0324	59.76	28000	0.0222	0.2172	0.0731
0.0315	61.9	29000	0.0228	0.2207	0.0745
0.0294	64.03	30000	0.0218	0.2183	0.0717
0.028	66.16	31000	0.0214	0.2185	0.0696
0.0263	68.3	32000	0.0215	0.2167	0.0696
0.0299	70.43	33000	0.0217	0.2201	0.0709
0.0273	72.56	34000	0.0222	0.2164	0.0724
0.0269	74.7	35000	0.0220	0.2240	0.0693
0.0264	76.92	36000	0.0218	0.2220	0.0704
0.0257	79.05	37000	0.0217	0.2229	0.0688
0.0251	81.19	38000	0.0215	0.2263	0.0694
0.0245	83.32	39000	0.0210	0.2253	0.0673
0.0243	85.45	40000	0.0215	0.2264	0.0692
0.0236	87.59	41000	0.0217	0.2261	0.0689
0.0225	89.72	42000	0.0212	0.2265	0.0680
0.023	91.85	43000	0.0210	0.2265	0.0674
0.0217	93.99	44000	0.0209	0.2265	0.0677
0.022	96.12	45000	0.0211	0.2254	0.0685
0.0219	98.25	46000	0.0208	0.2262	0.0672

Framework versions

Transformers 4.29.1
Pytorch 2.0.1+cu118
Datasets 2.12.0
Tokenizers 0.13.3
