Tags: Automatic Speech Recognition, NeMo, PyTorch, Armenian, speech, audio, low-resource-languages, CTC, Conformer, Transformer

Model Overview

This model is a fine-tuned version of the NVIDIA NeMo Conformer-CTC large model, adapted for transcribing Armenian speech.

NVIDIA NeMo: Training

To train, fine-tune, or play with the model, you will need to install NVIDIA NeMo. We recommend installing it after you've installed the latest PyTorch version.

pip install nemo_toolkit['all']
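
To confirm the toolkit is available in your environment, a quick optional check (assuming a standard install) is:

import nemo
import nemo.collections.asr as nemo_asr

print(nemo.__version__)  # prints the installed NeMo version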

How to Use this Model

The model is available for use in the NeMo toolkit, and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

Automatically instantiate the model

import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.ASRModel.from_pretrained("Yeroyan/stt_arm_conformer_ctc_large")

Transcribing using Python

First, let's get a sample

wget https://dldata-public.s3.us-east-2.amazonaws.com/2086-149220-0033.wav

Then simply do:

asr_model.transcribe(['2086-149220-0033.wav'])

Transcribing many audio files

python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py \
  pretrained_name="Yeroyan/stt_arm_conformer_ctc_large" \
  audio_dir="<DIRECTORY CONTAINING AUDIO FILES>"

Input

This model accepts 16 kHz (16000 Hz) mono-channel audio (WAV files) as input.
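
If your recordings are not already 16 kHz mono, you can convert them before transcription. Below is a minimal sketch using librosa and soundfile (these libraries and the file names are assumptions for illustration; any resampling tool works):

import librosa
import soundfile as sf

# Load arbitrary audio, downmix to mono, and resample to 16 kHz
audio, sr = librosa.load("input_audio.wav", sr=16000, mono=True)
sf.write("input_audio_16k.wav", audio, 16000)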

Output

This model provides transcribed speech as a string for a given audio sample.
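
For example, the transcribe call shown above returns one transcription per input file (depending on the NeMo version, entries may be plain strings or hypothesis objects carrying a .text attribute):

transcriptions = asr_model.transcribe(['2086-149220-0033.wav'])
print(transcriptions[0])  # transcription of the first (and only) file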

Model Architecture

The model uses a Conformer encoder (a convolution-augmented Transformer) trained with a CTC (Connectionist Temporal Classification) loss for speech recognition.
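
To inspect the exact encoder and decoder configuration of this checkpoint, you can print the config attached to the loaded model. This is a quick sketch using standard NeMo/OmegaConf attributes:

from omegaconf import OmegaConf

# asr_model as loaded via from_pretrained above
print(OmegaConf.to_yaml(asr_model.cfg.encoder))  # Conformer encoder hyperparameters
print(OmegaConf.to_yaml(asr_model.cfg.decoder))  # CTC decoder and vocabulary settings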

Training

This model was originally trained on diverse English speech datasets and then fine-tuned for 100 epochs on Armenian speech data.
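
For reference, a fine-tuning run of this kind can be launched with the standard NeMo CTC-BPE training script. The command below is illustrative only: the English base checkpoint name (stt_en_conformer_ctc_large) is an assumption, and the tokenizer and manifest paths are placeholders rather than the exact setup used for this model.

python [NEMO_GIT_FOLDER]/examples/asr/asr_ctc/speech_to_text_ctc_bpe.py \
  +init_from_pretrained_model="stt_en_conformer_ctc_large" \
  model.tokenizer.dir="<ARMENIAN_TOKENIZER_DIR>" \
  model.train_ds.manifest_filepath="<TRAIN_MANIFEST.json>" \
  model.validation_ds.manifest_filepath="<VAL_MANIFEST.json>" \
  trainer.max_epochs=100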

Datasets

The model was fine-tuned on the Armenian subset of the Common Voice corpus, version 17.0 (Mozilla Foundation). For dataset processing, we used the following fork: NeMo-Speech-Data-Processor.
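
After processing, NeMo consumes the data as JSON-lines manifests, one utterance per line, using the standard manifest keys; the file path and transcript below are illustrative only:

{"audio_filepath": "clips/common_voice_hy_000001.wav", "duration": 3.2, "text": "բարև ձեզ"}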

Performance

| Version | Tokenizer | Vocabulary Size | MCV Test WER | MCV Test WER (no punctuation) | Train Dataset |
|---------|-----------|-----------------|--------------|-------------------------------|---------------|
| 1.6.0   | SentencePiece Unigram (Armenian) | 128 | 15.0% | 12.44% | MCV v17 |
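
A WER figure like the ones above can be reproduced on any test manifest with NeMo's word_error_rate helper. This is a minimal sketch: the manifest path is a placeholder, and asr_model is the model loaded earlier.

import json
from nemo.collections.asr.metrics.wer import word_error_rate

# Collect reference transcripts and audio paths from a NeMo-style test manifest
references, audio_paths = [], []
with open("test_manifest.json") as f:
    for line in f:
        entry = json.loads(line)
        references.append(entry["text"])
        audio_paths.append(entry["audio_filepath"])

# Transcribe and score (on recent NeMo versions, convert hypothesis objects to text first)
hypotheses = asr_model.transcribe(audio_paths)
print("WER:", word_error_rate(hypotheses=hypotheses, references=references))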

Limitations

  • The model covers Eastern Armenian only.
  • "եւ" needs to be replaced with "և" after each prediction: the tokenizer does not contain the "և" symbol, a unique linguistic exception in that it has no uppercase form (see the snippet after this list).
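
A minimal post-processing step for the "եւ" replacement described above (the audio file name is a placeholder):

def postprocess(text: str) -> str:
    # Restore the "և" ligature, which the tokenizer cannot emit directly
    return text.replace("եւ", "և")

predictions = asr_model.transcribe(["<AUDIO FILE>.wav"])
predictions = [postprocess(t) for t in predictions]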

References

[1] NVIDIA NeMo Toolkit
[2] Enhancing ASR on low-resource languages (paper)
