---
license: cc-by-4.0
language:
- ca
- es
base_model:
- nvidia/stt_es_conformer_transducer_large
tags:
- automatic-speech-recognition
- NeMo
model-index:
- name: stt_ca-es_conformer_transducer_large
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: CV Benchmark Catalan Accents
      type: projecte-aina/commonvoice_benchmark_catalan_accents
      config: ca
      split: test
      args:
        language: ca
    metrics:
    - name: Test WER
      type: wer
      value: 2.503
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Mozilla Common Voice 17.0
      type: mozilla-foundation/common_voice_17_0
      config: ca
      split: test
      args:
        language: es
    metrics:
    - name: Test WER
      type: wer
      value: 3.88
---

# NVIDIA Conformer-Transducer Large (ca-es)

## Table of Contents
<details>
<summary>Click to expand</summary>

- [Model Description](#model-description)
- [Intended Uses and Limitations](#intended-uses-and-limitations)
- [How to Get Started with the Model](#how-to-get-started-with-the-model)
- [Training Details](#training-details)
- [Citation](#citation)
- [Additional Information](#additional-information)

</details>
## Summary

The "stt_ca-es_conformer_transducer_large" is an acoustic model based on ["NVIDIA/stt_es_conformer_transducer_large"](https://huggingface.co/nvidia/stt_es_conformer_transducer_large/) suitable for bilingual Catalan-Spanish Automatic Speech Recognition.

## Model Description

This model transcribes speech into the lowercase Catalan and Spanish alphabet, including spaces, and was fine-tuned on a bilingual ca-es dataset comprising 7,426 hours of audio. It is a "large" variant of Conformer-Transducer, with around 120 million parameters. See the [model architecture](#model-architecture) section and the [NeMo documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#conformer-transducer) for complete architecture details.

## Intended Uses and Limitations

This model can be used for Automatic Speech Recognition (ASR) in Catalan and Spanish. It is intended to transcribe audio files in Catalan and Spanish to plain text without punctuation.

### Installation

To use this model, install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend installing it after you have installed the latest PyTorch version.

```
pip install nemo_toolkit['all']
```

### For Inference

To transcribe audio in Catalan or Spanish with this model, you can follow this example:

```python
import nemo.collections.asr as nemo_asr

# Paths to the downloaded .nemo checkpoint and the audio file to transcribe
model_path = "stt_ca-es_conformer_transducer_large.nemo"
audio_path = "audio.wav"

nemo_asr_model = nemo_asr.models.EncDecRNNTBPEModel.restore_from(model_path)
transcription = nemo_asr_model.transcribe([audio_path])[0][0]
print(transcription)
```

## Training Details

### Training data

The model was trained on bilingual datasets in Catalan and Spanish, for a total of 7,426 hours.

### Training procedure

This model is the result of fine-tuning the base model ["nvidia/stt_es_conformer_transducer_large"](https://huggingface.co/nvidia/stt_es_conformer_transducer_large) by following this [tutorial](https://github.com/NVIDIA/NeMo/blob/main/tutorials/asr/Transducers_with_HF_Datasets.ipynb).
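### Evaluation metric

The benchmark figures reported in this card are word error rates (WER): the word-level edit distance between the model's hypothesis and the reference transcript, divided by the number of reference words. A minimal sketch of the computation (the `wer` function and the example sentences below are illustrative, not part of the evaluation pipeline):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("bon dia a tothom", "bon dia tothom"))  # one deletion in four words → 0.25
```

Since the model outputs lowercase text without punctuation, references are typically normalized the same way before scoring.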
## Citation

If this model contributes to your research, please cite this work:

```bibtex
@misc{messaoudi2024conformertransducercaes,
      title={Bilingual ca-es ASR Model: stt_ca-es_conformer_transducer_large},
      author={Messaoudi, Abir and Külebi, Baybars},
      organization={Barcelona Supercomputing Center},
      url={https://huggingface.co/projecte-aina/stt_ca-es_conformer_transducer_large},
      year={2024}
}
```

## Additional Information

### Author

The fine-tuning process was performed during 2024 in the [Language Technologies Unit](https://huggingface.co/BSC-LT) of the [Barcelona Supercomputing Center](https://www.bsc.es/) by [Abir Messaoudi](https://huggingface.co/AbirMessaoudi).

### Contact

For further information, please send an email to .

### Copyright

Copyright (c) 2024 by Language Technologies Unit, Barcelona Supercomputing Center.

### License

[CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/)

### Funding

This work has been promoted and financed by the Generalitat de Catalunya through the [Aina project](https://projecteaina.cat/).

The training of the model was possible thanks to the computing time provided by the [Barcelona Supercomputing Center](https://www.bsc.es/) through MareNostrum 5.