---
license: cc-by-nc-4.0
language:
- bn
library_name: nemo
pipeline_tag: automatic-speech-recognition
tags:
- ASR
- Automatic Speech Recognition
- Bangla ASR
- Bengali ASR
- bn asr
- Bangla fastconformer
- https://arxiv.org/abs/2311.03196
---
## Summary
__titu_stt_bn_fastconformer__ is a [FastConformer](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/models.html#fast-conformer)-based Bangla speech recognition model trained on the ~18K-hour MegaBNSpeech corpus.

Details are in the paper: [https://aclanthology.org/2023.banglalp-1.16/](https://aclanthology.org/2023.banglalp-1.16/)

## Usage
This model can be used to transcribe Bangla audio, and it can also serve as a pre-trained checkpoint for fine-tuning on custom datasets with the [NeMo](https://github.com/NVIDIA/NeMo) framework.
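
For fine-tuning, NeMo ASR reads datasets from JSON-lines manifest files in which each line carries an `audio_filepath`, a `duration` in seconds, and the `text` transcript. The following is a minimal sketch of building such a manifest from PCM WAV files using only the Python standard library. The helper names `wav_duration_seconds` and `write_manifest` are hypothetical and not part of NeMo.

```python
import json
import wave
from pathlib import Path


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def write_manifest(samples, manifest_path):
    """Write (wav_path, transcript) pairs as a JSON-lines manifest.

    Hypothetical helper: one JSON object per line with the keys
    audio_filepath, duration, and text.
    """
    with open(manifest_path, "w", encoding="utf-8") as out:
        for wav_path, text in samples:
            entry = {
                "audio_filepath": str(Path(wav_path)),
                "duration": round(wav_duration_seconds(wav_path), 3),
                "text": text,
            }
            # ensure_ascii=False keeps Bangla text readable in the file.
            out.write(json.dumps(entry, ensure_ascii=False) + "\n")
```

A call such as `write_manifest([("clips/sample_0001.wav", "your transcript")], "train_manifest.json")` would produce a one-line manifest; the paths here are placeholders.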

### Installation
Install [NeMo](https://github.com/NVIDIA/NeMo) with pip; see the NeMo documentation for other installation options.

```
pip install -q 'nemo_toolkit[asr]'
```

### Inference
Download the sample audio file [test_bn_fastconformer.wav](https://huggingface.co/hishab/hishab_bn_fastconformer/blob/main/test_bn_fastconformer.wav), then run:
```py
# Requires: pip install -q 'nemo_toolkit[asr]'

import nemo.collections.asr as nemo_asr

# Download the checkpoint from the Hugging Face Hub and load it.
asr_model = nemo_asr.models.ASRModel.from_pretrained("hishab/titu_stt_bn_fastconformer")

# transcribe() takes a list of audio file paths.
audio_file = "test_bn_fastconformer.wav"
transcriptions = asr_model.transcribe([audio_file])
print(transcriptions)
# ['আজ সরকারি ছুটির দিন দেশের সব শিক্ষা প্রতিষ্ঠান সহ সরকারি আধা সরকারি স্বায়ত্তশাসিত প্রতিষ্ঠান ও ভবনে জাতীয় পতাকা অর্ধনমিত ও কালো পতাকা উত্তোলন করা হয়েছে']
```
Colab notebook for inference: [Bangla FastConformer Infer.ipynb](https://colab.research.google.com/drive/1J3bxXlLBgSf1zOKVKbRYu1VrbEJFLlUc?usp=sharing)
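
Before transcribing your own recordings, it can help to check their format. This card does not state the expected input format; the sketch below assumes 16 kHz mono WAV, which is typical for NeMo ASR models. The function name `check_wav` and the two constants are hypothetical.

```python
import wave

# Assumptions, not stated in this model card:
EXPECTED_SAMPLE_RATE = 16000  # Hz, typical for NeMo ASR models
EXPECTED_CHANNELS = 1         # mono


def check_wav(path):
    """Return a list of format problems; the list is empty if none are found."""
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
    if rate != EXPECTED_SAMPLE_RATE:
        problems.append(
            f"sample rate is {rate} Hz, expected {EXPECTED_SAMPLE_RATE} Hz"
        )
    if channels != EXPECTED_CHANNELS:
        problems.append(
            f"file has {channels} channels, expected {EXPECTED_CHANNELS}"
        )
    return problems
```

If `check_wav("my_recording.wav")` returns a non-empty list, resampling or down-mixing the file before calling `transcribe` is a reasonable first step.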

## Training Datasets

| Channel Category | Hours     |
| ---------------- | --------: |
| News             | 17,640.00 |
| Talkshow         | 688.82    |
| Vlog             | 0.02      |
| Crime Show       | 4.08      |
| Total            | 18,332.92 |


## Training Details

The training set comprises 17,640 hours of news channel content, 688.82 hours of talk shows, 0.02 hours of vlogs, and 4.08 hours of crime shows, for a total of 18,332.92 hours.
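
The category hours can be checked against the stated total, and each category's share computed, with a few lines of Python. This is only arithmetic on the figures listed in this card; the variable names are illustrative.

```python
# Hours per channel category, copied from the training dataset figures above.
hours = {
    "News": 17640.00,
    "Talkshow": 688.82,
    "Vlog": 0.02,
    "Crime Show": 4.08,
}

total = sum(hours.values())

# Percentage of the corpus contributed by each category.
shares = {name: 100 * h / total for name, h in hours.items()}

for name, h in hours.items():
    print(f"{name:<10} {h:>10,.2f} h  {shares[name]:6.2f}%")
print(f"{'Total':<10} {total:>10,.2f} h")
```

The sum matches the stated 18,332.92 hours, and news content makes up about 96% of the corpus.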

## Evaluation


![image/png](https://cdn-uploads.huggingface.co/production/uploads/64df9253cccd823564c3303b/WvMlp95z2-GXT6AYfwW8Y.png)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64df9253cccd823564c3303b/O2RA9TAedIv1OTqgdIap5.png)
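
To score the model on your own test data, word error rate (WER) is a common ASR metric. Whether the figures above report exactly this metric, and with what text normalization, is an assumption here. The sketch below computes WER as word-level edit distance divided by the number of reference words; `word_error_rate` is a hypothetical helper, not a NeMo function.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Splitting on whitespace is a simplification; real evaluations usually normalize punctuation and spelling variants first.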

## Citation
```
@inproceedings{nandi-etal-2023-pseudo,
    title = "Pseudo-Labeling for Domain-Agnostic {B}angla Automatic Speech Recognition",
    author = "Nandi, Rabindra Nath  and
      Menon, Mehadi  and
      Muntasir, Tareq  and
      Sarker, Sagor  and
      Muhtaseem, Quazi Sarwar  and
      Islam, Md. Tariqul  and
      Chowdhury, Shammur  and
      Alam, Firoj",
    editor = "Alam, Firoj  and
      Kar, Sudipta  and
      Chowdhury, Shammur Absar  and
      Sadeque, Farig  and
      Amin, Ruhul",
    booktitle = "Proceedings of the First Workshop on Bangla Language Processing (BLP-2023)",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.banglalp-1.16",
    doi = "10.18653/v1/2023.banglalp-1.16",
    pages = "152--162",
    abstract = "One of the major challenges for developing automatic speech recognition (ASR) for low-resource languages is the limited access to labeled data with domain-specific variations. In this study, we propose a pseudo-labeling approach to develop a large-scale domain-agnostic ASR dataset. With the proposed methodology, we developed a 20k+ hours labeled Bangla speech dataset covering diverse topics, speaking styles, dialects, noisy environments, and conversational scenarios. We then exploited the developed corpus to design a conformer-based ASR system. We benchmarked the trained ASR with publicly available datasets and compared it with other available models. To investigate the efficacy, we designed and developed a human-annotated domain-agnostic test set composed of news, telephony, and conversational data among others. Our results demonstrate the efficacy of the model trained on psuedo-label data for the designed test-set along with publicly-available Bangla datasets. The experimental resources will be publicly available.https://github.com/hishab-nlp/Pseudo-Labeling-for-Domain-Agnostic-Bangla-ASR",
}
```