---
language: vi
datasets:
- legacy-datasets/common_voice
- vlsp2020_vinai_100h
- AILAB-VNUHCM/vivos
- doof-ferb/vlsp2020_vinai_100h
- doof-ferb/fpt_fosd
- doof-ferb/infore1_25hours
- linhtran92/viet_bud500
- doof-ferb/LSVSC
- doof-ferb/vais1000
- doof-ferb/VietMed_labeled
- NhutP/VSV-1100
- doof-ferb/Speech-MASSIVE_vie
- doof-ferb/BibleMMS_vie
- capleaf/viVoice
metrics:
- wer
pipeline_tag: automatic-speech-recognition
tags:
- transcription
- audio
- speech
- chunkformer
- asr
- automatic-speech-recognition
- long-form
license: cc-by-nc-4.0
model-index:
- name: ChunkFormer Large Vietnamese
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: common-voice-vietnamese
      type: common_voice
      args: vi
    metrics:
    - name: Test WER
      type: wer
      value: x
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: VIVOS
      type: vivos
      args: vi
    metrics:
    - name: Test WER
      type: wer
      value: x
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: VLSP - Task 1
      type: vlsp
      args: vi
    metrics:
    - name: Test WER
      type: wer
      value: x
---

# **ChunkFormer-Large-Vie: Large-Scale Pretrained ChunkFormer for Vietnamese Automatic Speech Recognition**

[![License: CC BY-NC 4.0](https://img.shields.io/badge/License-CC%20BY--NC%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc/4.0/)
[![GitHub](https://img.shields.io/badge/GitHub-ChunkFormer-blue)](https://github.com/khanld/chunkformer)
[![Paper](https://img.shields.io/badge/Paper-ICASSP%202025-green)](https://your-paper-link)

### Table of contents
1. [Model Description](#model-description)
2. [Documentation and Implementation](#documentation-and-implementation)
3. [Benchmark Results](#benchmark-results)
4. [Usage](#usage)
5. [Citation](#citation)
6. [Contact](#contact)

---

### Model Description
**ChunkFormer-Large-Vie** is a large-scale Vietnamese Automatic Speech Recognition (ASR) model based on the **ChunkFormer** architecture, introduced at **ICASSP 2025**.
The model was fine-tuned on approximately **2,000 hours** of Vietnamese speech drawn from the diverse public datasets listed above.

### Documentation and Implementation
The [documentation](#) and [implementation](#) of ChunkFormer are publicly available.

### Benchmark Results

| No. | Model       | VIVOS | Common Voice | VLSP - Task 1 | Avg. |
|-----|-------------|-------|--------------|---------------|------|
| 1   | ChunkFormer | x     | x            | x             | x    |
| 2   | PhoWhisper  | x     | x            | x             | x    |
| 3   | X           | x     | x            | x             | x    |
| 4   | Y           | x     | x            | x             | x    |

---

### Usage
To use the ChunkFormer model for Vietnamese Automatic Speech Recognition, follow these steps:

1. **Download the ChunkFormer Repository**

   Clone the ChunkFormer repository and install its dependencies:
   ```bash
   git clone https://github.com/khanld/chunkformer.git
   cd chunkformer
   pip install -r requirements.txt
   ```

2. **Download the Model Checkpoint from Hugging Face**

   Fetch the checkpoint with `git lfs`:
   ```bash
   git lfs install
   git clone https://huggingface.co/khanhld/chunkformer-large-vietnamese
   ```
   This downloads the model checkpoint into the `chunkformer-large-vietnamese` folder inside your `chunkformer` directory.

3.
   **Run the Model**

   Use the following command to transcribe a long audio file:
   ```bash
   python decode.py \
       --model_checkpoint path/to/chunkformer-large-vietnamese \
       --long_form_audio path/to/long_audio.wav \
       --chunk_size 64 \
       --left_context_size 128 \
       --right_context_size 128
   ```

---

### Citation
If you use this work in your research, please cite:

```bibtex
@inproceedings{your_paper,
  title={ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription},
  author={Khanh Le and Tuan Vu Ho and Dung Tran and Duc Thanh Chau},
  booktitle={ICASSP},
  year={2025}
}
```

### Contact
- khanhld218@gmail.com
- [![GitHub](https://img.shields.io/badge/github-%23121011.svg?style=for-the-badge&logo=github&logoColor=white)](https://github.com/)
- [![LinkedIn](https://img.shields.io/badge/linkedin-%230077B5.svg?style=for-the-badge&logo=linkedin&logoColor=white)](https://www.linkedin.com/in/khanhld257/)
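As a side note on the decoding flags: `--chunk_size`, `--left_context_size`, and `--right_context_size` bound how much surrounding audio each chunk can attend to during long-form decoding. The sketch below is purely illustrative index arithmetic (it is not the ChunkFormer implementation, and the function name `chunk_attention_spans` is hypothetical); it shows how a sequence of frames could be split into fixed-size chunks, each with a clamped left/right context window:

```python
def chunk_attention_spans(num_frames, chunk_size, left_ctx, right_ctx):
    """Illustrative only: for each chunk of `chunk_size` frames, compute the
    [window_start, window_end) range it may attend to, given left/right
    context sizes clamped to the sequence bounds."""
    spans = []
    for start in range(0, num_frames, chunk_size):
        end = min(start + chunk_size, num_frames)
        window = (max(0, start - left_ctx), min(num_frames, end + right_ctx))
        spans.append({"chunk": (start, end), "window": window})
    return spans

# Example using the flag values from the decode command above:
for span in chunk_attention_spans(num_frames=200, chunk_size=64,
                                  left_ctx=128, right_ctx=128):
    print(span)
```

In the real model, windows like these would be enforced with attention masking rather than explicit slicing; the sketch only conveys the chunk/context geometry behind the three flags.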