---
language: vi
datasets:
- legacy-datasets/common_voice
- vlsp2020_vinai_100h
- AILAB-VNUHCM/vivos
- doof-ferb/vlsp2020_vinai_100h
- doof-ferb/fpt_fosd
- doof-ferb/infore1_25hours
- linhtran92/viet_bud500
- doof-ferb/LSVSC
- doof-ferb/vais1000
- doof-ferb/VietMed_labeled
- NhutP/VSV-1100
- doof-ferb/Speech-MASSIVE_vie
- doof-ferb/BibleMMS_vie
- capleaf/viVoice
metrics:
- wer
pipeline_tag: automatic-speech-recognition
tags:
- transcription
- audio
- speech
- chunkformer
- asr
- automatic-speech-recognition
- long-form
license: cc-by-nc-4.0
model-index:
- name: ChunkFormer Large Vietnamese
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: common-voice-vietnamese
      type: common_voice
      args: vi
    metrics:
    - name: Test WER
      type: wer
      value: x
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: VIVOS
      type: vivos
      args: vi
    metrics:
    - name: Test WER
      type: wer
      value: x
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: VLSP - Task 1
      type: vlsp
      args: vi
    metrics:
    - name: Test WER
      type: wer
      value: x
---

# **ChunkFormer-Large-Vie: Large-Scale Pretrained ChunkFormer for Vietnamese Automatic Speech Recognition**

[![License: CC BY-NC 4.0](https://img.shields.io/badge/License-CC%20BY--NC%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc/4.0/)
[![GitHub](https://img.shields.io/badge/GitHub-ChunkFormer-blue)](https://github.com/khanld/chunkformer)
[![Paper](https://img.shields.io/badge/Paper-ICASSP%202025-green)](https://your-paper-link)

### Table of contents
1. [Model Description](#model-description)
2. [Documentation and Implementation](#documentation-and-implementation)
3. [Benchmark Results](#benchmark-results)
4. [Usage](#usage)
5. [Citation](#citation)
6. [Contact](#contact)

---

### Model Description
**ChunkFormer-Large-Vie** is a large-scale Vietnamese Automatic Speech Recognition (ASR) model based on the **ChunkFormer** architecture, introduced at **ICASSP 2025**.
The model was fine-tuned on approximately **2,000 hours** of Vietnamese speech drawn from the diverse public datasets listed above.

### Documentation and Implementation
The [documentation](#) and [implementation](#) of ChunkFormer are publicly available.

### Benchmark Results

| No. | Model       | VIVOS | Common Voice | VLSP - Task 1 | Avg. |
|-----|-------------|-------|--------------|---------------|------|
| 1   | ChunkFormer | x     | x            | x             | x    |
| 2   | PhoWhisper  | x     | x            | x             | x    |
| 3   | X           | x     | x            | x             | x    |
| 4   | Y           | x     | x            | x             | x    |

---

### Usage
To use the ChunkFormer model for Vietnamese Automatic Speech Recognition, follow these steps:

1. **Download the ChunkFormer Repository**

   Clone the ChunkFormer repository and install its dependencies:
   ```bash
   git clone https://github.com/khanld/chunkformer.git
   cd chunkformer
   pip install -r requirements.txt
   ```

2. **Download the Model Checkpoint from Hugging Face**

   Fetch the checkpoint with `git lfs`:
   ```bash
   git lfs install
   git clone https://huggingface.co/khanhld/chunkformer-large-vietnamese
   ```
   This downloads the model checkpoint into the `chunkformer-large-vietnamese` folder inside your `chunkformer` directory.

3.
   **Run the Model**

   Use the following command to transcribe a long audio file:
   ```bash
   python decode.py \
       --model_checkpoint path/to/chunkformer-large-vietnamese \
       --long_form_audio path/to/long_audio.wav \
       --chunk_size 64 \
       --left_context_size 128 \
       --right_context_size 128
   ```

---

### Citation
If you use this work in your research, please cite:

```bibtex
@inproceedings{your_paper,
  title={ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription},
  author={Khanh Le and Tuan Vu Ho and Dung Tran and Duc Thanh Chau},
  booktitle={ICASSP},
  year={2025}
}
```

### Contact
- khanhld218@gmail.com
- [![GitHub](https://img.shields.io/badge/github-%23121011.svg?style=for-the-badge&logo=github&logoColor=white)](https://github.com/)
- [![LinkedIn](https://img.shields.io/badge/linkedin-%230077B5.svg?style=for-the-badge&logo=linkedin&logoColor=white)](https://www.linkedin.com/in/khanhld257/)
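As a side note on the decoding flags: `--chunk_size`, `--left_context_size`, and `--right_context_size` bound how much surrounding audio each chunk can attend to during long-form decoding. The sketch below is purely illustrative index arithmetic (it is not the ChunkFormer implementation, and the function name `chunk_attention_spans` is hypothetical); it shows how a sequence of frames could be split into fixed-size chunks, each with a clamped left/right context window:

```python
def chunk_attention_spans(num_frames, chunk_size, left_ctx, right_ctx):
    """Illustrative only: for each chunk of `chunk_size` frames, compute the
    [window_start, window_end) range it may attend to, given left/right
    context sizes clamped to the sequence bounds."""
    spans = []
    for start in range(0, num_frames, chunk_size):
        end = min(start + chunk_size, num_frames)
        window = (max(0, start - left_ctx), min(num_frames, end + right_ctx))
        spans.append({"chunk": (start, end), "window": window})
    return spans

# Example using the flag values from the decode command above:
for span in chunk_attention_spans(num_frames=200, chunk_size=64,
                                  left_ctx=128, right_ctx=128):
    print(span)
```

In the real model, windows like these would be enforced with attention masking rather than explicit slicing; the sketch only conveys the chunk/context geometry behind the three flags.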