metadata

language: en
datasets:
  - legacy-datasets/common_voice
  - vlsp2020_vinai_100h
  - AILAB-VNUHCM/vivos
  - doof-ferb/vlsp2020_vinai_100h
  - doof-ferb/fpt_fosd
  - doof-ferb/infore1_25hours
  - linhtran92/viet_bud500
  - doof-ferb/LSVSC
  - doof-ferb/vais1000
  - doof-ferb/VietMed_labeled
  - NhutP/VSV-1100
  - doof-ferb/Speech-MASSIVE_vie
  - doof-ferb/BibleMMS_vie
  - capleaf/viVoice
metrics:
  - wer
pipeline_tag: automatic-speech-recognition
tags:
  - transcription
  - audio
  - speech
  - chunkformer
  - asr
  - automatic-speech-recognition
  - long-form
license: cc-by-nc-4.0
model-index:
  - name: ChunkFormer Large Vietnamese
    results:
      - task:
          name: Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: common-voice-vietnamese
          type: common_voice
          args: vi
        metrics:
          - name: Test WER
            type: wer
            value: x
      - task:
          name: Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: VIVOS
          type: vivos
          args: vi
        metrics:
          - name: Test WER
            type: wer
            value: x
      - task:
          name: Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: VLSP - Task 1
          type: vlsp
          args: vi
        metrics:
          - name: Test WER
            type: wer
            value: x

ChunkFormer-Large-Vie: Large-Scale Pretrained ChunkFormer for Vietnamese Automatic Speech Recognition

Model Description
Documentation and Implementation
Benchmark Results
Usage
Citation
Contact

Model Description

ChunkFormer-Large-Vie is a large-scale Vietnamese Automatic Speech Recognition (ASR) model based on the ChunkFormer architecture, which was introduced at ICASSP 2025. The model was fine-tuned on approximately 2,000 hours of Vietnamese speech drawn from a range of datasets.

Documentation and Implementation

The documentation and implementation of ChunkFormer are publicly available in the ChunkFormer GitHub repository; the clone URL is given under Usage.

Benchmark Results

No.	Model	VIVOS	Common Voice	VLSP - Task 1	Avg.
1	ChunkFormer	x	x	x	x
2	PhoWhisper	x	x	x	x
3	X	x	x	x	x
4	Y	x	x	x	x
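
The metadata declares WER as the evaluation metric, so the columns above are presumably word error rates. As a minimal sketch of how such a score is commonly computed (word-level edit distance divided by the number of reference words), assuming whitespace tokenization and no text normalization; the function name `word_error_rate` and the example sentences are illustrative and not part of the ChunkFormer repository:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences only, not taken from any benchmark:
# one of four reference words is missing from the hypothesis.
print(word_error_rate("xin chào các bạn", "xin chào bạn"))
```

Because the count of errors is divided by the reference length, WER can exceed 1.0 when the hypothesis contains many inserted words.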

Usage

To use the ChunkFormer model for Vietnamese Automatic Speech Recognition, follow these steps:

Download the ChunkFormer repository. Clone the repository to your local machine and install its dependencies:

git clone https://github.com/khanld/chunkformer.git
cd chunkformer
pip install -r requirements.txt

Download the model checkpoint from Hugging Face. Use the following Git LFS commands:

git lfs install
git clone https://huggingface.co/khanhld/chunkformer-large-vietnamese

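This clones the model checkpoint into a chunkformer-large-vietnamese folder in the current directory. Run the commands from inside your chunkformer directory, or note where the folder ends up, so that you can pass its path to --model_checkpoint in the next step.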

Run the model. Use the following command to transcribe a long audio file:

python decode.py \
    --model_checkpoint path/to/chunkformer-large-vietnamese \
    --long_form_audio path/to/long_audio.wav \
    --chunk_size 64 \
    --left_context_size 128 \
    --right_context_size 128
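
To transcribe several recordings, the command above can be wrapped in a short script. The sketch below only assembles the same decode.py invocation, using the flags shown above, and runs it once per file. The helper names `build_decode_command` and `transcribe_all`, the directory layout, and the assumptions that decode.py is run from the repository root and handles one file per call are illustrative and not taken from the ChunkFormer repository.

```python
import subprocess
import sys
from pathlib import Path


def build_decode_command(checkpoint_dir, audio_path, chunk_size=64,
                         left_context_size=128, right_context_size=128):
    """Return the decode.py command line as a list of arguments."""
    return [
        sys.executable, "decode.py",
        "--model_checkpoint", str(checkpoint_dir),
        "--long_form_audio", str(audio_path),
        "--chunk_size", str(chunk_size),
        "--left_context_size", str(left_context_size),
        "--right_context_size", str(right_context_size),
    ]


def transcribe_all(checkpoint_dir, audio_dir):
    """Run decode.py once per .wav file in audio_dir (hypothetical layout)."""
    for audio_path in sorted(Path(audio_dir).glob("*.wav")):
        command = build_decode_command(checkpoint_dir, audio_path)
        # check=True raises CalledProcessError if decode.py exits with an error.
        subprocess.run(command, check=True)


# Show the command that would be run for a single file.
print(" ".join(build_decode_command("path/to/chunkformer-large-vietnamese",
                                    "path/to/long_audio.wav")))
```

Calling `transcribe_all("path/to/chunkformer-large-vietnamese", "path/to/audio_dir")` from the chunkformer directory would then process every .wav file in that folder in sorted order.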

Citation

If you use this work in your research, please cite:

@inproceedings{le2025chunkformer,
  title={ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription},
  author={Khanh Le and Tuan Vu Ho and Dung Tran and Duc Thanh Chau},
  booktitle={ICASSP},
  year={2025}
}

Contact

khanhld218@gmail.com
