---
library_name: nemo
---
# CHiME8 DASR NeMo Baseline Models
- The model files in this repository are the models used in the paper *The CHiME-7 Challenge: System Description and Performance of NeMo Team's DASR System*.
- These models are required to run the CHiME8-DASR NeMo baseline (CHiME8-DASR-Baseline NeMo).
- The VAD, diarization, and ASR models are all built with the NVIDIA NeMo conversational AI toolkit.
## 1. Voice Activity Detection (VAD) Model
`MarbleNet_frame_VAD_chime7_Acrobat.nemo`
- This model is based on the NeMo MarbleNet VAD model.
- For validation, we use a dataset comprising the CHiME-6 development subset as well as 50 hours of simulated audio.
- The simulated data is generated with the NeMo multi-speaker data simulator on the VoxCeleb1 and VoxCeleb2 datasets.
- The multi-speaker data simulation yields a total of 2,000 hours of audio, of which approximately 30% is silence.
- Model training incorporates SpecAugment and additive noise augmentation using the MUSAN noise dataset (a loading sketch follows this list).
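As a quick way to exercise the checkpoint, the following is a minimal sketch assuming it restores as a NeMo frame-level classification model; the class name, forward signature, and two-class post-processing are assumptions based on recent NeMo releases, and the CHiME8-DASR baseline scripts remain the authoritative inference recipe.

```python
import torch
from nemo.collections.asr.models import EncDecFrameClassificationModel

# Assumption: the checkpoint restores as NeMo's frame-level VAD class.
vad_model = EncDecFrameClassificationModel.restore_from(
    "MarbleNet_frame_VAD_chime7_Acrobat.nemo"
)
vad_model.eval()

# Placeholder input: one second of 16 kHz audio.
audio = torch.randn(1, 16000)
audio_len = torch.tensor([16000])
with torch.no_grad():
    logits = vad_model(input_signal=audio, input_signal_length=audio_len)

# Post-processing depends on the checkpoint's class layout; for a
# two-class (non-speech/speech) head, the frame-level speech posterior is:
speech_prob = torch.softmax(logits, dim=-1)[..., 1]
```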
## 2. Speaker Diarization Model: Multi-scale Diarization Decoder (MSDD-v2)
`MSDD_v2_PALO_100ms_intrpl_3scales.nemo`
Our DASR system builds on a speaker diarization module based on the multi-scale diarization decoder (MSDD).
- MSDD Reference: Park et al. (2022)
- The MSDD-v2 speaker diarization system employs a multi-scale embedding approach and uses the TitaNet speaker embedding extractor.
- TitaNet Reference: Koluguri et al. (2022)
- The TitaNet model is included in the MSDD-v2 `.nemo` checkpoint file.
- Unlike the original MSDD, which uses a multi-layer LSTM architecture, we employ a four-layer Transformer architecture with a hidden size of 384 (see the illustrative sketch after this list).
- This neural model generates logit values indicating speaker existence at each frame.
- Our diarization model is trained on approximately 3,000 hours of simulated audio mixtures from the same multi-speaker data simulator used for VAD model training, drawing from the VoxCeleb1, VoxCeleb2, and LibriSpeech datasets.
- LibriSpeech Reference: Panayotov et al. (2015); available via OpenSLR download.
- MUSAN is also used to add background noise, focusing on music and broadband noise.
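The MSDD-v2 decoder ships inside the `.nemo` checkpoint rather than as a standalone public class, so the following is only an illustrative PyTorch sketch of the architecture described above: a four-layer Transformer with hidden size 384 that emits per-frame speaker-existence logits. The input feature dimension, head count, and maximum speaker count are assumptions.

```python
import torch
import torch.nn as nn

class SpeakerExistenceDecoder(nn.Module):
    """Illustrative stand-in for the MSDD-v2 Transformer decoder."""

    def __init__(self, feat_dim=192, hidden=384, num_layers=4,
                 num_heads=8, max_speakers=4):
        super().__init__()
        self.proj = nn.Linear(feat_dim, hidden)
        layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=num_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(hidden, max_speakers)

    def forward(self, feats):                # feats: [batch, frames, feat_dim]
        x = self.encoder(self.proj(feats))   # [batch, frames, hidden]
        return self.head(x)                  # logits: [batch, frames, speakers]

decoder = SpeakerExistenceDecoder()
logits = decoder(torch.randn(2, 100, 192))   # sigmoid(logits) -> existence probs
```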
## 3. Automatic Speech Recognition (ASR) Model
`FastConformerXL-RNNT-chime7-GSS-finetuned.nemo`
- This ASR model is based on the NeMo FastConformer XL model.
- Single-channel audio produced by the multi-channel front-end (Guided Source Separation, GSS) is transcribed with a 0.6B-parameter Conformer-based transducer (RNNT) model.
- Model Reference: Gulati et al. (2020)
- The model was initialized from a publicly available NeMo checkpoint.
- NeMo Checkpoint: NGC Model Card: Conformer Transducer XL
- This model was then fine-tuned on the CHiME-7 train and dev sets (including the CHiME-6 and Mixer 6 training subsets) after processing the data through the multi-channel ASR front-end with ground-truth diarization (a transcription sketch follows this list).
- Fine-Tuning Details:
  - Duration: 35,000 updates
  - Batch size: 128
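A minimal usage sketch, assuming the checkpoint restores as a standard NeMo RNNT BPE model; the audio path below is a placeholder for a GSS-enhanced segment.

```python
from nemo.collections.asr.models import EncDecRNNTBPEModel

asr_model = EncDecRNNTBPEModel.restore_from(
    "FastConformerXL-RNNT-chime7-GSS-finetuned.nemo"
)
asr_model.eval()

# Transcribe a GSS-enhanced single-channel segment (placeholder path).
result = asr_model.transcribe(["gss_enhanced_segment.wav"])
# Depending on the NeMo version, RNNT transcribe() returns a list of best
# hypotheses or a (best, n-best) tuple.
print(result[0])
```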
## 4. Language Model for ASR Decoding: KenLM Model
`ASR_LM_chime7_only.kenlm`
- This KenLM model is trained solely on the CHiME7-DASR datasets (Mixer 6, CHiME-6, DiPCo).
- We apply a word-piece-level N-gram language model built on byte-pair-encoding (BPE) tokens.
- This approach uses the SentencePiece and KenLM toolkits and is based on the transcriptions of the CHiME-7 train and dev sets (see the LM training sketch after this list).
- SentencePiece: Kudo and Richardson (2018)
- KenLM: KenLM Git repository
- The token sets of our ASR and LM models were matched to ensure consistency.
- To combine several N-gram models with equal weights, we used the OpenGrm library.
- OpenGrm: Roark et al. (2012)
- We employ modified adaptive expansion search (MAES) decoding for the transducer, which accelerates the decoding process (see the decoding sketch after this list).
- MAES Decoding: Kim et al. (2020)
- As expected, integrating the beam-search decoder with the language model significantly improves the performance of the end-to-end model compared with pure end-to-end decoding.
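The sketch below illustrates the word-piece N-gram recipe described above: transcripts are encoded into BPE tokens with the ASR model's SentencePiece tokenizer, and KenLM is trained on the resulting token stream. All paths and the N-gram order are assumptions; NeMo ships a maintained version of this recipe in `scripts/asr_language_modeling/ngram_lm/train_kenlm.py`.

```python
import subprocess
import sentencepiece as spm

# Assumption: the tokenizer model matches the ASR model's BPE vocabulary.
sp = spm.SentencePieceProcessor(model_file="asr_tokenizer.model")

# Encode each transcript line as space-separated word pieces so that KenLM
# treats BPE tokens as "words".
with open("chime7_train_dev_transcripts.txt") as fin, \
     open("tokens.txt", "w") as fout:
    for line in fin:
        fout.write(" ".join(sp.encode(line.strip(), out_type=str)) + "\n")

# Train a 4-gram LM with the KenLM CLI (the order is an assumption).
with open("tokens.txt") as fin, open("ASR_LM_chime7_only.arpa", "w") as fout:
    subprocess.run(["lmplz", "-o", "4"], stdin=fin, stdout=fout, check=True)
# KenLM's build_binary tool can then convert the ARPA file to a binary model.
```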
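And a sketch of enabling MAES beam search with the KenLM model at decoding time. The exact config fields and weights vary across NeMo versions; treat the values below as placeholders and compare with the CHiME8-DASR baseline configs.

```python
from copy import deepcopy

from nemo.collections.asr.models import EncDecRNNTBPEModel

asr_model = EncDecRNNTBPEModel.restore_from(
    "FastConformerXL-RNNT-chime7-GSS-finetuned.nemo"
)

decoding_cfg = deepcopy(asr_model.cfg.decoding)
decoding_cfg.strategy = "maes"
decoding_cfg.beam.beam_size = 4                                # placeholder
decoding_cfg.beam.ngram_lm_model = "ASR_LM_chime7_only.kenlm"  # word-piece LM
decoding_cfg.beam.ngram_lm_alpha = 0.3                         # placeholder weight
asr_model.change_decoding_strategy(decoding_cfg)
```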