A Japanese transcription/diarization pipeline built on pyannote and Whisper large-v2, using a custom-tuned segmentation model and custom audio filtering (low-pass filter, equalizer, etc.) for improved performance. It accepts a video file or an mp3/wav file.
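
As a rough illustration of the filtering step, here is a minimal sketch using ffmpeg's `lowpass` and `equalizer` audio filters via `subprocess`. The cutoff frequency and EQ band below are illustrative assumptions, not the pipeline's actual values:

```python
import subprocess

def prefilter_audio(src: str, dst: str = "filtered.wav") -> str:
    """Extract/convert audio and apply a low-pass filter plus an equalizer
    boost. All filter values below are illustrative assumptions."""
    audio_filters = ",".join([
        "lowpass=f=4000",                # assumed cutoff; attenuates HF noise
        "equalizer=f=1500:t=q:w=1:g=3",  # assumed +3 dB boost near the speech band
    ])
    subprocess.run(
        ["ffmpeg", "-y", "-i", src,      # src may be a video or an mp3/wav file
         "-af", audio_filters,
         "-ac", "1", "-ar", "16000",     # mono 16 kHz, the rate Whisper expects
         dst],
        check=True,
    )
    return dst
```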

Performance is considerably better than default Japanese Whisper for most Japanese content, with the exception of singing/karaoke, where performance falls below the original due to the training dataset.

Requires ffmpeg, openai-whisper, pyannote, and facebookresearch's Demucs model. CUDA is also strongly encouraged. pyannote requires a Hugging Face API key, which the pipeline currently looks for under the environment variable "HF_TOKEN_NOT_LOGIN" (at the time of this writing, naming your HF token "HF_TOKEN" causes bugs).
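
A hedged sketch of how the diarization and transcription stages fit together, assuming the stock `pyannote/speaker-diarization` pipeline and the `whisper` Python API; the project itself swaps in a custom-tuned segmentation model (and runs Demucs separation, omitted here), so treat this as the general shape only. The midpoint-based speaker merge is my own simplification:

```python
import os

import whisper
from pyannote.audio import Pipeline

# pyannote reads the Hugging Face token from HF_TOKEN_NOT_LOGIN, per the note above.
hf_token = os.environ["HF_TOKEN_NOT_LOGIN"]

# Stock diarization pipeline shown as a stand-in for the custom-tuned one.
diarizer = Pipeline.from_pretrained(
    "pyannote/speaker-diarization", use_auth_token=hf_token
)
asr = whisper.load_model("large-v2")  # CUDA strongly encouraged

audio_path = "filtered.wav"  # output of the filtering step above
diarization = diarizer(audio_path)
result = asr.transcribe(audio_path, language="ja")

def speaker_at(t: float) -> str:
    """Return the speaker whose diarized turn contains time t (seconds)."""
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        if turn.start <= t <= turn.end:
            return speaker
    return "UNKNOWN"

# Naive merge: label each Whisper segment by the speaker at its midpoint.
for seg in result["segments"]:
    mid = (seg["start"] + seg["end"]) / 2
    print(f'[{speaker_at(mid)}] {seg["text"].strip()}')
```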

Originally intended as a solo project, but I'm uploading it here in the hopes it will be useful to practitioners. If you're doing work in this space, please feel free to reach out.
