---
language: ja
library_name: transformers
license: apache-2.0
pipeline_tag: automatic-speech-recognition
tags:
- audio
- automatic-speech-recognition
- hf-asr-leaderboard
widget:
- example_title: Sample 1
  src: >-
    https://huggingface.co/kotoba-tech/kotoba-whisper-v2.2/resolve/main/sample_audio/sample_diarization_japanese.mp3
---
# Kotoba-Whisper-v2.2
Kotoba-Whisper-v2.2 is a Japanese ASR model based on kotoba-tech/kotoba-whisper-v2.0, with additional post-processing stacks integrated as a Transformers `pipeline`. The new features include (i) speaker diarization with diarizers and (ii) adding punctuation with punctuators. The pipeline has been developed through a collaboration between Asahi Ushio and Kotoba Technologies.
## Transformers Usage
Kotoba-Whisper-v2.2 is supported in the Hugging Face 🤗 Transformers library from version 4.39 onwards. To run the model, first install the latest version of Transformers.
```bash
pip install --upgrade pip
pip install --upgrade transformers accelerate torchaudio
pip install "punctuators==0.0.5"
pip install "pyannote.audio"
pip install git+https://github.com/huggingface/diarizers.git
```
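As a quick sanity check that the installed Transformers release meets the 4.39 requirement noted above, you can print and compare the version (a minimal, optional check, not part of the official example):

```python
import transformers
from packaging import version

# Kotoba-Whisper-v2.2 requires transformers >= 4.39 (see above).
assert version.parse(transformers.__version__) >= version.parse("4.39.0"), transformers.__version__
print("transformers", transformers.__version__)
```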
To load pre-trained diarization models from the Hub, you'll first need to accept the terms-of-use for the following two models:
And subsequently use a Hugging Face authentication token to log in with:
```bash
huggingface-cli login
```
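Alternatively, you can authenticate from Python with the `huggingface_hub` login helper; the snippet below is a minimal sketch that assumes your access token is stored in an `HF_TOKEN` environment variable:

```python
import os

from huggingface_hub import login

# Authenticate with a token created at https://huggingface.co/settings/tokens.
# Storing it in the HF_TOKEN environment variable is an assumption of this example.
login(token=os.environ["HF_TOKEN"])
```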
### Transcription with Diarization
The model can be used with the `pipeline` class as follows.
- Download an audio sample.
```bash
wget https://huggingface.co/kotoba-tech/kotoba-whisper-v2.2/resolve/main/sample_audio/sample_diarization_japanese.mp3
```
- Run the model via pipeline.
```python
import torch
from transformers import pipeline
# config
model_id = "kotoba-tech/kotoba-whisper-v2.2"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
generate_kwargs = {"language": "ja", "task": "transcribe"}
# load model
pipe = pipeline(
model=model_id,
torch_dtype=torch_dtype,
device=device,
model_kwargs=model_kwargs,
chunk_length_s=15,
batch_size=16,
trust_remote_code=True,
punctuator=False,
return_unique_speaker=True
)
# run inference
result = pipe("sample_diarization_japanese.mp3", generate_kwargs=generate_kwargs)
print(result)
```

which prints a result of the following form:

```python
{'chunks': [{'speaker': ['SPEAKER_02'],
'text': 'そうですねこれも先ほどがずっと言っている自分の感覚的には大丈夫ですけれども',
'timestamp': (0.0, 5.0)},
{'speaker': ['SPEAKER_02'],
'text': '今は屋外の気温',
'timestamp': (5.0, 7.6)},
{'speaker': ['SPEAKER_02'],
'text': '昼も夜も上がってますので空気の入れ替えだけでは',
'timestamp': (7.6, 11.72)},
{'speaker': ['SPEAKER_02'],
'text': 'かえって人が上がってきます',
'timestamp': (11.72, 13.54)},
{'speaker': ['SPEAKER_02'],
'text': 'やっぱり愚直にやっぱりその街の良さをアピールしていくっていう',
'timestamp': (13.54, 17.24)},
{'speaker': ['SPEAKER_00'],
'text': 'そういう姿勢が基本にあった上だのこういうPR作戦だと思うんです',
'timestamp': (17.24, 23.84)}],
'chunks/SPEAKER_00': [{'speaker': ['SPEAKER_00'],
'text': 'そういう姿勢が基本にあった上だのこういうPR作戦だと思うんです',
'timestamp': (17.24, 23.84)}],
'chunks/SPEAKER_02': [{'speaker': ['SPEAKER_02'],
'text': 'そうですねこれも先ほどがずっと言っている自分の感覚的には大丈夫ですけれども',
'timestamp': (0.0, 5.0)},
{'speaker': ['SPEAKER_02'],
'text': '今は屋外の気温',
'timestamp': (5.0, 7.6)},
{'speaker': ['SPEAKER_02'],
'text': '昼も夜も上がってますので空気の入れ替えだけでは',
'timestamp': (7.6, 11.72)},
{'speaker': ['SPEAKER_02'],
'text': 'かえって人が上がってきます',
'timestamp': (11.72, 13.54)},
{'speaker': ['SPEAKER_02'],
'text': 'やっぱり愚直にやっぱりその街の良さをアピールしていくっていう',
'timestamp': (13.54, 17.24)}],
'speakers': ['SPEAKER_00', 'SPEAKER_02'],
'text': 'そうですねこれも先ほどがずっと言っている自分の感覚的には大丈夫ですけれども今は屋外の気温昼も夜も上がってますので空気の入れ替えだけではかえって人が上がってきますやっぱり愚直にやっぱりその街の良さをアピールしていくっていうそういう姿勢が基本にあった上だのこういうPR作戦だと思うんです',
'text/SPEAKER_00': 'そういう姿勢が基本にあった上だのこういうPR作戦だと思うんです',
 'text/SPEAKER_02': 'そうですねこれも先ほどがずっと言っている自分の感覚的には大丈夫ですけれども今は屋外の気温昼も夜も上がってますので空気の入れ替えだけではかえって人が上がってきますやっぱり愚直にやっぱりその街の良さをアピールしていくっていう'}
```
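The result is a dictionary containing the per-chunk transcriptions with speaker labels and timestamps (`chunks`), per-speaker views (`chunks/SPEAKER_XX`, `text/SPEAKER_XX`), the list of detected speakers, and the concatenated text. As a small illustration (not part of the official example), you can turn it into a speaker-labelled transcript like this:

```python
# Print one line per chunk: [start - end] SPEAKER: text
for chunk in result["chunks"]:
    start, end = chunk["timestamp"]
    speakers = ", ".join(chunk["speaker"])
    print(f"[{start:6.2f}s - {end:6.2f}s] {speakers}: {chunk['text']}")
```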
- To activate the punctuator:

```diff
-     punctuator=False,
+     punctuator=True,
```
- To include more than a single speaker:

```diff
-     return_unique_speaker=True
+     return_unique_speaker=False
```
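Putting both options together, a pipeline instantiated with the punctuator enabled and without collapsing each chunk to a single speaker would look as follows (a sketch that reuses the variables defined in the example above):

```python
# Same setup as above, but with punctuation restoration enabled and
# multiple speakers allowed per chunk.
pipe = pipeline(
    model=model_id,
    torch_dtype=torch_dtype,
    device=device,
    model_kwargs=model_kwargs,
    chunk_length_s=15,
    batch_size=16,
    trust_remote_code=True,
    punctuator=True,
    return_unique_speaker=False,
)
result = pipe("sample_diarization_japanese.mp3", generate_kwargs=generate_kwargs)
```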
### Flash Attention 2
We recommend using Flash-Attention 2 if your GPU allows for it. To do so, you first need to install Flash Attention:
```bash
pip install flash-attn --no-build-isolation
```
Then pass `attn_implementation="flash_attention_2"` to `from_pretrained`:
```diff
- model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
+ model_kwargs = {"attn_implementation": "flash_attention_2"} if torch.cuda.is_available() else {}
```
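If a script has to run on machines where flash-attn may or may not be installed, one possible pattern (an illustrative sketch, not part of the official example) is to fall back to SDPA automatically:

```python
import importlib.util

import torch

# Use Flash-Attention 2 when the flash_attn package is importable, otherwise fall back to SDPA.
attn_implementation = "flash_attention_2" if importlib.util.find_spec("flash_attn") else "sdpa"
model_kwargs = {"attn_implementation": attn_implementation} if torch.cuda.is_available() else {}
```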
## Acknowledgements
- OpenAI for the Whisper model.
- Hugging Face 🤗 Transformers for the model integration.
- Hugging Face 🤗 for the Distil-Whisper codebase.
- Reazon Human Interaction Lab for the ReazonSpeech dataset.