metadata

language: ja
library_name: transformers
license: apache-2.0
pipeline_tag: automatic-speech-recognition
tags:
  - audio
  - automatic-speech-recognition
  - hf-asr-leaderboard
widget:
  - example_title: Sample 1
    src: >-
      https://huggingface.co/kotoba-tech/kotoba-whisper-v2.2/resolve/main/sample_audio/sample_diarization_japanese.mp3

Kotoba-Whisper-v2.2

Kotoba-Whisper-v2.2 is a Japanese ASR model based on kotoba-tech/kotoba-whisper-v2.0, with additional postprocessing stacks integrated as a pipeline. The new features include (i) speaker diarization with diarizers and (ii) punctuation restoration with punctuators. The pipeline was developed in collaboration between Asahi Ushio and Kotoba Technologies.

Transformers Usage

Kotoba-Whisper-v2.2 is supported in the Hugging Face 🤗 Transformers library from version 4.39 onwards. To run the model, first install the latest version of Transformers together with the diarization and punctuation dependencies.

pip install --upgrade pip
pip install --upgrade transformers accelerate torchaudio
pip install "punctuators==0.0.5"
pip install "pyannote.audio"
pip install git+https://github.com/huggingface/diarizers.git
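
Since the pipeline needs Transformers 4.39 or later, it can help to check the installed version before loading the model. The following is a minimal sketch, not part of the library: the helper names `meets_minimum` and `check_transformers` are hypothetical, and the comparison only looks at the major and minor components of the version string.

```python
import re
from importlib import metadata


def meets_minimum(version: str, minimum: tuple = (4, 39)) -> bool:
    """Return True if the major.minor part of `version` is at least `minimum`."""
    # Hypothetical helper: only the leading "major.minor" digits are compared,
    # so a string such as "4.40.0.dev0" is treated as (4, 40).
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        return False
    return (int(match.group(1)), int(match.group(2))) >= minimum


def check_transformers() -> bool:
    """Look up the installed transformers version, if the package is present."""
    try:
        return meets_minimum(metadata.version("transformers"))
    except metadata.PackageNotFoundError:
        return False


if __name__ == "__main__":
    print("transformers >= 4.39:", check_transformers())
```

If the check prints `False`, rerunning the `pip install --upgrade transformers` step above is the likely fix.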

To load pre-trained diarization models from the Hub, you'll first need to accept the terms-of-use for the following two models:

And subsequently use a Hugging Face authentication token to log in with:

huggingface-cli login

Transcription with Diarization

The model can be used with the pipeline.

Download an audio sample.

wget https://huggingface.co/kotoba-tech/kotoba-whisper-v2.2/resolve/main/sample_audio/sample_diarization_japanese.mp3

Run the model via pipeline.

import torch
from transformers import pipeline

# config
model_id = "kotoba-tech/kotoba-whisper-v2.2"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
generate_kwargs = {"language": "ja", "task": "transcribe"}

# load model
pipe = pipeline(
    model=model_id,
    torch_dtype=torch_dtype,
    device=device,
    model_kwargs=model_kwargs,
    chunk_length_s=15,
    batch_size=16,
    trust_remote_code=True,
    punctuator=False,
    return_unique_speaker=True
)

# run inference
result = pipe("sample_diarization_japanese.mp3", generate_kwargs=generate_kwargs)
print(result)
>>>
{'chunks': [{'speaker': ['SPEAKER_02'],
             'text': 'そうですねこれも先ほどがずっと言っている自分の感覚的には大丈夫ですけれども',
             'timestamp': (0.0, 5.0)},
            {'speaker': ['SPEAKER_02'],
             'text': '今は屋外の気温',
             'timestamp': (5.0, 7.6)},
            {'speaker': ['SPEAKER_02'],
             'text': '昼も夜も上がってますので空気の入れ替えだけでは',
             'timestamp': (7.6, 11.72)},
            {'speaker': ['SPEAKER_02'],
             'text': 'かえって人が上がってきます',
             'timestamp': (11.72, 13.54)},
            {'speaker': ['SPEAKER_02'],
             'text': 'やっぱり愚直にやっぱりその街の良さをアピールしていくっていう',
             'timestamp': (13.54, 17.24)},
            {'speaker': ['SPEAKER_00'],
             'text': 'そういう姿勢が基本にあった上だのこういうPR作戦だと思うんです',
             'timestamp': (17.24, 23.84)}],
 'chunks/SPEAKER_00': [{'speaker': ['SPEAKER_00'],
                        'text': 'そういう姿勢が基本にあった上だのこういうPR作戦だと思うんです',
                        'timestamp': (17.24, 23.84)}],
 'chunks/SPEAKER_02': [{'speaker': ['SPEAKER_02'],
                        'text': 'そうですねこれも先ほどがずっと言っている自分の感覚的には大丈夫ですけれども',
                        'timestamp': (0.0, 5.0)},
                       {'speaker': ['SPEAKER_02'],
                        'text': '今は屋外の気温',
                        'timestamp': (5.0, 7.6)},
                       {'speaker': ['SPEAKER_02'],
                        'text': '昼も夜も上がってますので空気の入れ替えだけでは',
                        'timestamp': (7.6, 11.72)},
                       {'speaker': ['SPEAKER_02'],
                        'text': 'かえって人が上がってきます',
                        'timestamp': (11.72, 13.54)},
                       {'speaker': ['SPEAKER_02'],
                        'text': 'やっぱり愚直にやっぱりその街の良さをアピールしていくっていう',
                        'timestamp': (13.54, 17.24)}],
 'speakers': ['SPEAKER_00', 'SPEAKER_02'],
 'text': 'そうですねこれも先ほどがずっと言っている自分の感覚的には大丈夫ですけれども今は屋外の気温昼も夜も上がってますので空気の入れ替えだけではかえって人が上がってきますやっぱり愚直にやっぱりその街の良さをアピールしていくっていうそういう姿勢が基本にあった上だのこういうPR作戦だと思うんです',
 'text/SPEAKER_00': 'そういう姿勢が基本にあった上だのこういうPR作戦だと思うんです',
 'text/SPEAKER_02': 'そうですねこれも先ほどがずっと言っている自分の感覚的には大丈夫ですけれども今は屋外の気温昼も夜も上がってますので空気の入れ替えだけではかえって人が上がってきますやっぱり愚直にやっぱりその街の良さをアピールしていくっていう'}
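
The `chunks` list in the result can be turned into a readable, time-stamped transcript. The sketch below assumes the result structure shown above; the helper name `format_transcript` is hypothetical, and `sample` is an abbreviated copy of the printed output rather than a fresh model run.

```python
def format_transcript(result: dict) -> list:
    """Render each chunk as '[start-end] SPEAKER: text'."""
    lines = []
    for chunk in result["chunks"]:
        start, end = chunk["timestamp"]
        # Assumption: the end time of a trailing chunk may be missing (None).
        end_label = f"{end:.2f}" if end is not None else "?"
        # The 'speaker' field is a list, so join the labels it contains.
        speakers = "+".join(chunk["speaker"])
        lines.append(f"[{start:.2f}-{end_label}] {speakers}: {chunk['text']}")
    return lines


# Abbreviated copy of the output shown above.
sample = {
    "chunks": [
        {"speaker": ["SPEAKER_02"], "text": "今は屋外の気温", "timestamp": (5.0, 7.6)},
        {"speaker": ["SPEAKER_02"], "text": "かえって人が上がってきます", "timestamp": (11.72, 13.54)},
    ]
}

for line in format_transcript(sample):
    print(line)
```

The same loop works on the per-speaker lists by wrapping them first, for example `format_transcript({"chunks": result["chunks/SPEAKER_00"]})`.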

To activate punctuator:

-     punctuator=False,
+     punctuator=True,

To allow more than a single speaker per chunk:

-     return_unique_speaker=True
+     return_unique_speaker=False
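
Downstream code should not assume exactly one label per chunk. The `speaker` field is a list, and with `return_unique_speaker=False` it may hold several labels (this reading is an assumption based on the list-valued field in the output above). The sketch below sums speaking time per speaker and works under either setting; `speaking_time` is a hypothetical helper, the chunks are made-up illustration data rather than model output, and crediting the full chunk duration to every listed speaker is a simplification.

```python
from collections import defaultdict


def speaking_time(chunks: list) -> dict:
    """Sum chunk durations per speaker label, in seconds."""
    totals = defaultdict(float)
    for chunk in chunks:
        start, end = chunk["timestamp"]
        # Skip chunks with an incomplete timestamp.
        if start is None or end is None:
            continue
        # Simplification: every listed speaker is credited the full duration.
        for speaker in chunk["speaker"]:
            totals[speaker] += end - start
    return dict(totals)


# Made-up illustration data, not model output.
chunks = [
    {"speaker": ["SPEAKER_02"], "text": "(omitted)", "timestamp": (0.0, 5.0)},
    {"speaker": ["SPEAKER_00", "SPEAKER_02"], "text": "(omitted)", "timestamp": (5.0, 7.5)},
]
print(speaking_time(chunks))
```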

Flash Attention 2

We recommend using Flash Attention 2 if your GPU supports it. To do so, first install Flash Attention:

pip install flash-attn --no-build-isolation

Then set attn_implementation="flash_attention_2" in model_kwargs, which the pipeline forwards to from_pretrained:

- model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
+ model_kwargs = {"attn_implementation": "flash_attention_2"} if torch.cuda.is_available() else {}
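
To keep one script working on machines with and without Flash Attention, the attention backend can be chosen at runtime. This is a minimal sketch: `pick_model_kwargs` is a hypothetical helper, and it only checks that the `flash_attn` module is importable, which does not guarantee that the GPU itself supports Flash Attention 2.

```python
import importlib.util


def pick_model_kwargs(cuda_available: bool) -> dict:
    """Choose model_kwargs for the pipeline based on the environment."""
    if not cuda_available:
        # On CPU, fall back to the default attention implementation.
        return {}
    # The flash-attn package installs a module named "flash_attn".
    if importlib.util.find_spec("flash_attn") is not None:
        return {"attn_implementation": "flash_attention_2"}
    return {"attn_implementation": "sdpa"}


print(pick_model_kwargs(cuda_available=False))  # prints {}
```

In the pipeline script above, this would be called as `model_kwargs = pick_model_kwargs(torch.cuda.is_available())`.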

Acknowledgements

OpenAI for the Whisper model.
Hugging Face 🤗 Transformers for the model integration.
Hugging Face 🤗 for the Distil-Whisper codebase.
Reazon Human Interaction Lab for the ReazonSpeech dataset.