Hibiki ASR Phonemizer

This model is a phoneme-level speech recognition network, originally fine-tuned from openai/whisper-large-v3 on a mixture of different Japanese datasets.

In addition to transcribing speech, it can:

detect and transcribe non-speech sounds such as gasps, erotic moans, laughter, etc.
add punctuation more faithfully.

A grapheme decoder head (i.e., one that outputs normal Japanese text) will probably be trained as well, though going directly from audio to phonemes will result in a more accurate representation for Japanese.

Don't use this model without the post-processing functions I wrote (see the notebook), or you'll get less-than-ideal performance.

How to use

Check here -> Notebook
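
The notebook remains the reference for actual usage. For orientation only, here is a minimal sketch of fetching raw output, assuming the checkpoint loads through the stock Transformers speech recognition pipeline like any other Whisper fine-tune (not confirmed by this card). The function name `transcribe_raw` is hypothetical, and the post-processing functions from the notebook are deliberately not reproduced here, so the returned string is the unprocessed model output.

```python
MODEL_ID = "Respair/Hibiki_ASR_Phonemizer_v0.2"  # repository name as given in this card


def transcribe_raw(audio_path: str, device: str = "cuda:0") -> str:
    """Return the raw, un-post-processed model output for one audio file.

    Assumption: the checkpoint is loadable by the standard ASR pipeline.
    Hypothetical helper, not the author's code.
    """
    # Imported lazily so this module can be loaded without the heavy dependency.
    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model=MODEL_ID,
        device=device,
    )
    # The pipeline returns a dict whose "text" entry holds the decoded string.
    return asr(audio_path)["text"]
```

Run the notebook's post-processing on the returned string before using it for anything; the raw output alone is what the card warns against.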

Intended uses & limitations

No restrictions are imposed by me, but proceed at your own risk. The user (you) is entirely responsible for your own actions.

Training and evaluation data

Japanese Common Voice 17
ehehe Corpus
Custom Game and Anime dataset (around 8 hours)

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 1e-05
train_batch_size: 24
eval_batch_size: 8
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_steps: 500
training_steps: 5000
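
To make the schedule concrete, here is a small sketch of the learning rate implied by the values above, assuming the usual linear schedule with warmup: a linear ramp from 0 to the peak over the warmup steps, then a linear decay to 0 at the final step. The function name `linear_lr` is made up for illustration; this is not the training script.

```python
BASE_LR = 1e-5        # learning_rate
WARMUP_STEPS = 500    # lr_scheduler_warmup_steps
TOTAL_STEPS = 5000    # training_steps


def linear_lr(step: int) -> float:
    """Learning rate at a given optimizer step under linear warmup and decay."""
    if step < WARMUP_STEPS:
        # Warmup: ramp linearly from 0 up to BASE_LR.
        return BASE_LR * step / max(1, WARMUP_STEPS)
    # Decay: fall linearly from BASE_LR to 0 at TOTAL_STEPS.
    remaining = max(0, TOTAL_STEPS - step)
    return BASE_LR * remaining / max(1, TOTAL_STEPS - WARMUP_STEPS)


if __name__ == "__main__":
    for step in (0, 250, 500, 2750, 5000):
        print(step, linear_lr(step))
```

Under these assumptions the rate peaks at step 500 and is back to half the peak at step 2750, the midpoint of the decay phase.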

Compute and Duration

1x A100 (40 GB)
64 GB RAM
BF16 precision
14 hours
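
As a rough sanity check, the figures above can be combined with the training hyperparameters to estimate throughput. This is back-of-the-envelope arithmetic only: it assumes no gradient accumulation (none is listed) and that the 14 hours cover the whole 5000-step run, evaluation included.

```python
TRAINING_STEPS = 5000     # training_steps
TRAIN_BATCH_SIZE = 24     # train_batch_size
WALL_CLOCK_HOURS = 14     # reported duration

wall_clock_seconds = WALL_CLOCK_HOURS * 3600

# Total training samples processed (assumes no gradient accumulation).
samples_seen = TRAINING_STEPS * TRAIN_BATCH_SIZE

# Average wall-clock cost of one optimizer step, evaluation time included.
seconds_per_step = wall_clock_seconds / TRAINING_STEPS

# Average number of samples processed per second over the whole run.
samples_per_second = samples_seen / wall_clock_seconds

if __name__ == "__main__":
    print(samples_seen, seconds_per_step, round(samples_per_second, 2))
```

That works out to 120,000 samples seen at roughly 10 seconds per step, or a bit over 2 samples per second on the single A100.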

Framework versions

Transformers 4.41.1
PyTorch 2.4.0+cu121
Datasets 2.19.1
Tokenizers 0.19.1

Respair/Hibiki_ASR_Phonemizer_v0.2