Proper initialization of tokenizer for rinna/japanese-wav2vec2-base
Hello,
I'm trying to use the rinna/japanese-wav2vec2-base model for Japanese speech recognition, but I'm encountering issues with initializing the tokenizer. I've tried several approaches, including:
- Using Wav2Vec2Processor.from_pretrained("rinna/japanese-wav2vec2-base")
- Using AutoProcessor.from_pretrained("rinna/japanese-wav2vec2-base")
- Initializing Wav2Vec2CTCTokenizer separately
- Creating a custom vocabulary and initializing the tokenizer with a temporary file
While the last method worked, I'm not sure if it's the correct approach.
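For context, the temporary-file approach that did work looks roughly like this (the vocabulary below is just a small placeholder, not my real one):

```python
import json
import tempfile

from transformers import Wav2Vec2CTCTokenizer

# Minimal placeholder vocabulary -- my real one is larger and built by hand.
vocab = {"<pad>": 0, "<unk>": 1, "|": 2, "あ": 3, "い": 4}

with tempfile.NamedTemporaryFile(
    "w", suffix=".json", delete=False, encoding="utf-8"
) as f:
    json.dump(vocab, f, ensure_ascii=False)
    vocab_path = f.name

tokenizer = Wav2Vec2CTCTokenizer(
    vocab_path,
    unk_token="<unk>",
    pad_token="<pad>",
    word_delimiter_token="|",
)
```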
I have a few questions:
- Is there a recommended way to initialize the tokenizer for this specific model?
- Does this model require a custom tokenizer different from the standard Wav2Vec2CTCTokenizer?
- Are there any specific considerations or steps needed when using this model for Japanese speech recognition?
I noticed this model was trained on the reazon-research/reazonspeech dataset. Does this affect how the model or tokenizer should be initialized or used?
Any guidance or best practices for using this model would be greatly appreciated. Thank you in advance for your help!
Hi @hida1211, thanks for your question.
This is a pre-trained model, not a fine-tuned one, and no tokenizer is included in the model repo, so it cannot be used directly for Japanese speech recognition.
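To be clear, the checkpoint itself loads fine without a tokenizer; it just maps raw audio to frame-level speech representations rather than text. A minimal sketch, assuming 16 kHz mono input and that the repo ships a preprocessor config (`waveform` is a placeholder):

```python
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2Model

model = Wav2Vec2Model.from_pretrained("rinna/japanese-wav2vec2-base")
# Assumes a preprocessor config exists in the repo; otherwise construct
# a Wav2Vec2FeatureExtractor with settings matching the pre-training setup.
feature_extractor = AutoFeatureExtractor.from_pretrained("rinna/japanese-wav2vec2-base")

# waveform: a 1-D float array of 16 kHz mono audio (placeholder).
inputs = feature_extractor(waveform, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    features = model(**inputs).last_hidden_state  # (batch, frames, hidden_size)
```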
If you want to use this pre-trained model for Japanese speech recognition, it needs to be fine-tuned on a Japanese corpus first. Fine-tuning can be performed with Transformers or fairseq, as shown below:
- https://huggingface.co/blog/fine-tune-wav2vec2-english
- https://github.com/facebookresearch/fairseq/blob/main/examples/wav2vec/README.md#fine-tune-a-pre-trained-model-with-ctc
Note that we also provide a fairseq checkpoint file in this model repository.
However, these examples are not written with a Japanese corpus in mind, so some modifications may be necessary for Japanese ASR.
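For the Transformers route, the setup from the blog post above adapts to this checkpoint roughly as follows. This is a minimal sketch, not a tested recipe: the vocabulary here is a placeholder (for Japanese you would typically build character-level units from your own transcripts), and the feature-extractor settings are assumptions that should be matched to the pre-training setup:

```python
import json

from transformers import (
    Wav2Vec2CTCTokenizer,
    Wav2Vec2FeatureExtractor,
    Wav2Vec2ForCTC,
    Wav2Vec2Processor,
)

# Placeholder vocabulary -- build the real one from the characters in your corpus.
chars = sorted(set("こんにちはさようなら"))
vocab = {c: i for i, c in enumerate(chars)}
vocab["|"] = len(vocab)      # word delimiter
vocab["<unk>"] = len(vocab)
vocab["<pad>"] = len(vocab)  # also serves as the CTC blank token

with open("vocab.json", "w", encoding="utf-8") as f:
    json.dump(vocab, f, ensure_ascii=False)

tokenizer = Wav2Vec2CTCTokenizer(
    "vocab.json", unk_token="<unk>", pad_token="<pad>", word_delimiter_token="|"
)
feature_extractor = Wav2Vec2FeatureExtractor(
    feature_size=1,
    sampling_rate=16_000,
    padding_value=0.0,
    do_normalize=True,  # assumption: match the normalization used in pre-training
    return_attention_mask=True,
)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

# Load the pre-trained encoder and attach a randomly initialized CTC head
# sized to the new vocabulary (expect a warning about fresh lm_head weights).
model = Wav2Vec2ForCTC.from_pretrained(
    "rinna/japanese-wav2vec2-base",
    vocab_size=len(tokenizer),
    pad_token_id=tokenizer.pad_token_id,
    ctc_loss_reduction="mean",
)
```

From there, training proceeds as in the blog post, with a `Trainer` and a CTC data collator built around the processor.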
Thank you for the detailed explanation. I understand now that the base model was trained only on unlabeled audio, that additional fine-tuning is required before it can perform speech recognition, and that the repo doesn't include a tokenizer. Thanks for the clarification.