Proper initialization of tokenizer for rinna/japanese-wav2vec2-base
Hello,
I'm trying to use the rinna/japanese-wav2vec2-base model for Japanese speech recognition, but I'm encountering issues with initializing the tokenizer. I've tried several approaches, including:
- Using Wav2Vec2Processor.from_pretrained("rinna/japanese-wav2vec2-base")
- Using AutoProcessor.from_pretrained("rinna/japanese-wav2vec2-base")
- Initializing Wav2Vec2CTCTokenizer separately
- Creating a custom vocabulary and initializing the tokenizer with a temporary file
While the last method worked, I'm not sure if it's the correct approach.
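For context, the temporary-file approach that did work looks roughly like this (the vocabulary below is just a small placeholder, not my real one):

```python
import json
import tempfile

from transformers import Wav2Vec2CTCTokenizer

# Minimal placeholder vocabulary -- my real one is larger and built by hand.
vocab = {"<pad>": 0, "<unk>": 1, "|": 2, "あ": 3, "い": 4}

with tempfile.NamedTemporaryFile(
    "w", suffix=".json", delete=False, encoding="utf-8"
) as f:
    json.dump(vocab, f, ensure_ascii=False)
    vocab_path = f.name

tokenizer = Wav2Vec2CTCTokenizer(
    vocab_path,
    unk_token="<unk>",
    pad_token="<pad>",
    word_delimiter_token="|",
)
```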
I have a few questions:
- Is there a recommended way to initialize the tokenizer for this specific model?
- Does this model require a custom tokenizer different from the standard Wav2Vec2CTCTokenizer?
- Are there any specific considerations or steps needed when using this model for Japanese speech recognition?
I noticed this model was trained on the reazon-research/reazonspeech dataset. Does this affect how the model or tokenizer should be initialized or used?
Any guidance or best practices for using this model would be greatly appreciated. Thank you in advance for your help!
Hi @hida1211, thanks for your question.
This is a pre-trained model, not a fine-tuned one, and no tokenizer is included in the model repo, so it cannot be used directly for Japanese speech recognition.
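To be clear, the checkpoint itself loads fine without a tokenizer; it just maps raw audio to frame-level speech representations rather than text. A minimal sketch, assuming 16 kHz mono input and that the repo ships a preprocessor config (`waveform` is a placeholder):

```python
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2Model

model = Wav2Vec2Model.from_pretrained("rinna/japanese-wav2vec2-base")
# Assumes a preprocessor config exists in the repo; otherwise construct
# a Wav2Vec2FeatureExtractor with settings matching the pre-training setup.
feature_extractor = AutoFeatureExtractor.from_pretrained("rinna/japanese-wav2vec2-base")

# waveform: a 1-D float array of 16 kHz mono audio (placeholder).
inputs = feature_extractor(waveform, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    features = model(**inputs).last_hidden_state  # (batch, frames, hidden_size)
```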
If you want to use this pre-trained model for Japanese speech recognition, it needs to be fine-tuned on a Japanese corpus first. Fine-tuning can be performed with Transformers or fairseq, as shown below:
- https://huggingface.co/blog/fine-tune-wav2vec2-english
- https://github.com/facebookresearch/fairseq/blob/main/examples/wav2vec/README.md#fine-tune-a-pre-trained-model-with-ctc
Note that we also provide a fairseq checkpoint file in this model repository.
However, these examples are not written with a Japanese corpus in mind, so some modifications may be necessary for Japanese ASR.
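For the Transformers route, the setup from the blog post above adapts to this checkpoint roughly as follows. This is a minimal sketch, not a tested recipe: the vocabulary here is a placeholder (for Japanese you would typically build character-level units from your own transcripts), and the feature-extractor settings are assumptions that should be matched to the pre-training setup:

```python
import json

from transformers import (
    Wav2Vec2CTCTokenizer,
    Wav2Vec2FeatureExtractor,
    Wav2Vec2ForCTC,
    Wav2Vec2Processor,
)

# Placeholder vocabulary -- build the real one from the characters in your corpus.
chars = sorted(set("こんにちはさようなら"))
vocab = {c: i for i, c in enumerate(chars)}
vocab["|"] = len(vocab)      # word delimiter
vocab["<unk>"] = len(vocab)
vocab["<pad>"] = len(vocab)  # also serves as the CTC blank token

with open("vocab.json", "w", encoding="utf-8") as f:
    json.dump(vocab, f, ensure_ascii=False)

tokenizer = Wav2Vec2CTCTokenizer(
    "vocab.json", unk_token="<unk>", pad_token="<pad>", word_delimiter_token="|"
)
feature_extractor = Wav2Vec2FeatureExtractor(
    feature_size=1,
    sampling_rate=16_000,
    padding_value=0.0,
    do_normalize=True,  # assumption: match the normalization used in pre-training
    return_attention_mask=True,
)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

# Load the pre-trained encoder and attach a randomly initialized CTC head
# sized to the new vocabulary (expect a warning about fresh lm_head weights).
model = Wav2Vec2ForCTC.from_pretrained(
    "rinna/japanese-wav2vec2-base",
    vocab_size=len(tokenizer),
    pad_token_id=tokenizer.pad_token_id,
    ctc_loss_reduction="mean",
)
```

From there, training proceeds as in the blog post, with a `Trainer` and a CTC data collator built around the processor.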
Thank you for the detailed explanation. I understand now that the base model was trained only on unlabeled audio, that additional fine-tuning is required before it can perform speech recognition, and that the repo doesn't include a tokenizer. Thanks for the clarification.