Getting error while loading the tokenizer after fine-tuning

#8
by debtanumathcs - opened

Hi, after fine-tuning the "ai4bharat/indictrans2-en-indic-1B" model, I tried to load the tokenizer with "tokenizer = AutoTokenizer.from_pretrained(finetuned_model_dir, trust_remote_code=True)" and got the following error at line 120 of "tokenization_indictrans.py":

TypeError: transformers.tokenization_utils.PreTrainedTokenizer.__init__() got multiple values for keyword argument 'src_vocab_file'

AI4Bharat org

I’m assuming you didn’t modify the vocabulary or tokenizer and just used the existing tokenizer to preprocess your data and fine-tune the model.

If that’s the case, you can directly use:

tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indictrans2-en-indic-1B", trust_remote_code=True)

This loads the original tokenizer, which works as expected and remains fully compatible with your fine-tuned model.

The error occurs because tokenizer.save_pretrained writes some extra fields into the saved tokenizer config, including the argument mentioned above, and these get passed through **kwargs when loading. The original config (and thus its **kwargs) doesn't contain these fields, but the saved one does. Since the tokenization script already passes the appropriate paths explicitly, the same keyword argument ends up receiving duplicate values.
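If you prefer to load the tokenizer from your fine-tuned directory anyway, a workaround consistent with the explanation above is to strip the offending path entries from the saved tokenizer_config.json before calling from_pretrained. This is a minimal sketch, not part of the official IndicTrans2 code; the helper name is mine, and only "src_vocab_file" is taken from the error message, so extend the key list with any other saved path fields that trigger the same duplicate-keyword error in your config:

```python
import json
from pathlib import Path


def strip_duplicate_tokenizer_kwargs(model_dir, keys=("src_vocab_file",)):
    """Drop saved path entries from tokenizer_config.json.

    The IndicTrans tokenization script passes these paths itself, so
    leaving them in the saved config makes PreTrainedTokenizer.__init__
    receive the same keyword argument twice. Returns the removed entries.
    """
    config_path = Path(model_dir) / "tokenizer_config.json"
    config = json.loads(config_path.read_text())
    removed = {k: config.pop(k) for k in keys if k in config}
    config_path.write_text(json.dumps(config, indent=2))
    return removed
```

After running this on the fine-tuned model directory, AutoTokenizer.from_pretrained(finetuned_model_dir, trust_remote_code=True) should no longer hit the duplicate-argument TypeError.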

It's working. Thanks a lot!

pranjalchitale changed discussion status to closed
