Getting error while loading the tokenizer after fine-tuning

#8
by debtanumathcs - opened

Hi, after fine-tuning the "ai4bharat/indictrans2-en-indic-1B" model, I tried to load the tokenizer with "tokenizer = AutoTokenizer.from_pretrained(finetuned_model_dir, trust_remote_code=True)" and got the following error at line 120 of "tokenization_indictrans.py":

TypeError: transformers.tokenization_utils.PreTrainedTokenizer.__init__() got multiple values for keyword argument 'src_vocab_file'

AI4Bharat org

I’m assuming you didn’t modify the vocabulary or tokenizer and just used the existing tokenizer to preprocess your data and fine-tune the model.

If that’s the case, you can directly use:

tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indictrans2-en-indic-1B", trust_remote_code=True)

This loads the original tokenizer, which works as expected and remains fully compatible with your fine-tuned model.

The error occurs because tokenizer.save_pretrained writes some extra fields into the saved tokenizer config, including the argument mentioned above, and these get passed through **kwargs when loading. The original config (and thus its **kwargs) doesn't contain these fields, but the saved one does. Since the tokenization script already passes the appropriate paths explicitly, the same keyword argument ends up receiving duplicate values.
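If you prefer to load the tokenizer from your fine-tuned directory anyway, a workaround consistent with the explanation above is to strip the offending path entries from the saved tokenizer_config.json before calling from_pretrained. This is a minimal sketch, not part of the official IndicTrans2 code; the helper name is mine, and only "src_vocab_file" is taken from the error message, so extend the key list with any other saved path fields that trigger the same duplicate-keyword error in your config:

```python
import json
from pathlib import Path


def strip_duplicate_tokenizer_kwargs(model_dir, keys=("src_vocab_file",)):
    """Drop saved path entries from tokenizer_config.json.

    The IndicTrans tokenization script passes these paths itself, so
    leaving them in the saved config makes PreTrainedTokenizer.__init__
    receive the same keyword argument twice. Returns the removed entries.
    """
    config_path = Path(model_dir) / "tokenizer_config.json"
    config = json.loads(config_path.read_text())
    removed = {k: config.pop(k) for k in keys if k in config}
    config_path.write_text(json.dumps(config, indent=2))
    return removed
```

After running this on the fine-tuned model directory, AutoTokenizer.from_pretrained(finetuned_model_dir, trust_remote_code=True) should no longer hit the duplicate-argument TypeError.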

It's working. Thanks a lot!

pranjalchitale changed discussion status to closed
