Infinite loop loading tokenizer (specific to 33B-GPTQ repo)
Hi all, loading the tokenizer leads to an infinite loop with the latest transformers:
File "/usr/local/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 250, in convert_tokens_to_ids
return self._convert_token_to_id_with_added_voc(tokens)
File "/usr/local/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 257, in _convert_token_to_id_with_added_voc
return self.unk_token_id
File "/usr/local/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1150, in unk_token_id
return self.convert_tokens_to_ids(self.unk_token)
File "/usr/local/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1030, in unk_token
return str(self._unk_token)
RecursionError: maximum recursion depth exceeded while getting the str of an object
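For reference, this is roughly the call that hits the recursion for me (a minimal repro sketch; I'm assuming the repo id TheBloke/guanaco-33B-GPTQ for this 33B repo):

from transformers import AutoTokenizer

# Repro sketch: the repo id is assumed from this thread, not verified here.
tokenizer = AutoTokenizer.from_pretrained(
    "TheBloke/guanaco-33B-GPTQ",
    use_fast=True,
)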
I resolved this by just using the 65B or 7B tokenizer configs:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "TheBloke/guanaco-65B-GPTQ",
    use_fast=True,
)
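As a quick sanity check with the replacement tokenizer (the ids in the comments are the usual Llama defaults, so treat them as assumptions rather than verified output):

# Special tokens now resolve to real ids instead of recursing.
print(tokenizer.unk_token, tokenizer.unk_token_id)  # typically '<unk>' 0 for Llama
print(tokenizer.bos_token, tokenizer.bos_token_id)  # typically '<s>' 1
print(tokenizer("Hello world").input_ids)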
Is there a reason the 33B repo specifically has a different tokenizer?
Oh, interesting. I never noticed that.
So I got the 33B tokenizer from the model Tim Dettmers merged himself, TimDettmers/guanaco-33b-merged.
He didn't merge the other sizes, so I merged them myself and used the standard Llama base tokenizers for those.
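Roughly speaking, the merge is just applying the Guanaco adapter to the Llama base with peft and saving the base tokenizer alongside it. Here's a sketch with the 7B (the repo ids are assumptions for illustration, not my exact commands):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the Llama base model and apply the Guanaco adapter on top of it.
base = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b", torch_dtype=torch.float16
)
merged = PeftModel.from_pretrained(base, "timdettmers/guanaco-7b").merge_and_unload()
merged.save_pretrained("guanaco-7B-merged")

# Use the standard Llama base tokenizer for the merged model.
tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
tokenizer.save_pretrained("guanaco-7B-merged")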
I've compared the 33B and 65B tokenizers and yeah, there are a few differences. For example, the 33B doesn't list the <s> token, where the 65B does.
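If you want to compare them yourself, something like this works (just a sketch; the 33B repo id is assumed from this thread, and its tokenizer file has since been replaced as noted below):

import json
from huggingface_hub import hf_hub_download

# Download each repo's tokenizer_config.json and print its special-token entries.
for repo in ("TheBloke/guanaco-33B-GPTQ", "TheBloke/guanaco-65B-GPTQ"):
    path = hf_hub_download(repo_id=repo, filename="tokenizer_config.json")
    with open(path) as f:
        cfg = json.load(f)
    print(repo, {k: cfg.get(k) for k in ("bos_token", "eos_token", "unk_token")})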
Given you're getting errors, I have removed the 33B tokenizer and replaced it with the file from my 65B repo. Thanks for the report!
Thanks, fixed