Tokenizer can't be loaded - possibly related to recent Transformers versions

#2
by TheBloke - opened

Trying to load the tokenizer from this model in Transformers 4.35.0 results in the following error:

Python 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("OpenNLPLab/TransNormerLLM-1B", trust_remote_code=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/workspace/venv/pytorch2/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 755, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/workspace/venv/pytorch2/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2024, in from_pretrained
    return cls._from_pretrained(
  File "/workspace/venv/pytorch2/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2256, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/workspace/huggingface/modules/transformers_modules/OpenNLPLab/TransNormerLLM-1B/cf951417e7539e292188864a12171e2e2051917f/tokenization_baichuan.py", line 76, in __init__
    super().__init__(
  File "/workspace/venv/pytorch2/lib/python3.10/site-packages/transformers/tokenization_utils.py", line 367, in __init__
    self._add_tokens(
  File "/workspace/venv/pytorch2/lib/python3.10/site-packages/transformers/tokenization_utils.py", line 467, in _add_tokens
    current_vocab = self.get_vocab().copy()
  File "/workspace/huggingface/modules/transformers_modules/OpenNLPLab/TransNormerLLM-1B/cf951417e7539e292188864a12171e2e2051917f/tokenization_baichuan.py", line 112, in get_vocab
    for i in range(self.vocab_size)
  File "/workspace/huggingface/modules/transformers_modules/OpenNLPLab/TransNormerLLM-1B/cf951417e7539e292188864a12171e2e2051917f/tokenization_baichuan.py", line 106, in vocab_size
    return self.sp_model.get_piece_size()
AttributeError: 'BaiChuanTokenizer' object has no attribute 'sp_model'
>>> import transformers
>>> print(transformers.__version__)
4.35.0
>>>

I haven't tested earlier Transformers versions, but this error, `no attribute 'sp_model'`, is identical to one I hit with another model, which turned out to be related to recent Transformers versions.
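Reading the traceback, the failure looks like an ordering problem: PreTrainedTokenizer.__init__ now calls self._add_tokens(), which calls self.get_vocab(), which needs self.sp_model, but tokenization_baichuan.py only creates self.sp_model after super().__init__() returns. If that is the cause, one temporary workaround (untested here, and the exact version boundary of roughly 4.34 is my assumption) might be to pin an older Transformers release:

# Untested workaround sketch: pin Transformers to a release from before the
# added-tokens refactor (approximately < 4.34, an assumption), e.g.
#   pip install "transformers<4.34"
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "OpenNLPLab/TransNormerLLM-1B", trust_remote_code=True
)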

Note that your other model, TransNormerLLM-7B, does not have this problem:

>>> tokenizer = AutoTokenizer.from_pretrained("OpenNLPLab/TransNormerLLM-7B", trust_remote_code=True)
A new version of the following files was downloaded from https://huggingface.co/OpenNLPLab/TransNormerLLM-7B:
- tokenization_baichuan.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
>>>
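As a stopgap, the suggestion in that download message about pinning a revision can be done with the revision argument of from_pretrained. This is only a sketch; the hash below is a placeholder to be replaced with an actual commit from the repo's file history:

from transformers import AutoTokenizer

# "<commit-hash>" is a placeholder; substitute a real commit from the model
# repo so that both the weights and tokenization_baichuan.py stay fixed.
tokenizer = AutoTokenizer.from_pretrained(
    "OpenNLPLab/TransNormerLLM-7B",
    trust_remote_code=True,
    revision="<commit-hash>",
)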

Could you fix the tokenizer of this model so that it works with recent Transformers versions, as TransNormerLLM-7B does?

Thanks in advance

TheBloke

We appreciate you flagging this problem.

The root of the issue lies in the Transformers version. We'll be updating the tokenizer file for both the TransNormerLLM-1B and 385M models.

For a swift solution, check this link: https://github.com/baichuan-inc/Baichuan2/issues/204
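For reference, the change discussed there is roughly the following (a sketch, not the exact patch applied to this repo): the SentencePiece model has to be loaded before calling the base-class __init__, because recent Transformers versions call get_vocab() from inside it.

import sentencepiece as spm
from transformers import PreTrainedTokenizer

class BaiChuanTokenizer(PreTrainedTokenizer):
    def __init__(self, vocab_file, **kwargs):
        self.vocab_file = vocab_file
        # Moved up: previously this ran after super().__init__(), which is what
        # produced "AttributeError: 'BaiChuanTokenizer' object has no attribute 'sp_model'".
        self.sp_model = spm.SentencePieceProcessor()
        self.sp_model.Load(vocab_file)
        super().__init__(**kwargs)

    @property
    def vocab_size(self):
        # Same property that appears in the traceback above.
        return self.sp_model.get_piece_size()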

We've updated the associated files to resolve the problem stemming from the Transformers version.

OpenNLPLab changed discussion status to closed
