Problems with the OpenGPT-X tokenizer

#4
by nicholasKluge - opened

Hello, and congratulations on the release! 🤗

I have tried to run some evaluations on Teuken-7B-instruct-research-v0.4, but I keep encountering errors that I believe come from the tokenizer.

Code to replicate (Transformers version: 4.42.3):

```python
from transformers import GenerationConfig, TextGenerationPipeline, AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("openGPT-X/Teuken-7B-instruct-research-v0.4", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("openGPT-X/Teuken-7B-instruct-research-v0.4", trust_remote_code=True)
generation_config = GenerationConfig(max_new_tokens=100, do_sample=False)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

generator = TextGenerationPipeline(model=model, task="text-generation",
                                   tokenizer=tokenizer, device=device)

completion = generator("My name is", generation_config=generation_config)
```

Trying to use the model via the TextGenerationPipeline has given me (weirdly...) two distinct errors:

`RuntimeError: Boolean value of Tensor with more than one value is ambiguous`

This error comes from SentencePiece (SentencePiece version: 0.2.0).
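In the meantime, I can bypass the pipeline by calling `model.generate()` directly. Here is a minimal sketch of that workaround, reusing the `tokenizer`, `model`, and `device` objects from the snippet above, and assuming the custom tokenizer supports `return_tensors="pt"`:

```python
# Move the model to the target device and encode the prompt directly,
# bypassing TextGenerationPipeline entirely
model.to(device)
inputs = tokenizer("My name is", return_tensors="pt")
inputs = {k: v.to(device) for k, v in inputs.items()}

# Greedy decoding, matching the GenerationConfig above
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=False)

# Convert the output tensor to a plain Python list before decoding, since
# the custom decode() may not handle tensors (possibly the source of the
# ambiguous-boolean error inside SentencePiece)
completion = tokenizer.decode(outputs[0].tolist())
print(completion)
```

The second error: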

`TypeError: HFGPTXTokenizer.decode() got an unexpected keyword argument 'skip_special_tokens'`

This error comes from the custom tokenizer. Could you please provide workarounds or push some patches? The `skip_special_tokens` error occurs when I run this model through the Language Model Evaluation Harness, which works for most of the main models on the Hub.
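For the Evaluation Harness, a temporary workaround could be to monkey-patch the tokenizer so that the unsupported keyword is dropped before the call reaches `HFGPTXTokenizer.decode()`. A hypothetical sketch, not a proper fix (special tokens are simply not stripped):

```python
# Keep a reference to the custom tokenizer's original decode()
_original_decode = tokenizer.decode

def patched_decode(token_ids, **kwargs):
    # HFGPTXTokenizer.decode() does not accept skip_special_tokens,
    # so drop that keyword before delegating to the original method
    kwargs.pop("skip_special_tokens", None)
    return _original_decode(token_ids, **kwargs)

tokenizer.decode = patched_decode
```

With this in place, callers that pass `skip_special_tokens=True` (like the harness) at least no longer crash.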

PS: Are there plans to release a fast tokenizer?

Cheers!

forgot the code block specifier for Python ._.

OpenGPT-X org

@danielsteinigen can you help?

Hi @nicholasKluge, thanks for pointing this out. It should be fixed now.
We will also look into fast tokenizers soon.

mfromm changed discussion status to closed
