Problems with the OpenGPT-X tokenizer
Hello, and congratulations on the release! 🤗
I have tried to run some evaluations on Teuken-7B-instruct-research-v0.4, but I keep encountering errors that I believe come from the tokenizer.
Code to reproduce the issue (Transformers version 4.42.3):
```python
from transformers import GenerationConfig, TextGenerationPipeline, AutoTokenizer, AutoModelForCausalLM
import torch

# Load the custom tokenizer and the model (both require trust_remote_code)
tokenizer = AutoTokenizer.from_pretrained("openGPT-X/Teuken-7B-instruct-research-v0.4", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("openGPT-X/Teuken-7B-instruct-research-v0.4", trust_remote_code=True)

generation_config = GenerationConfig(max_new_tokens=100, do_sample=False)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

generator = TextGenerationPipeline(model=model, task="text-generation",
                                   tokenizer=tokenizer, device=device)
completion = generator("My name is", generation_config=generation_config)
```
Using the model via the TextGenerationPipeline has (oddly enough) given me two distinct errors:
```
RuntimeError: Boolean value of Tensor with more than one value is ambiguous
```
This error comes from SentencePiece (version 0.2.0); I sketch a possible workaround below. The second error is:
```
TypeError: HFGPTXTokenizer.decode() got an unexpected keyword argument 'skip_special_tokens'
```
This one comes from the custom tokenizer itself. Could you please suggest workarounds or push some patches? The `skip_special_tokens` error occurs when I run this model through the Language Model Evaluation Harness, which works for most mainstream models on the Hub.
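In the meantime, here is the interim workaround I am using for the first error. This is only a sketch under assumptions: I suspect the RuntimeError is triggered by a torch.Tensor reaching SentencePiece's Python bindings (which expect plain ints or lists of ints), and I assume the custom tokenizer supports the standard `__call__`/`decode` interface, so I bypass the pipeline and decode a plain Python list instead:

```python
# Hedged workaround sketch for the RuntimeError, assuming the failure is a
# torch.Tensor reaching SentencePiece: bypass the pipeline, generate directly,
# and hand decode() a plain Python list of token ids.
inputs = tokenizer("My name is", return_tensors="pt").to(device)
output_ids = model.generate(**inputs, max_new_tokens=100, do_sample=False)
completion = tokenizer.decode(output_ids[0].tolist())  # .tolist() avoids passing a Tensor
print(completion)
```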
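For the `skip_special_tokens` TypeError, a minimal monkey-patch gets the Evaluation Harness past the crash. `patched_decode` is a hypothetical helper of my own; it simply drops the keyword arguments that the custom `HFGPTXTokenizer.decode()` does not accept, so flags like `skip_special_tokens` are silently ignored rather than honored:

```python
import functools

# Hypothetical monkey-patch: swallow keyword arguments (e.g. skip_special_tokens)
# that pipelines and lm-eval pass but HFGPTXTokenizer.decode() does not accept.
_original_decode = tokenizer.decode

@functools.wraps(_original_decode)
def patched_decode(token_ids, *args, **kwargs):
    # Forward only the token ids; extra flags are dropped, not honored.
    return _original_decode(token_ids)

tokenizer.decode = patched_decode
```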
PS: Are there prospects for releasing a fast tokenizer?
Cheers!
forgot the code block specifier for Python ._.
@danielsteinigen can you help?
Hi @nicholasKluge, thanks for pointing this out. It should be fixed now.
We will also look into fast tokenizers soon.