
Lack of digit splitting in the slow version of the tokenizer

#11
by Forence - opened

As with the issue reported in https://huggingface.co/stabilityai/stablelm-2-12b/discussions/1#66151ad2d8148d0b668985f0, the slow GPT2Tokenizer does not apply the pre-tokenization split rule (digit splitting) defined in tokenizer.json.
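
A minimal sketch to reproduce the discrepancy, assuming the model id below (substitute the repo this discussion belongs to): the fast tokenizer honors the digit-split pre-tokenization rule from tokenizer.json, while the slow tokenizer does not.

```python
from transformers import AutoTokenizer

# Assumed model id for illustration; use the relevant repo.
model_id = "stabilityai/stablelm-2-12b"

fast = AutoTokenizer.from_pretrained(model_id, use_fast=True)
slow = AutoTokenizer.from_pretrained(model_id, use_fast=False)

text = "The year 2024 has 366 days."
# Fast tokenizer: digits are split into single-digit tokens.
print(fast.tokenize(text))
# Slow tokenizer: digits may be merged into multi-digit tokens,
# producing different token ids for the same input.
print(slow.tokenize(text))
```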

I suggest fixing this bug: it caused a severe performance drop for me and was quite hard to track down. Thanks!
