Adding '\n' to this model (using CTranslate2)

#7
by Geremia23 - opened

How do I add special tokens (like \n) to this model?

tokenizer.add_tokens('\n') seems to work, but CTranslate2 drops the \n when translating:

import ctranslate2
import transformers

translator = ctranslate2.Translator("opus-mt-de-en", device='cuda')
tokenizer = transformers.AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-de-en")

# add special token
tokenizer.add_tokens('\n')     # output:  1

tokenizer.added_tokens_decoder     # output: {58101: '\n'}
tokenizer.added_tokens_encoder     # output: {'\n': 58101}

# tokenize the source; the '\n' token is present here
source = tokenizer.convert_ids_to_tokens(tokenizer.encode("Guten\ntag!"))
# source == ['▁Guten', '\n', '▁', 'tag', '!', '</s>']

results = translator.translate_batch([source], beam_size=5)
# results == [TranslationResult(hypotheses=[['▁Good', '▁day', '!']], scores=[], attention=[])]
# NOTICE THE `\n` IS DROPPED from the hypothesis!

How do I get CTranslate2 to map token ID #58101 to \n?
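
Until that mapping question is settled, one possible fallback is to keep the newlines outside the model entirely: split the text on \n, translate each line on its own, and rejoin. The following is only a minimal sketch under that assumption. The names `translate_preserving_newlines` and `translate_line` are hypothetical and not part of CTranslate2 or transformers; `translate_line` stands in for whatever function wraps the tokenizer and `translator.translate_batch` for a single line. Translating line by line also loses any context that spans lines.

```python
from typing import Callable


def translate_preserving_newlines(text: str, translate_line: Callable[[str], str]) -> str:
    """Translate each line separately and rejoin with the original newlines.

    `translate_line` is a hypothetical callable (an assumption of this sketch)
    that takes one line of source text and returns its translation.
    """
    # str.split("\n") keeps empty strings for blank lines, so the layout survives.
    lines = text.split("\n")
    # Blank or whitespace-only lines are passed through untouched.
    translated = [translate_line(line) if line.strip() else line for line in lines]
    return "\n".join(translated)
```

With the setup from the snippet above, `translate_line` would encode one line, call `translator.translate_batch`, and decode the first hypothesis back to a string.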

Geremia23 changed discussion title from Add '\n' to this model? to Adding '\n' to this model (using CTranslate2)