Special tokens for instruction template

#3
by Weker - opened

I hope this is a good place to ask this, but when I run the original Llama 3 model, special tokens like

  • '<|begin_of_text|>'
  • '<|start_header_id|>'
  • '<|end_header_id|>'

get tokenized into one token each (128000, 128006, and 128007 respectively), but when running this model every special token gets tokenized into many more tokens. For example, '<|start_header_id|>' gets translated to:

  • 128000 - ''
  • 27 - '<'
  • 91 - '|'
  • 2527 - 'start'
  • 8932 - '_header'
  • 851 - '_id'
  • 91 - '|'
  • 29 - '>'

Is that intended behavior, or am I doing something wrong? I noticed this when I used a lot of short "user" sections and ran out of context fast.
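To illustrate the difference: when a tokenizer treats these markers as added special tokens, each one maps to a single reserved ID; when it doesn't, the marker falls through to the ordinary subword vocabulary and is split into many pieces. The sketch below is a toy model of that behavior, not the real Llama 3 BPE tokenizer — the special-token IDs are the ones quoted above, and per-character IDs stand in for BPE pieces.

```python
import re

# Special-token IDs as quoted in the post (Llama 3 instruct template).
SPECIAL_TOKENS = {
    "<|begin_of_text|>": 128000,
    "<|start_header_id|>": 128006,
    "<|end_header_id|>": 128007,
}

def encode(text, treat_specials_atomically=True):
    """Toy encoder: split out special tokens first when enabled,
    otherwise fall back to per-character IDs (a stand-in for BPE)."""
    if not treat_specials_atomically:
        return [ord(c) for c in text]
    pattern = "(" + "|".join(re.escape(t) for t in SPECIAL_TOKENS) + ")"
    ids = []
    for piece in re.split(pattern, text):
        if piece in SPECIAL_TOKENS:
            ids.append(SPECIAL_TOKENS[piece])  # one ID per marker
        else:
            ids.extend(ord(c) for c in piece)  # placeholder for BPE pieces
    return ids

# Atomic handling: the whole marker becomes a single ID.
print(encode("<|start_header_id|>"))  # [128006]
# Without special handling, the same marker is split into 19 pieces,
# which is how context gets eaten up in templated chats.
print(len(encode("<|start_header_id|>", treat_specials_atomically=False)))
```

This is why the template markers burning many tokens each adds up quickly when a conversation has many short sections.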

Edit: I am using Text generation web UI. I don't know if that is relevant.
