Reason behind not using special tokens in the prompt format?

#2
by Doctor-Shotgun - opened

Hello, hobbyist model finetuner here. Thanks for sharing your training hyperparameters!

I was just curious whether there was a specific reason for not using dedicated special tokens for the role headers in the prompt format (such as the ones already defined in the Llama 3 tokenizer, i.e. <|start_header_id|>, etc.)?

It appears that the <|system|>, <|user|>, and <|assistant|> headers used in the prompt format are not defined as special tokens, so in theory they could be tokenized variably into different combinations of substrings during training/inference.
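The difference is easy to see in `transformers`. Here's a minimal sketch of the behavior being described, using the GPT-2 tokenizer as a stand-in (the Llama 3 tokenizer is gated; the principle is the same) and <|assistant|> as the example header:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

# Without special-token status, the header is broken into ordinary BPE pieces,
# and surrounding text can change how it splits.
plain = tok.tokenize("<|assistant|>")
print(plain)  # multiple subword pieces

# After registering it as an additional special token, it always maps to a
# single, atomic token id.
tok.add_special_tokens({"additional_special_tokens": ["<|assistant|>"]})
special = tok.tokenize("<|assistant|>")
print(special)  # ['<|assistant|>']
```

Note that adding special tokens grows the vocabulary, so a finetune taking this route would also need to resize the model's embedding matrix (e.g. `model.resize_token_embeddings(len(tok))`).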

From the paper it seems some empirical testing was done - was this also attempted with the tokens above defined as special?

I just found out about this and I'm curious as well.
