EOS token to separate documents

#1
by keremturgutlu - opened

Hi, I know this code snippet is mostly taken from the original paper, but I noticed that neither this ul2_denoiser nor the T5x span_corruption actually separates documents with an EOS token. I'm not sure whether this is intended, but it seems useful to give the model a signal that two adjacent documents in the packed input are different.

I created a notebook that adds the EOS token:

https://www.kaggle.com/keremt/seqio-tutorial/
(Search for # FIX: Added append_eos to separate different documents in encoder.)
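To make the fix concrete, here is a minimal sketch of what the task definition looks like with an extra append_eos step before span corruption. The task name and the C4 data source are illustrative stand-ins, not the exact code from the notebook; the point is only where append_eos sits in the preprocessor list.

```python
import functools

import seqio
import t5.data.preprocessors

vocabulary = seqio.SentencePieceVocabulary(
    "/kaggle/input/google-ul2/spiece.model", extra_ids=100)

output_features = {
    "inputs": seqio.Feature(vocabulary=vocabulary, add_eos=True),
    "targets": seqio.Feature(vocabulary=vocabulary, add_eos=True),
}

seqio.TaskRegistry.add(
    "span_corruption_with_eos",  # illustrative task name
    source=seqio.TfdsDataSource(tfds_name="c4/en:3.0.1", splits=["train"]),
    preprocessors=[
        functools.partial(
            t5.data.preprocessors.rekey,
            key_map={"inputs": None, "targets": "text"}),
        seqio.preprocessors.tokenize,
        # FIX: append EOS right after tokenization, so that when span_corruption
        # concatenates and re-splits documents, adjacent documents in the packed
        # encoder input are still separated by an EOS token.
        seqio.preprocessors.append_eos,
        seqio.CacheDatasetPlaceholder(),
        t5.data.preprocessors.span_corruption,
        seqio.preprocessors.append_eos_after_trim,
    ],
    output_features=output_features,
)
```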

Also, in your script the vocabulary should probably be vocabulary = seqio.SentencePieceVocabulary('/kaggle/input/google-ul2/spiece.model', extra_ids=100); it is currently missing the extra ids. The models might have trained fine regardless: without extra_ids, span corruption would take its sentinels from the last 100 tokens of the base vocabulary, which may be infrequent enough not to affect results, and the models would simply learn to use those tokens as denoising sentinel/prompt tokens.
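For reference, a quick way to see the difference, assuming the same spiece.model path as above: extra_ids tells seqio to reserve dedicated sentinel tokens on top of the base SentencePiece vocabulary.

```python
import seqio

# Without extra_ids, the denoising sentinels have to come out of the end of the
# existing vocabulary; with extra_ids=100, seqio appends 100 dedicated sentinel
# tokens after the base SentencePiece pieces.
base_vocab = seqio.SentencePieceVocabulary(
    "/kaggle/input/google-ul2/spiece.model")
vocab = seqio.SentencePieceVocabulary(
    "/kaggle/input/google-ul2/spiece.model", extra_ids=100)

print(base_vocab.vocab_size)  # size of the raw SentencePiece model
print(vocab.vocab_size)       # base size + 100 sentinel ids
```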
