Would you be willing to share some training details?

by ZzWater

Hi beomi,
I tried this model, and it performs very competitively on downstream tasks. Would you be willing to share some training details? For instance:

  1. How did you expand the vocabulary?
  2. What were the scale and sources of the continued-pretraining data?

beomi (Owner)

I trained a new tokenizer with SPM on my Korean+English corpus, hand-picked (:eyes:) the top ~16,000 Korean tokens from it, and added them to the Yi tokenizer (along with the corresponding merges).
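A minimal sketch of that recipe with `sentencepiece` and `transformers`, not the author's actual script: the corpus path, SPM vocab size, and the `01-ai/Yi-6B` repo id below are illustrative assumptions; the reply only specifies SPM, ~16k hand-picked Korean tokens, and added merges.

```python
# Sketch only: paths, vocab_size, and model ids are illustrative assumptions.
import sentencepiece as spm
from transformers import AutoTokenizer

# 1) Train a fresh SPM tokenizer on the Korean+English corpus.
spm.SentencePieceTrainer.train(
    input="korean_english_corpus.txt",  # hypothetical corpus path
    model_prefix="ko_en_spm",
    vocab_size=32000,                   # assumed; the post doesn't say
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="ko_en_spm.model")

def is_korean(piece: str) -> bool:
    # Hangul syllables occupy U+AC00..U+D7A3.
    return any("\uac00" <= ch <= "\ud7a3" for ch in piece)

# 2) Rank Korean pieces by SPM score -- a crude stand-in for the
#    manual inspection described in the reply.
pieces = [(sp.id_to_piece(i), sp.get_score(i)) for i in range(sp.get_piece_size())]
korean = sorted((p for p in pieces if is_korean(p[0])), key=lambda p: -p[1])
top_korean = [piece for piece, _ in korean[:16000]]

# 3) Graft the selected tokens onto the Yi tokenizer.
yi_tok = AutoTokenizer.from_pretrained("01-ai/Yi-6B")
added = yi_tok.add_tokens(top_korean)
print(f"added {added} tokens; vocab is now {len(yi_tok)}")
# Caveat: add_tokens only registers whole tokens. The reply also mentions
# adding merges, which means editing the BPE merge table in the tokenizer
# files directly -- not shown here.
```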

The model was then continually pretrained on more than 60B Korean tokens (~120GB) using the new tokenizer; the corpus includes Korean Wikipedia and other web-scraped text.
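One step the reply leaves implicit: after growing the vocabulary, the model's embedding and LM-head matrices must be resized to match before continued pretraining can start. A minimal sketch, assuming a hypothetical path where the expanded tokenizer from above was saved:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical save path for the expanded tokenizer from the sketch above.
tok = AutoTokenizer.from_pretrained("yi-ko-tokenizer")
model = AutoModelForCausalLM.from_pretrained("01-ai/Yi-6B")

# Grow the input embeddings (and the LM head) to the new vocab size;
# the new rows start randomly initialized and are learned during the
# continued-pretraining run.
model.resize_token_embeddings(len(tok))
```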

