Would you be willing to share some training details?
#1 by ZzWater
Hi beomi,
I tried this model, and it performs very competitively on downstream tasks. Would you be willing to share some training details? For instance,
- How you expanded the vocabulary,
- The scale and sources of the continued-pretraining data.
I trained a new tokenizer with SPM on my Korean+English corpus, hand-picked (:eyes:) the top ~16,000 Korean tokens, and added them to the Yi tokenizer (along with the merges).
This model was trained on more than 60B Korean tokens (~120GB) with the new tokenizer -- the corpus includes Korean Wikipedia and other online-scraped text.
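For anyone wanting to try something similar, below is a minimal sketch of this kind of vocabulary expansion. It assumes the `01-ai/Yi-6B` checkpoint and uses placeholder paths and hyperparameters (`corpus.txt`, `vocab_size=32000`, a simple Hangul-range filter); it is not the exact pipeline described above. In particular, `add_tokens()` only registers the pieces as added tokens -- folding them into the BPE vocab and merges, as mentioned above, requires editing the tokenizer files directly, which this sketch skips.

```python
# Minimal sketch of vocabulary expansion. Checkpoint name, file paths,
# vocab_size, and the Hangul-range filter are placeholders, not the actual setup.
import sentencepiece as spm
from transformers import AutoModelForCausalLM, AutoTokenizer

# 1. Train a SentencePiece model on the combined Korean+English corpus.
spm.SentencePieceTrainer.train(
    input="corpus.txt",          # placeholder corpus file
    model_prefix="ko_en_spm",
    vocab_size=32000,
    character_coverage=0.9995,
)

# 2. Collect candidate pieces and keep Korean ones (here: any piece containing Hangul).
sp = spm.SentencePieceProcessor(model_file="ko_en_spm.model")
pieces = [sp.id_to_piece(i) for i in range(sp.get_piece_size())]
korean = [p for p in pieces if any("\uac00" <= ch <= "\ud7a3" for ch in p)]
selected = korean[:16000]        # stand-in for the manual top-~16k selection

# 3. Add the selected tokens to the Yi tokenizer and resize the embedding matrix.
tokenizer = AutoTokenizer.from_pretrained("01-ai/Yi-6B")
model = AutoModelForCausalLM.from_pretrained("01-ai/Yi-6B")
new_tokens = [p.replace("\u2581", "") for p in selected]           # strip the SPM space marker
new_tokens = [t for t in new_tokens if t and t not in tokenizer.get_vocab()]
num_added = tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens, new vocab size {len(tokenizer)}")
```

The embedding rows for the newly added tokens are randomly initialized, so they only become useful after continued pretraining on the Korean corpus.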
beomi changed discussion status to closed