Would you be willing to share some training details?

by ZzWater - opened Dec 22, 2023

edited Dec 22, 2023

Hi beomi,
I tried this model, and it performs very competitively on downstream tasks. Would you be willing to share some training details? For instance,

How did you expand the vocabulary?
What were the scale and sources of the continued-pretraining data?

beomi

Owner Jan 5

I trained a new tokenizer with SPM (SentencePiece) on my Korean+English corpus, picked the top ~16,000 Korean tokens by checking them with my :eyes:, and added them to the Yi tokenizer (along with the corresponding merges).
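For readers who want to try something similar, here is a minimal sketch of the "select top Korean tokens and append them" step. It is not beomi's actual script: the function names, the Hangul-range filter, the toy vocabularies, and the scores are all assumptions made for illustration. It also leaves out the manual review, the merge rules, and resizing the model's embedding matrix for the new ids.

```python
import re

# Hangul syllables block (U+AC00..U+D7A3). Treating "contains a Hangul
# syllable" as "Korean token" is an assumption of this sketch.
HANGUL = re.compile(r"[\uac00-\ud7a3]")


def select_new_tokens(candidates, base_vocab, limit):
    """Pick up to `limit` Hangul-containing tokens missing from base_vocab.

    candidates: list of (token, score) pairs; a higher score is preferred.
    """
    ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
    picked = []
    for token, _score in ranked:
        if len(picked) >= limit:
            break
        if token in base_vocab or token in picked:
            continue  # already covered, nothing to add
        if HANGUL.search(token):
            picked.append(token)
    return picked


def extend_vocab(base_vocab, new_tokens):
    """Return a copy of base_vocab with new_tokens appended at fresh ids."""
    extended = dict(base_vocab)
    next_id = max(extended.values(), default=-1) + 1
    for token in new_tokens:
        extended[token] = next_id
        next_id += 1
    return extended


# Toy stand-ins (hypothetical): a tiny base vocabulary and scored
# candidates as they might come out of a newly trained tokenizer.
base_vocab = {"<unk>": 0, "▁the": 1, "▁model": 2, "한": 3}
candidates = [
    ("▁한국어", -5.1),
    ("▁the", -3.0),
    ("한", -4.0),
    ("▁모델", -6.2),
    ("▁token", -6.5),
    ("습니다", -5.8),
]

new_tokens = select_new_tokens(candidates, base_vocab, limit=2)
extended = extend_vocab(base_vocab, new_tokens)
print(new_tokens)
print(extended)
```

In a real setup the candidates would come from the newly trained tokenizer's vocabulary and `limit` would be around 16,000; the existing ids stay untouched so the base model's embeddings remain valid.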

This model was trained on more than 60B Korean tokens (~120GB) with the new tokenizer. The data includes Korean Wikipedia and other corpora scraped from the web.

beomi changed discussion status to closed Jan 5
