Pretrain Phi2 on Indian Languages

#95

by praveengovi - opened Jan 23

Jan 23

Is it possible to create an augmented tokenizer based on the Phi-2 model for pretraining on a new language?
I cannot find the pretraining code for the Phi-2 model. Kindly share it if anyone has come across it.

Microsoft org Jan 26

Yes, you can extend the tokenizer and add new tokens (for the Indian languages). Just make sure you account for the new tokens in the model's vocabulary size: if you add more tokens than the embedding layer currently supports, the embeddings need to be resized.
We did not release the pre-training code for Phi-2. However, you can do the pre-training with whichever tools you are fond of, for example transformers and accelerate.
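As a rough illustration of the tokenizer extension described above, here is a minimal sketch using the transformers methods `add_tokens`, `get_input_embeddings` and `resize_token_embeddings`. The helper names `extend_vocabulary` and `demo`, and the two sample tokens, are made up for this example; in practice the new tokens would probably come from a tokenizer trained on your own corpus. It is not the procedure used to train Phi-2.

```python
def extend_vocabulary(tokenizer, model, new_tokens):
    """Add tokens to the tokenizer and grow the embeddings only if needed.

    Hypothetical helper: `tokenizer` and `model` are expected to behave like
    a Hugging Face tokenizer and causal LM. Returns the number of tokens
    that were actually added.
    """
    # add_tokens skips tokens that are already in the vocabulary
    num_added = tokenizer.add_tokens(new_tokens)

    # Number of rows the embedding matrix currently has
    embedding_rows = model.get_input_embeddings().weight.shape[0]

    # Resize only when the tokenizer has outgrown the embedding matrix
    if len(tokenizer) > embedding_rows:
        model.resize_token_embeddings(len(tokenizer))
    return num_added


def demo():
    """Usage sketch; downloads the model, so it is not called automatically."""
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
    model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")

    # Placeholder tokens, chosen only for illustration
    candidate_tokens = ["नमस्ते", "வணக்கம்"]
    added = extend_vocabulary(tokenizer, model, candidate_tokens)
    print(f"added {added} tokens, tokenizer size is now {len(tokenizer)}")
```

Calling `demo()` runs the helper against the published checkpoint. Embedding rows for newly added tokens carry no learned meaning yet, so the model still needs continued pre-training on text in the new language before those tokens are useful.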

Regards,
Gustavo.

gugarosa changed discussion status to closed Jan 26
