Pretrain Phi2 on Indian Languages

#95
by praveengovi - opened
  1. Is it possible to build an augmented tokenizer on top of the Phi-2 tokenizer for pretraining on a new language?

  2. I cannot find the pretraining code for the Phi-2 model. Please share it if anyone has come across it.

Microsoft org

Hello @praveengovi !

  1. Yes, you can extend the tokenizer by adding new tokens for the Indian languages. Just make sure the model's vocabulary size accounts for the new tokens: if you add more tokens than the embedding layer currently has room for, the embeddings must be resized.

  2. We did not release the pre-training code for Phi-2. However, you can run the pre-training with whichever tools you prefer, for example transformers and accelerate.
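
As a rough illustration of both points, here is a minimal sketch, not the code used to train Phi-2. It assumes the `microsoft/phi-2` checkpoint, the `transformers`, `accelerate` and `torch` packages, and a plain list of text strings as training data. The helper names (`needs_resize`, `extend_tokenizer_and_model`, `continue_pretraining`), the hyperparameters and the example tokens are hypothetical. The embedding matrix is resized only when the extended tokenizer no longer fits in it.

```python
def needs_resize(tokenizer_len: int, embedding_rows: int) -> bool:
    """True when the tokenizer has more entries than the embedding matrix has rows."""
    return tokenizer_len > embedding_rows


def extend_tokenizer_and_model(model_name: str, new_tokens: list):
    """Add tokens to the tokenizer and grow the model embeddings if required."""
    # Imported lazily so that needs_resize() works without the libraries installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # add_tokens() skips tokens already in the vocabulary and returns the number added.
    num_added = tokenizer.add_tokens(new_tokens)

    embedding_rows = model.get_input_embeddings().weight.shape[0]
    if needs_resize(len(tokenizer), embedding_rows):
        # Newly created rows are freshly initialized and only become useful after training.
        model.resize_token_embeddings(len(tokenizer))
    return tokenizer, model, num_added


def continue_pretraining(model, tokenizer, texts, lr=1e-5, batch_size=2, max_length=512):
    """One pass of causal language modeling over `texts` (illustrative hyperparameters)."""
    import torch
    from accelerate import Accelerator
    from torch.utils.data import DataLoader

    # Assumption: reuse the end-of-sequence token for padding if none is defined.
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    def collate(batch):
        enc = tokenizer(batch, return_tensors="pt", padding=True,
                        truncation=True, max_length=max_length)
        labels = enc["input_ids"].clone()
        labels[enc["attention_mask"] == 0] = -100  # padding is ignored by the loss
        return {"input_ids": enc["input_ids"],
                "attention_mask": enc["attention_mask"],
                "labels": labels}

    accelerator = Accelerator()
    loader = DataLoader(texts, batch_size=batch_size, shuffle=True, collate_fn=collate)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

    model.train()
    for batch in loader:
        loss = model(**batch).loss  # next-token prediction loss
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
    return model


if __name__ == "__main__":
    # Hypothetical example tokens; in practice they would come from a tokenizer
    # trained on a corpus in the target language.
    tokenizer, model, added = extend_tokenizer_and_model(
        "microsoft/phi-2", ["नमस्ते", "வணக்கம்"]
    )
    print(f"added {added} tokens, tokenizer size is now {len(tokenizer)}")
    continue_pretraining(model, tokenizer, ["नमस्ते दुनिया", "வணக்கம் உலகம்"])
```

Comparing `len(tokenizer)` with the actual number of embedding rows, rather than resizing unconditionally, avoids shrinking an embedding matrix that is already larger than the tokenizer. For a real run you would stream a large corpus, pack sequences to a fixed length, and save checkpoints; the loop above only shows the overall shape.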

Regards,
Gustavo.

gugarosa changed discussion status to closed
