BUT-FIT
/

csmpt7b

@@ -44,8 +44,7 @@ In Figure 2, we perform  two ablations:
 - (a) After first hot swap, we continued training on the corpus #1 for a while. Result: The fact that test loss is slightly better, signifies the slight difference between distribution of corpus #1 and corpus #2.
 - (b) On step 94,000, the training loss stopped decreasing, increased, and around step 120,000 (near hot swap #2) started decreasing again. To ablate whether this was an effect of hot-swap, we resume training from step 93,000 using corpus #3.The optimizer states were reinitialized. Result: Neither corpus #3, nor optimizier state reinitialization seems to mitigate the issue of local divergence at step 94,000.
--
 <img src="figures/vloss_closeup.png" width="900"/>
 Figure 3: Test loss closeup, testing performed on split of internal-corpus #1. See Figure 2 description for ablation explanation.
@@ -54,7 +53,6 @@ Figure 3: Test loss closeup, testing performed on split of internal-corpus #1. S
 ### Vocabulary Swap
 To transfer knowledge from English model to Czech, we developed a simple method that (i) aligns several tokens between two vocabularies and (ii) copies the embeddings from original language to new language.
 <img src="figures/tllama_test.png" width="900"/>
 Figure 4: Ablation: Test perplexity over the course of training for vocabulary swap method on TinyLLAMA. Our method (green curve) vs TinyLLAMA training from scratch (blue curve).
 The vocabulary swap was done the same way as our [Czech-GPT-2](https://huggingface.co/BUT-FIT/Czech-GPT-2-XL-133k) model (check it out for comprehensive description.)

 - (a) After first hot swap, we continued training on the corpus #1 for a while. Result: The fact that test loss is slightly better, signifies the slight difference between distribution of corpus #1 and corpus #2.
 - (b) On step 94,000, the training loss stopped decreasing, increased, and around step 120,000 (near hot swap #2) started decreasing again. To ablate whether this was an effect of hot-swap, we resume training from step 93,000 using corpus #3.The optimizer states were reinitialized. Result: Neither corpus #3, nor optimizier state reinitialization seems to mitigate the issue of local divergence at step 94,000.
 <img src="figures/vloss_closeup.png" width="900"/>
 Figure 3: Test loss closeup, testing performed on split of internal-corpus #1. See Figure 2 description for ablation explanation.
 ### Vocabulary Swap
 To transfer knowledge from English model to Czech, we developed a simple method that (i) aligns several tokens between two vocabularies and (ii) copies the embeddings from original language to new language.
 <img src="figures/tllama_test.png" width="900"/>
 Figure 4: Ablation: Test perplexity over the course of training for vocabulary swap method on TinyLLAMA. Our method (green curve) vs TinyLLAMA training from scratch (blue curve).
 The vocabulary swap was done the same way as our [Czech-GPT-2](https://huggingface.co/BUT-FIT/Czech-GPT-2-XL-133k) model (check it out for comprehensive description.)