# Eval

Dev eval on CS-HellaSwag (an automatically translated HellaSwag benchmark).

| Model | CS-HellaSwag Accuracy |
|---------------|----------------|
| mistral7b | 0.4992 |
| csmpt@130k steps [released] | __0.5004__ |
| csmpt@100k steps | 0.4959 |
| csmpt@75k steps | 0.4895 |
| csmpt@50k steps | 0.4755 |
| csmpt@26.5k steps | 0.4524 |

However, we ran validation on CS-HellaSwag over the course of training, and after 100k steps the improvements, if any, were very noisy.
The improvement over mistral7b is not significant.
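As a sanity check on that claim, a two-proportion z-test can be applied to the two accuracies; a minimal sketch, assuming a hypothetical dev-set size of 10,000 examples (the actual CS-HellaSwag dev size is not stated in this README):

```python
import math

def two_proportion_z(p1, p2, n):
    """Two-proportion z-statistic for two accuracies, each measured on n examples."""
    p = (p1 + p2) / 2                      # pooled accuracy
    se = math.sqrt(2 * p * (1 - p) / n)    # standard error of the difference
    return (p1 - p2) / se

# csmpt@130k (0.5004) vs mistral7b (0.4992); n = 10000 is an assumption.
z = two_proportion_z(0.5004, 0.4992, n=10000)
print(f"z = {z:.3f}")  # |z| is well below the 1.96 threshold for p < 0.05
```

With these numbers the statistic is around 0.17, far from the 1.96 needed for significance at the 5% level, consistent with the statement above.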

<TBD> More evaluation details teaser.

## Loss

We encountered loss spikes during training. As the model always recovered, and our budget for training a 7b model was very constrained, we kept training. We had observed such loss spikes before in our ablations. In these ablations (with GPT-2 small), we found them to be:

- (a) influenced by the learning rate: the lower the learning rate, the less often they appear; as it gets higher, they start to appear, and with too high a learning rate the training might diverge on such a loss spike.
- (b) present, in preliminary ablations, only for continuously pretrained models. While we do not know why they appear, we hypothesize this might be linked to the theory of [Adam instability in the time-domain correlation of update vectors](https://arxiv.org/pdf/2304.09871.pdf). However, such instabilities were previously observed only for much larger models (larger than 65b).

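Spikes of this kind can be flagged automatically in a logged loss curve. The helper below is not part of the csmpt training code, just an illustrative sketch: it marks points that exceed a rolling median of the preceding window by a fixed margin.

```python
# Hypothetical spike detector: flag loss values that jump above the rolling
# median of the previous `window` steps by more than `threshold`.
def find_spikes(losses, window=5, threshold=0.5):
    spikes = []
    for i in range(window, len(losses)):
        recent = sorted(losses[i - window:i])
        median = recent[window // 2]
        if losses[i] - median > threshold:
            spikes.append(i)
    return spikes

# A smoothly decreasing curve with one injected spike at index 7.
curve = [2.9, 2.8, 2.75, 2.7, 2.68, 2.66, 2.65, 4.0, 2.63, 2.62]
print(find_spikes(curve))  # -> [7]
```

A median (rather than a mean) keeps the baseline robust: the spike itself does not drag the reference value upward on the following steps.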
The model was trained on 3 corpora. Corpus #1 was the same one we used for GPT-2 training (~16b tokens). <TBD MF>
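The corpus hotswapping described here can be pictured as a token stream that switches its underlying corpus at scheduled global steps. A minimal sketch, with made-up step boundaries and corpus contents (the actual csmpt schedule is not given in this README):

```python
# Illustrative corpus hotswap: iterate one corpus until a scheduled step,
# then continue from the next one. All numbers and names are hypothetical.
def hotswap_stream(corpora, switch_steps):
    """corpora: list of iterables; switch_steps: ascending global steps at
    which training moves on to the next corpus."""
    step, idx = 0, 0
    iterators = [iter(c) for c in corpora]
    while idx < len(iterators):
        if idx < len(switch_steps) and step >= switch_steps[idx]:
            idx += 1  # hotswap: switch to the next corpus
            continue
        try:
            yield step, next(iterators[idx])
        except StopIteration:
            idx += 1  # current corpus exhausted before its switch step
            continue
        step += 1

batches = list(hotswap_stream(
    corpora=[["c1"] * 4, ["c2"] * 4, ["c2.1"] * 4],
    switch_steps=[2, 4],
))
```

In this toy run the stream yields two batches from the first corpus, two from the second, then drains the third, mirroring the two switch points marked in Figure 2.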

<img src="figures/tloss_full.png" width="900"/>

Figure 1: Training loss.

<img src="figures/tloss_closeup.png" width="900"/>

Figure 2: Training loss closeup. We mark the two hotswap points where training corpus #1 was switched for internal-corpus #2 and internal-corpus #2.1, respectively. <TBD MF>

<img src="figures/vloss_closeup.png" width="900"/>

Figure 3: Test loss closeup; testing performed on internal-corpus #1. <TBD MF>

## Training Method