aisquared/dlite-v2-355m

@@ -96,17 +96,17 @@ We present the results from various model benchmarks on the EleutherAI LLM Evalu
 Model results are sorted by mean score, ascending, to provide an ordering. These metrics serve to further show that none of the DLite models are
 state of the art, but rather further show that chat-like behaviors in LLMs can be trained almost independent of model size.
-| model         |   openbookqa |   arc_easy |   winogrande |   hellaswag |   arc_challenge |     piqa |    boolq |
-|:--------------|-------------:|-----------:|-------------:|------------:|----------------:|---------:|---------:|
-| gpt2          |        0.164 |   0.438131 |     0.51618  |    0.289185 |        0.190273 | 0.628945 | 0.487156 |
-| dlite-v2-124m |        0.174 |   0.44697  |     0.502762 |    0.291974 |        0.192833 | 0.631665 | 0.520183 |
-| dlite-v1-124m |        0.17  |   0.462542 |     0.494081 |    0.293268 |        0.223549 | 0.622416 | 0.502446 |
-| gpt2-medium   |        0.186 |   0.490741 |     0.531176 |    0.333101 |        0.215017 | 0.676279 | 0.585933 |
-| dlite-v2-355m |        0.206 |   0.493687 |     0.524073 |    0.334993 |        0.226109 | 0.670838 | 0.582263 |
-| dlite-v1-355m |        0.216 |   0.507576 |     0.496448 |    0.338478 |        0.234642 | 0.664309 | 0.600306 |
-| gpt2-large    |        0.194 |   0.531566 |     0.553275 |    0.363971 |        0.216724 | 0.703482 | 0.604893 |
-| dlite-774m-v2 |        0.212 |   0.539562 |     0.5588   |    0.365565 |        0.234642 | 0.700218 | 0.60367  |
-| dlite-774m-v1 |        0.218 |   0.545875 |     0.562747 |    0.375124 |        0.250853 | 0.698041 | 0.614985 |
-| gpt2-xl       |        0.224 |   0.582912 |     0.583268 |    0.400418 |        0.25     | 0.708379 | 0.617737 |
-| dlite-v1-1.5b |        0.226 |   0.588384 |     0.584846 |    0.401414 |        0.268771 | 0.708379 | 0.624159 |
-| dlite-v2-1.5b |        0.226 |   0.59596  |     0.581689 |    0.40719  |        0.273891 | 0.705114 | 0.630887 |

+| Model         |   arc_challenge |   arc_easy |    boolq |   hellaswag |   openbookqa |     piqa |   winogrande |
+|:--------------|----------------:|-----------:|---------:|------------:|-------------:|---------:|-------------:|
+| dlite-v2-124m |        0.199659 |   0.447811 | 0.494801 |    0.291675 |        0.156 | 0.620239 |     0.487766 |
+| gpt2          |        0.190273 |   0.438131 | 0.487156 |    0.289185 |        0.164 | 0.628945 |     0.51618  |
+| dlite-v1-124m |        0.223549 |   0.462542 | 0.502446 |    0.293268 |        0.17  | 0.622416 |     0.494081 |
+| gpt2-medium   |        0.215017 |   0.490741 | 0.585933 |    0.333101 |        0.186 | 0.676279 |     0.531176 |
+| dlite-v2-355m |        0.251706 |   0.486111 | 0.547401 |    0.344354 |        0.216 | 0.671926 |     0.52723  |
+| dlite-v1-355m |        0.234642 |   0.507576 | 0.600306 |    0.338478 |        0.216 | 0.664309 |     0.496448 |
+| gpt2-large    |        0.216724 |   0.531566 | 0.604893 |    0.363971 |        0.194 | 0.703482 |     0.553275 |
+| dlite-v1-774m |        0.250853 |   0.545875 | 0.614985 |    0.375124 |        0.218 | 0.698041 |     0.562747 |
+| dlite-v2-774m |        0.269625 |   0.52904  | 0.613761 |    0.395937 |        0.256 | 0.691513 |     0.566693 |
+| gpt2-xl       |        0.25     |   0.582912 | 0.617737 |    0.400418 |        0.224 | 0.708379 |     0.583268 |
+| dlite-v1-1_5b |        0.268771 |   0.588384 | 0.624159 |    0.401414 |        0.226 | 0.708379 |     0.584846 |
+| dlite-v2-1_5b |        0.289249 |   0.565657 | 0.601223 |    0.434077 |        0.272 | 0.703482 |     0.588003 |
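
The ordering described above ("sorted by mean score, ascending") can be sketched as follows. This is a minimal illustration, assuming the mean is the unweighted average of the seven task scores in each row; the source does not state how the mean is computed. The names `TASKS`, `SCORES`, and `rank_by_mean` are hypothetical and not part of the model card. The three rows are copied from the updated table and deliberately listed out of order.

```python
from statistics import mean

# Column order of the updated table (kept for reference; assumed unweighted).
TASKS = ["arc_challenge", "arc_easy", "boolq", "hellaswag",
         "openbookqa", "piqa", "winogrande"]

# Three rows copied from the updated table, listed out of order on purpose.
SCORES = {
    "gpt2":          [0.190273, 0.438131, 0.487156, 0.289185, 0.164, 0.628945, 0.51618],
    "dlite-v1-124m": [0.223549, 0.462542, 0.502446, 0.293268, 0.17, 0.622416, 0.494081],
    "dlite-v2-124m": [0.199659, 0.447811, 0.494801, 0.291675, 0.156, 0.620239, 0.487766],
}


def rank_by_mean(scores):
    """Return (model, mean score) pairs sorted by mean score, ascending."""
    pairs = ((name, mean(values)) for name, values in scores.items())
    return sorted(pairs, key=lambda pair: pair[1])


ranking = rank_by_mean(SCORES)
for name, avg in ranking:
    print(f"{name:<14} {avg:.4f}")
```

Under that assumption, the three rows come out in the same order in which they appear in the updated table.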