chargoddard committed
Commit 388f3eb
Parent(s): 075d67c
Update README.md

README.md CHANGED
@@ -15,4 +15,19 @@ layer_slices:
   end: 40
 ```
 
-No fine tuning was done on this model. Yes, it's still coherent somehow.
+No fine tuning was done on this model. Yes, it's still coherent somehow.
+
+Benchmark results:
+| Benchmark | Llama2-13b | Llama2-26b-tcs | Percent Change |
+| --- | --- | --- | --- |
+| ARC | 59.3 | 55.03 | -7.2% |
+| HellaSwag | 82.15 | 79.9 | -2.74% |
+| MMLU | 55.67 | 53.73 | -3.48% |
+| TruthfulQA | 37.39 | 40.48 | +5.59% |
+| Average | 58.63 | 57.29 | -2.29% |
+| Average Minus TQA | 65.70 | 62.85 | -4.34% |
+
+
+This tells us two very important things:
+1. TruthfulQA is a perfect benchmark in every way.
+2. Llama models are amazingly robust to being fed their own output.