pdelobelle committed • Commit dfb2efc • Parent(s): 607f97c
Update README.md

README.md CHANGED
@@ -56,6 +56,76 @@ Key improvements over Gemma-2B baseline:

Consistently outperforms both the base Gemma-2B and other German models like LLaMmlein-1B across most tasks.

<table class="model-comparison">
  <thead>
    <tr>
      <th align="left">Model</th>
      <th align="center" colspan="2">ARC-DE</th>
      <th align="center" colspan="2">HellaSwag-DE</th>
      <th align="center">TruthfulQA-DE</th>
      <th align="center">Average</th>
    </tr>
    <tr>
      <th></th>
      <th align="center">0-shot</th>
      <th align="center">3-shot</th>
      <th align="center">0-shot</th>
      <th align="center">3-shot</th>
      <th align="center">0-shot</th>
      <th align="center">0-shot</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Gemma-2-2B</td>
      <td align="center">22.9</td>
      <td align="center">23.1</td>
      <td align="center">28.0</td>
      <td align="center">27.6</td>
      <td align="center">25.5</td>
      <td align="center">25.5</td>
    </tr>
    <tr>
      <td>LLaMmlein-120M</td>
      <td align="center">24.7 ↑+8%</td>
      <td align="center">-</td>
      <td align="center">32.0 ↑+14%</td>
      <td align="center">-</td>
      <td align="center">25.0 ↓-2%</td>
      <td align="center">27.2 ↑+7%</td>
    </tr>
    <tr>
      <td>LLaMmlein-1B</td>
      <td align="center">30.0 ↑+31%</td>
      <td align="center">-</td>
      <td align="center"><strong>48.5</strong> ↑+73%</td>
      <td align="center">-</td>
      <td align="center">23.4 ↓-8%</td>
      <td align="center">34.0 ↑+33%</td>
    </tr>
    <tr>
      <td>Sauerkraut-Gemma-2B</td>
      <td align="center">28.0 ↑+22%</td>
      <td align="center">34.6 ↑+50%</td>
      <td align="center">37.2 ↑+33%</td>
      <td align="center">44.1 ↑+60%</td>
      <td align="center"><strong>32.9</strong> ↑+29%</td>
      <td align="center">32.7 ↑+28%</td>
    </tr>
    <tr>
      <td><strong>BübleLM (Ours)</strong></td>
      <td align="center"><strong>32.3</strong> ↑+41%</td>
      <td align="center"><strong>35.2</strong> ↑+52%</td>
      <td align="center">47.9 ↑+71%</td>
      <td align="center"><strong>46.6</strong> ↑+69%</td>
      <td align="center">27.2 ↑+7%</td>
      <td align="center"><strong>35.8</strong> ↑+40%</td>
    </tr>
  </tbody>
</table>

*Performance evaluated on German versions of ARC (knowledge-based QA), HellaSwag (commonsense reasoning), and TruthfulQA (truthfulness). Values show accuracy in percentages, with arrows indicating relative improvement over the Gemma-2B baseline. Best results shown in bold.*

## Safety & Ethics

### Toxicity