Update README.md
README.md
CHANGED
@@ -207,40 +207,40 @@ This Ahma 7B base model was primarily evaluated using [FIN-bench by TurkuNLP](ht

| Benchmark | Ahma 3B base (instruct prompt format) | Ahma 3B Instruct (instruct prompt format) | Ahma 7B base (instruct prompt format) | Ahma 7B Instruct (instruct prompt format) | FinGPT 8B | Viking 7B | Poro 34B (8bit quant) |
|:---------------------------|:--------------------------------------|:------------------------------------------|:--------------------------------------|:------------------------------------------|:----------|:----------|:----------------------|
-| Analogies | 50.77 | 48.46 | 56.92 |
-| Arithmetic | 27.64 | 22.14 | 11.50 |
-| Cause and Effect | 59.48 | 58.82 | 59.48 |
-| Emotions | 36.25 | 28.12 | 36.25 |
-| Empirical Judgements | 33.33 | 35.35 | 33.33 |
-| General Knowledge | 44.29 | 48.57 | 51.43 |
-| HHH Alignment | 42.09 | 41.66 | 44.23 |
-| Intent Recognition | 24.42 | 26.16 | 43.64 |
-| Misconceptions | 46.27 | 47.01 | 46.27 |
-| Paraphrase | 59.50 | 73.00 | 67.00 |
-| Sentence Ambiguity | 53.33 | 65.00 | 60.00 |
-| Similarities Abstraction | 65.79 | 68.42 | 71.05 |
-| **Non-Arithmetic Average** | **47.55** | **48.95** | **51.33** |
-| **Overall Average** | **36.49** | **34.06** | **29.20** |
+| Analogies | 50.77 | 48.46 | 56.92 | 41.54 | 49.23 | 40.00 | 54.62 |
+| Arithmetic | 27.64 | 22.14 | 11.50 | 14.70 | 33.15 | 30.16 | 30.34 |
+| Cause and Effect | 59.48 | 58.82 | 59.48 | 53.60 | 66.01 | 58.82 | 62.74 |
+| Emotions | 36.25 | 28.12 | 36.25 | 27.50 | 22.50 | 26.25 | 35.63 |
+| Empirical Judgements | 33.33 | 35.35 | 33.33 | 33.33 | 27.27 | 33.33 | 49.49 |
+| General Knowledge | 44.29 | 48.57 | 51.43 | 37.14 | 40.00 | 24.29 | 51.43 |
+| HHH Alignment | 42.09 | 41.66 | 44.23 | 43.22 | 41.81 | 42.51 | 42.92 |
+| Intent Recognition | 24.42 | 26.16 | 43.64 | 56.94 | 17.49 | 22.40 | 68.35 |
+| Misconceptions | 46.27 | 47.01 | 46.27 | 47.01 | 53.73 | 53.73 | 52.24 |
+| Paraphrase | 59.50 | 73.00 | 67.00 | 70.50 | 51.00 | 50.00 | 51.00 |
+| Sentence Ambiguity | 53.33 | 65.00 | 60.00 | 63.33 | 51.67 | 48.33 | 50.00 |
+| Similarities Abstraction | 65.79 | 68.42 | 71.05 | 61.84 | 60.53 | 65.79 | 60.53 |
+| **Non-Arithmetic Average** | **47.55** | **48.95** | **51.33** | **48.30** | **46.17** | **44.42** | **52.08** |
+| **Overall Average** | **36.49** | **34.06** | **29.20** | **29.64** | **38.93** | **36.50** | **40.00** |

3-shot results:

| Benchmark | Ahma 3B base (instruct prompt format) | Ahma 3B Instruct (instruct prompt format) | Ahma 7B base (instruct prompt format) | Ahma 7B Instruct (instruct prompt format) | FinGPT 8B | Viking 7B | Poro 34B (8bit quant) |
|:---------------------------|:--------------------------------------|:------------------------------------------|:--------------------------------------|:------------------------------------------|:----------|:----------|:----------------------|
-| Analogies | 50.77 | 49.23 | 49.23 |
-| Arithmetic | 38.38 | 43.89 | 20.88 |
-| Cause and Effect | 60.78 | 64.71 | 66.01 |
-| Emotions | 30.00 | 41.25 | 30.00 |
-| Empirical Judgements | 46.46 | 44.44 | 39.39 |
-| General Knowledge | 47.14 | 40.00 | 27.14 |
-| HHH Alignment | 43.53 | 44.80 | 43.80 |
-| Intent Recognition | 20.52 | 44.22 | 36.42 |
-| Misconceptions | 50.75 | 52.24 | 46.27 |
-| Paraphrase | 50.50 | 58.50 | 57.50 |
-| Sentence Ambiguity | 53.33 | 48.33 | 53.33 |
-| Similarities Abstraction | 69.74 | 72.37 | 72.37 |
-| **Non-Arithmetic Average** | **48.48** | **51.49** | **49.05** |
-| **Overall Average** | **42.87** | **47.27** | **33.41** |
+| Analogies | 50.77 | 49.23 | 49.23 | 43.08 | 40.77 | 54.62 | 76.92 |
+| Arithmetic | 38.38 | 43.89 | 20.88 | 26.81 | 43.63 | 45.78 | 53.68 |
+| Cause and Effect | 60.78 | 64.71 | 66.01 | 62.74 | 64.05 | 58.17 | 67.32 |
+| Emotions | 30.00 | 41.25 | 30.00 | 53.75 | 44.37 | 48.13 | 56.87 |
+| Empirical Judgements | 46.46 | 44.44 | 39.39 | 39.39 | 32.32 | 43.43 | 63.64 |
+| General Knowledge | 47.14 | 40.00 | 27.14 | 44.29 | 54.29 | 28.57 | 74.29 |
+| HHH Alignment | 43.53 | 44.80 | 43.80 | 45.09 | 45.39 | 44.80 | 46.07 |
+| Intent Recognition | 20.52 | 44.22 | 36.42 | 39.02 | 51.45 | 58.82 | 83.67 |
+| Misconceptions | 50.75 | 52.24 | 46.27 | 51.49 | 52.99 | 46.27 | 52.99 |
+| Paraphrase | 50.50 | 58.50 | 57.50 | 65.00 | 53.00 | 54.50 | 55.00 |
+| Sentence Ambiguity | 53.33 | 48.33 | 53.33 | 51.67 | 51.67 | 53.33 | 66.67 |
+| Similarities Abstraction | 69.74 | 72.37 | 72.37 | 69.74 | 64.47 | 73.68 | 75.00 |
+| **Non-Arithmetic Average** | **48.48** | **51.49** | **49.05** | **51.63** | **51.19** | **50.94** | **61.96** |
+| **Overall Average** | **42.87** | **47.27** | **33.41** | **37.84** | **46.99** | **48.07** | **57.36** |

As we can see, the Ahma 7B base model performs poorly on arithmetic tasks, but on non-arithmetic tasks it clearly outperforms similarly sized models like FinGPT 8B and Viking 7B, especially in 0-shot usage. On non-arithmetic tasks in 0-shot usage, the Ahma 7B base model is even on par with the 5X larger Poro 34B model. This result might be attributed to Ahma's 2-stage pretraining and the inclusion of instruction-following examples during the pretraining phase.

@@ -254,31 +254,31 @@ This Ahma 7B base model was also evaluated using [MTBench Finnish by LumiOpen](h

Single-turn results:

-| Benchmark | Ahma 3B base (instruct prompt format) | Ahma 3B Instruct | Ahma 7B base (instruct prompt format) | Ahma 7B Instruct |
-|:--------------------|:--------------------------------------|:-----------------|:--------------------------------------|:-----------------|
-| Coding | 1.00 | 1.00 |
-| Extraction | 2.00 | 1.30 |
-| Humanities | 4.05 | 6.20 |
-| Math | 3.00 | 3.20 |
-| Reasoning | 2.90 | 4.60 |
-| Roleplay | 4.80 | 6.50 |
-| STEM | 5.10 | 5.95 |
-| Writing | 6.60 | 9.00 |
-| **Overall Average** | **3.68** | **4.72** |
+| Benchmark | Ahma 3B base (instruct prompt format) | Ahma 3B Instruct (instruct prompt format) | Ahma 7B base (instruct prompt format) | Ahma 7B Instruct (instruct prompt format) |
+|:--------------------|:--------------------------------------|:------------------------------------------|:--------------------------------------|:------------------------------------------|
+| Coding | 1.00 | 1.00 | 1.70 | 1.10 |
+| Extraction | 2.00 | 1.30 | 3.10 | 3.00 |
+| Humanities | 4.05 | 6.20 | 6.60 | 8.00 |
+| Math | 3.00 | 3.20 | 3.90 | 2.90 |
+| Reasoning | 2.90 | 4.60 | 3.70 | 5.70 |
+| Roleplay | 4.80 | 6.50 | 6.60 | 7.20 |
+| STEM | 5.10 | 5.95 | 6.75 | 7.30 |
+| Writing | 6.60 | 9.00 | 7.10 | 8.80 |
+| **Overall Average** | **3.68** | **4.72** | **4.93** | **5.50** |

Multi-turn results:

-| Benchmark | Ahma 3B base (instruct prompt format) | Ahma 3B Instruct | Ahma 7B base (instruct prompt format) | Ahma 7B Instruct | Poro 34B Chat |
-|:--------------------|:--------------------------------------|:-----------------|:--------------------------------------|:-----------------|:--------------|
-| Coding | 1.00 | 1.00 |
-| Extraction | 1.55 | 1.15 |
-| Humanities | 3.25 | 6.20 |
-| Math | 2.20 | 2.70 |
-| Reasoning | 2.45 | 3.50 |
-| Roleplay | 4.90 | 6.40 |
-| STEM | 4.20 | 4.78 |
-| Writing | 3.80 | 6.65 |
-| **Overall Average** | **2.92** | **4.05** |
+| Benchmark | Ahma 3B base (instruct prompt format) | Ahma 3B Instruct (instruct prompt format) | Ahma 7B base (instruct prompt format) | Ahma 7B Instruct (instruct prompt format) | Poro 34B Chat |
+|:--------------------|:--------------------------------------|:------------------------------------------|:--------------------------------------|:------------------------------------------|:--------------|
+| Coding | 1.00 | 1.00 | 1.40 | 1.05 | 3.70 |
+| Extraction | 1.55 | 1.15 | 2.05 | 2.65 | 6.37 |
+| Humanities | 3.25 | 6.20 | 4.95 | 7.85 | 9.25 |
+| Math | 2.20 | 2.70 | 2.50 | 2.40 | 1.20 |
+| Reasoning | 2.45 | 3.50 | 2.55 | 4.50 | 4.35 |
+| Roleplay | 4.90 | 6.40 | 6.35 | 6.60 | 7.35 |
+| STEM | 4.20 | 4.78 | 4.28 | 5.40 | 7.80 |
+| Writing | 3.80 | 6.65 | 4.10 | 6.25 | 8.50 |
+| **Overall Average** | **2.92** | **4.05** | **3.52** | **4.59** | **6.06** |

As we can see, the Ahma 7B base model struggles with multi-turn examples, as expected, since it has only been pretrained on single-turn instruction-following examples. In addition, coding performance was expectedly poor because the Ahma 7B model was not trained on code data. In the single-turn setting, Ahma 7B base beats both the Ahma 3B base and Instruct versions, demonstrating greater base capability that can be further improved with instruct-tuning.
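As a side note on how the summary rows are derived: the MTBench **Overall Average** values above match the unweighted mean of the eight category scores, which the minimal sketch below (Python; scores copied from the single-turn table, model names used only as labels) reproduces. The FIN-bench averages do not reduce to a simple mean of the rows shown, presumably because tasks such as Arithmetic consist of multiple weighted subtasks.

```python
# Minimal sketch, assuming "Overall Average" is the unweighted mean of the
# eight MTBench category scores. Values copied from the single-turn table:
# Coding, Extraction, Humanities, Math, Reasoning, Roleplay, STEM, Writing.
single_turn = {
    "Ahma 3B base":     [1.00, 2.00, 4.05, 3.00, 2.90, 4.80, 5.10, 6.60],
    "Ahma 3B Instruct": [1.00, 1.30, 6.20, 3.20, 4.60, 6.50, 5.95, 9.00],
    "Ahma 7B base":     [1.70, 3.10, 6.60, 3.90, 3.70, 6.60, 6.75, 7.10],
    "Ahma 7B Instruct": [1.10, 3.00, 8.00, 2.90, 5.70, 7.20, 7.30, 8.80],
}

for model, scores in single_turn.items():
    # Unweighted mean over the eight categories, rounded to two decimals.
    print(f"{model}: {sum(scores) / len(scores):.2f}")
# Prints 3.68, 4.72, 4.93, and 5.50, matching the table's Overall Average row.
```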