aapot committed · verified · Commit ee9147f · 1 Parent(s): fcb4832

Update README.md

Files changed (1):
  1. README.md +50 -50

README.md CHANGED
@@ -207,40 +207,40 @@ This Ahma 7B base model was primarily evaluated using [FIN-bench by TurkuNLP](ht
 
 | Benchmark | Ahma 3B base (instruct prompt format) | Ahma 3B Instruct (instruct prompt format) | Ahma 7B base (instruct prompt format) | Ahma 7B Instruct (instruct prompt format) | FinGPT 8B | Viking 7B | Poro 34B (8bit quant) |
 |:---------------------------|:--------------------------------------|:------------------------------------------|:--------------------------------------|:------------------------------------------|:----------|:----------|:----------------------|
- | Analogies | 50.77 | 48.46 | 56.92 | TBA | 49.23 | 40.00 | 54.62 |
- | Arithmetic | 27.64 | 22.14 | 11.50 | TBA | 33.15 | 30.16 | 30.34 |
- | Cause and Effect | 59.48 | 58.82 | 59.48 | TBA | 66.01 | 58.82 | 62.74 |
- | Emotions | 36.25 | 28.12 | 36.25 | TBA | 22.50 | 26.25 | 35.63 |
- | Empirical Judgements | 33.33 | 35.35 | 33.33 | TBA | 27.27 | 33.33 | 49.49 |
- | General Knowledge | 44.29 | 48.57 | 51.43 | TBA | 40.00 | 24.29 | 51.43 |
- | HHH Alignment | 42.09 | 41.66 | 44.23 | TBA | 41.81 | 42.51 | 42.92 |
- | Intent Recognition | 24.42 | 26.16 | 43.64 | TBA | 17.49 | 22.40 | 68.35 |
- | Misconceptions | 46.27 | 47.01 | 46.27 | TBA | 53.73 | 53.73 | 52.24 |
- | Paraphrase | 59.50 | 73.00 | 67.00 | TBA | 51.00 | 50.00 | 51.00 |
- | Sentence Ambiguity | 53.33 | 65.00 | 60.00 | TBA | 51.67 | 48.33 | 50.00 |
- | Similarities Abstraction | 65.79 | 68.42 | 71.05 | TBA | 60.53 | 65.79 | 60.53 |
- | **Non-Arithmetic Average** | **47.55** | **48.95** | **51.33** | TBA | **46.17** | **44.42** | **52.08** |
- | **Overall Average** | **36.49** | **34.06** | **29.20** | TBA | **38.93** | **36.50** | **40.00** |
+ | Analogies | 50.77 | 48.46 | 56.92 | 41.54 | 49.23 | 40.00 | 54.62 |
+ | Arithmetic | 27.64 | 22.14 | 11.50 | 14.70 | 33.15 | 30.16 | 30.34 |
+ | Cause and Effect | 59.48 | 58.82 | 59.48 | 53.60 | 66.01 | 58.82 | 62.74 |
+ | Emotions | 36.25 | 28.12 | 36.25 | 27.50 | 22.50 | 26.25 | 35.63 |
+ | Empirical Judgements | 33.33 | 35.35 | 33.33 | 33.33 | 27.27 | 33.33 | 49.49 |
+ | General Knowledge | 44.29 | 48.57 | 51.43 | 37.14 | 40.00 | 24.29 | 51.43 |
+ | HHH Alignment | 42.09 | 41.66 | 44.23 | 43.22 | 41.81 | 42.51 | 42.92 |
+ | Intent Recognition | 24.42 | 26.16 | 43.64 | 56.94 | 17.49 | 22.40 | 68.35 |
+ | Misconceptions | 46.27 | 47.01 | 46.27 | 47.01 | 53.73 | 53.73 | 52.24 |
+ | Paraphrase | 59.50 | 73.00 | 67.00 | 70.50 | 51.00 | 50.00 | 51.00 |
+ | Sentence Ambiguity | 53.33 | 65.00 | 60.00 | 63.33 | 51.67 | 48.33 | 50.00 |
+ | Similarities Abstraction | 65.79 | 68.42 | 71.05 | 61.84 | 60.53 | 65.79 | 60.53 |
+ | **Non-Arithmetic Average** | **47.55** | **48.95** | **51.33** | **48.30** | **46.17** | **44.42** | **52.08** |
+ | **Overall Average** | **36.49** | **34.06** | **29.20** | **29.64** | **38.93** | **36.50** | **40.00** |
 
 
 3-shot results:
 
 | Benchmark | Ahma 3B base (instruct prompt format) | Ahma 3B Instruct (instruct prompt format) | Ahma 7B base (instruct prompt format) | Ahma 7B Instruct (instruct prompt format) | FinGPT 8B | Viking 7B | Poro 34B (8bit quant) |
 |:---------------------------|:--------------------------------------|:------------------------------------------|:--------------------------------------|:------------------------------------------|:----------|:----------|:----------------------|
- | Analogies | 50.77 | 49.23 | 49.23 | TBA | 40.77 | 54.62 | 76.92 |
- | Arithmetic | 38.38 | 43.89 | 20.88 | TBA | 43.63 | 45.78 | 53.68 |
- | Cause and Effect | 60.78 | 64.71 | 66.01 | TBA | 64.05 | 58.17 | 67.32 |
- | Emotions | 30.00 | 41.25 | 30.00 | TBA | 44.37 | 48.13 | 56.87 |
- | Empirical Judgements | 46.46 | 44.44 | 39.39 | TBA | 32.32 | 43.43 | 63.64 |
- | General Knowledge | 47.14 | 40.00 | 27.14 | TBA | 54.29 | 28.57 | 74.29 |
- | HHH Alignment | 43.53 | 44.80 | 43.80 | TBA | 45.39 | 44.80 | 46.07 |
- | Intent Recognition | 20.52 | 44.22 | 36.42 | TBA | 51.45 | 58.82 | 83.67 |
- | Misconceptions | 50.75 | 52.24 | 46.27 | TBA | 52.99 | 46.27 | 52.99 |
- | Paraphrase | 50.50 | 58.50 | 57.50 | TBA | 53.00 | 54.50 | 55.00 |
- | Sentence Ambiguity | 53.33 | 48.33 | 53.33 | TBA | 51.67 | 53.33 | 66.67 |
- | Similarities Abstraction | 69.74 | 72.37 | 72.37 | TBA | 64.47 | 73.68 | 75.00 |
- | **Non-Arithmetic Average** | **48.48** | **51.49** | **49.05** | TBA | **51.19** | **50.94** | **61.96** |
- | **Overall Average** | **42.87** | **47.27** | **33.41** | TBA | **46.99** | **48.07** | **57.36** |
+ | Analogies | 50.77 | 49.23 | 49.23 | 43.08 | 40.77 | 54.62 | 76.92 |
+ | Arithmetic | 38.38 | 43.89 | 20.88 | 26.81 | 43.63 | 45.78 | 53.68 |
+ | Cause and Effect | 60.78 | 64.71 | 66.01 | 62.74 | 64.05 | 58.17 | 67.32 |
+ | Emotions | 30.00 | 41.25 | 30.00 | 53.75 | 44.37 | 48.13 | 56.87 |
+ | Empirical Judgements | 46.46 | 44.44 | 39.39 | 39.39 | 32.32 | 43.43 | 63.64 |
+ | General Knowledge | 47.14 | 40.00 | 27.14 | 44.29 | 54.29 | 28.57 | 74.29 |
+ | HHH Alignment | 43.53 | 44.80 | 43.80 | 45.09 | 45.39 | 44.80 | 46.07 |
+ | Intent Recognition | 20.52 | 44.22 | 36.42 | 39.02 | 51.45 | 58.82 | 83.67 |
+ | Misconceptions | 50.75 | 52.24 | 46.27 | 51.49 | 52.99 | 46.27 | 52.99 |
+ | Paraphrase | 50.50 | 58.50 | 57.50 | 65.00 | 53.00 | 54.50 | 55.00 |
+ | Sentence Ambiguity | 53.33 | 48.33 | 53.33 | 51.67 | 51.67 | 53.33 | 66.67 |
+ | Similarities Abstraction | 69.74 | 72.37 | 72.37 | 69.74 | 64.47 | 73.68 | 75.00 |
+ | **Non-Arithmetic Average** | **48.48** | **51.49** | **49.05** | **51.63** | **51.19** | **50.94** | **61.96** |
+ | **Overall Average** | **42.87** | **47.27** | **33.41** | **37.84** | **46.99** | **48.07** | **57.36** |
 
 
 As we can see, the Ahma 7B base model performs poorly on arithmetic, but on non-arithmetic tasks it clearly outperforms similarly sized models such as FinGPT 8B and Viking 7B, especially in 0-shot usage. In 0-shot non-arithmetic tasks, the Ahma 7B base model is even on par with the roughly 5x larger Poro 34B model. This result might be attributed to Ahma's 2-stage pretraining and the inclusion of instruction-following examples during the pretraining phase.
@@ -254,31 +254,31 @@ This Ahma 7B base model was also evaluated using [MTBench Finnish by LumiOpen](h
 
 Single-turn results:
 
- | Benchmark | Ahma 3B base (instruct prompt format) | Ahma 3B Instruct | Ahma 7B base (instruct prompt format) | Ahma 7B Instruct |
- |:--------------------|:--------------------------------------|:-----------------|:--------------------------------------|:-----------------|
- | Coding | 1.00 | 1.00 | 1.70 | TBA |
- | Extraction | 2.00 | 1.30 | 3.10 | TBA |
- | Humanities | 4.05 | 6.20 | 6.60 | TBA |
- | Math | 3.00 | 3.20 | 3.90 | TBA |
- | Reasoning | 2.90 | 4.60 | 3.70 | TBA |
- | Roleplay | 4.80 | 6.50 | 6.60 | TBA |
- | STEM | 5.10 | 5.95 | 6.75 | TBA |
- | Writing | 6.60 | 9.00 | 7.10 | TBA |
- | **Overall Average** | **3.68** | **4.72** | **4.93** | TBA |
+ | Benchmark | Ahma 3B base (instruct prompt format) | Ahma 3B Instruct (instruct prompt format) | Ahma 7B base (instruct prompt format) | Ahma 7B Instruct (instruct prompt format) |
+ |:--------------------|:--------------------------------------|:------------------------------------------|:--------------------------------------|:------------------------------------------|
+ | Coding | 1.00 | 1.00 | 1.70 | 1.10 |
+ | Extraction | 2.00 | 1.30 | 3.10 | 3.00 |
+ | Humanities | 4.05 | 6.20 | 6.60 | 8.00 |
+ | Math | 3.00 | 3.20 | 3.90 | 2.90 |
+ | Reasoning | 2.90 | 4.60 | 3.70 | 5.70 |
+ | Roleplay | 4.80 | 6.50 | 6.60 | 7.20 |
+ | STEM | 5.10 | 5.95 | 6.75 | 7.30 |
+ | Writing | 6.60 | 9.00 | 7.10 | 8.80 |
+ | **Overall Average** | **3.68** | **4.72** | **4.93** | **5.50** |
 
 Multi-turn results:
 
- | Benchmark | Ahma 3B base (instruct prompt format) | Ahma 3B Instruct | Ahma 7B base (instruct prompt format) | Ahma 7B Instruct | Poro 34B Chat |
- |:--------------------|:--------------------------------------|:-----------------|:--------------------------------------|:-----------------|:--------------|
- | Coding | 1.00 | 1.00 | 1.40 | TBA | 3.70 |
- | Extraction | 1.55 | 1.15 | 2.05 | TBA | 6.37 |
- | Humanities | 3.25 | 6.20 | 4.95 | TBA | 9.25 |
- | Math | 2.20 | 2.70 | 2.50 | TBA | 1.20 |
- | Reasoning | 2.45 | 3.50 | 2.55 | TBA | 4.35 |
- | Roleplay | 4.90 | 6.40 | 6.35 | TBA | 7.35 |
- | STEM | 4.20 | 4.78 | 4.28 | TBA | 7.80 |
- | Writing | 3.80 | 6.65 | 4.10 | TBA | 8.50 |
- | **Overall Average** | **2.92** | **4.05** | **3.52** | TBA | **6.06** |
+ | Benchmark | Ahma 3B base (instruct prompt format) | Ahma 3B Instruct (instruct prompt format) | Ahma 7B base (instruct prompt format) | Ahma 7B Instruct (instruct prompt format) | Poro 34B Chat |
+ |:--------------------|:--------------------------------------|:------------------------------------------|:--------------------------------------|:------------------------------------------|:--------------|
+ | Coding | 1.00 | 1.00 | 1.40 | 1.05 | 3.70 |
+ | Extraction | 1.55 | 1.15 | 2.05 | 2.65 | 6.37 |
+ | Humanities | 3.25 | 6.20 | 4.95 | 7.85 | 9.25 |
+ | Math | 2.20 | 2.70 | 2.50 | 2.40 | 1.20 |
+ | Reasoning | 2.45 | 3.50 | 2.55 | 4.50 | 4.35 |
+ | Roleplay | 4.90 | 6.40 | 6.35 | 6.60 | 7.35 |
+ | STEM | 4.20 | 4.78 | 4.28 | 5.40 | 7.80 |
+ | Writing | 3.80 | 6.65 | 4.10 | 6.25 | 8.50 |
+ | **Overall Average** | **2.92** | **4.05** | **3.52** | **4.59** | **6.06** |
 
 
 As we can see, the Ahma 7B base model struggles with multi-turn examples, as expected, since it has only been pretrained with single-turn instruction-following examples. Coding performance was also poor, as expected, because the Ahma 7B model is not trained on code data. In the single-turn setting, Ahma 7B beats both the Ahma 3B base and Instruct-tuned versions, demonstrating a stronger base capability that can be further improved with instruct-tuning.
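
With the Ahma 7B Instruct results now filled in above, a minimal generation sketch with the transformers library may be useful. The repo id `Finnish-NLP/Ahma-7B` and the bare prompt below are assumptions for illustration; to reproduce the "(instruct prompt format)" numbers, wrap the prompt in the instruct template documented elsewhere in this README.

```python
# Minimal sketch, not part of this commit. The repo id is an assumption;
# check the model card header for the canonical name.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Finnish-NLP/Ahma-7B"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Mikä on Suomen pääkaupunki?"  # "What is the capital of Finland?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```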
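
For the 0-shot vs. 3-shot distinction in the FIN-bench tables above, the sketch below illustrates the general idea of few-shot prompt construction. It is illustrative only: the `Kysymys`/`Vastaus` labels and the helper are assumed placeholders, not the templates actually used by the FIN-bench harness.

```python
# Illustrative only: how a 0-shot prompt differs from a 3-shot prompt.
from typing import Sequence, Tuple

def build_prompt(question: str, examples: Sequence[Tuple[str, str]] = ()) -> str:
    """Prepend solved examples (the "shots") before the actual question."""
    parts = [f"Kysymys: {q}\nVastaus: {a}" for q, a in examples]
    parts.append(f"Kysymys: {question}\nVastaus:")
    return "\n\n".join(parts)

# 0-shot: the model sees only the question.
zero_shot = build_prompt("Paljonko on 17 + 25?")

# 3-shot: three solved examples are prepended before the question.
three_shot = build_prompt(
    "Paljonko on 17 + 25?",
    [("Paljonko on 1 + 1?", "2"),
     ("Paljonko on 3 + 4?", "7"),
     ("Paljonko on 10 + 5?", "15")],
)
print(three_shot)
```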