---
library_name: transformers
license: apache-2.0
datasets:
- werty1248/s1k-1.1-Ko-ReGenerated-Formatted
language:
- ko
base_model:
- Qwen/Qwen2.5-32B-Instruct
pipeline_tag: text-generation
---

### Korean S1K-1.1

- Generation result: [werty1248/qwen-32b-s1.1-Ko-Native-result](https://huggingface.co/datasets/werty1248/qwen-32b-s1.1-Ko-Native-result)
- AIME 2024 (Korean): 30% (9/30) with ```max_tokens=8192```
  - Chinese sometimes appears in the think tokens. This is not hallucination or failure behavior: the model solves the problem correctly in Chinese, then answers in Korean.

### Training Details

- Trained with the [s1k official code](https://github.com/simplescaling/s1)
  - ```block_size = 20000```
  - ```gradient_checkpointing=True```
  - (A sketch of where these two settings apply appears under Code Sketches below.)
- Training dataset: [werty1248/s1k-1.1-Ko-ReGenerated-Formatted](https://huggingface.co/datasets/werty1248/s1k-1.1-Ko-ReGenerated-Formatted)
  - Questions from [s1k-1.1](https://huggingface.co/datasets/simplescaling/s1K-1.1) were translated into Korean, and the reasoning traces were then regenerated with DeepSeek-R1.
  - The "text" column needs to be converted to the Qwen chat format (see the conversion sketch below).
- Hardware and cost: 8x H200 SXM, 2 hours; $63.84, not counting the cost of trial and error :(
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6629154d55d7c289634b8c5d/1jtNULRLJOhc7O8062_FN.png)

### Evaluation

- [HRM8K](https://arxiv.org/abs/2501.02448): a Korean math benchmark

**Generated and evaluated with my own code; accuracy may differ from officially reported numbers.**

| Model | GSM8K | KSM | MATH | OMNI_MATH |
|-------|-------|-------|-------|-------|
| **Qwen2.5-32B-s1.1-Ko-Native** | 89.92 | 39.85 | 87.73 | 42.06 |
| *GPT-4o | 91.21 | 22.83 | 74.45 | 30.75 |
| *GPT-4o-mini | 87.57 | 19.40 | 70.68 | 26.45 |
| **EXAONE-3.5-7.8B-Stratos-Ko** | 83.02 | 15.97 | 67.49 | 24.62 |
| **Qwen2.5-7B-s1.1-Ko-Native** | 76.27 | 15.48 | 66.45 | 23.57 |
| EXAONE-3.5-7.8B-Instruct | 81.58 | 14.71 | 63.50 | 21.69 |
| *Qwen2.5-14B-Instruct | 66.34 | 15.55 | 53.38 | 20.64 |
| *Llama-3.1-8B-Instruct | 77.79 | 7.21 | 49.01 | 15.92 |
| *Qwen2.5-7B-Instruct | 58.38 | 13.10 | 48.04 | 16.55 |
| *EXAONE-3.0-7.8B-Instruct | 72.33 | 7.98 | 46.79 | 15.35 |
| *Ko-R1-1.5B-preview | 43.3 | ? | 73.1 | 29.8 |

\* Reported by the [HRM8K authors](https://www.onelineai.com/blog/hrm8k)

- Generation (the loop is sketched under Code Sketches below)
  - ```temperature = 0.7```
  - ```top_p = 0.95```
  - ```max_tokens = 8192```
  - If a sample exceeded the token budget without emitting the `````` token, the `````` token was appended and up to 512 additional tokens were generated.
- Evaluation
  - Custom parser and evaluation code; some answers may have been parsed incorrectly (see the parser sketch below).

---

Why Qwen? Why not EXAONE?
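### Code Sketches

The snippets below are illustrative sketches, not the scripts used to train or evaluate this model; any column name, template string, or identifier not stated above is an assumption.

First, the "text" column conversion mentioned under Training Details. The source column names and the exact Qwen template string are assumptions about the dataset layout; verify them against the actual schema before use.

```python
# Sketch: rewrite the "text" column into Qwen's chat format.
# ASSUMPTIONS: the source columns ("question", "thinking", "answer") and the
# template string are illustrative, not the dataset's confirmed schema.
from datasets import load_dataset

QWEN_TEMPLATE = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\n{question}<|im_end|>\n"
    "<|im_start|>assistant\n{thinking}\n{answer}<|im_end|>"
)

def to_qwen_text(example):
    example["text"] = QWEN_TEMPLATE.format(
        question=example["question"],
        thinking=example["thinking"],
        answer=example["answer"],
    )
    return example

ds = load_dataset("werty1248/s1k-1.1-Ko-ReGenerated-Formatted", split="train")
ds = ds.map(to_qwen_text)
```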
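Next, where ```block_size``` and ```gradient_checkpointing``` plug in. This is a generic Hugging Face Trainer setup, not the s1 training script itself; only the two settings named in the card come from this model's actual run.

```python
# Sketch: how block_size and gradient_checkpointing fit a Trainer-based SFT
# setup. NOT the s1 script; everything except those two settings is a
# generic placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments

model_id = "Qwen/Qwen2.5-32B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

block_size = 20000  # from the card: long enough to keep full reasoning traces

def tokenize(batch):
    # Truncate each pre-formatted "text" sample to the block size.
    return tokenizer(batch["text"], truncation=True, max_length=block_size)

args = TrainingArguments(
    output_dir="qwen2.5-32b-s1.1-ko-native",  # placeholder
    per_device_train_batch_size=1,            # placeholder
    gradient_checkpointing=True,              # from the card: saves memory
    bf16=True,                                # placeholder precision choice
)
```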
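The generation loop from the Evaluation section, sketched with vLLM. The card elides the exact marker token, so ```END_THINK``` is a hypothetical placeholder, and the model repository id is likewise an assumption.

```python
# Sketch: generation with the continuation rule described in the card.
# END_THINK stands in for the elided marker token; MODEL_ID is an assumed
# repository id.
from vllm import LLM, SamplingParams

END_THINK = "<END_THINK>"  # placeholder: substitute the real marker token
MODEL_ID = "werty1248/Qwen2.5-32B-s1.1-Ko-Native"  # assumed repo id

llm = LLM(model=MODEL_ID)
first_pass = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=8192)
second_pass = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=512)

def generate(prompt: str) -> str:
    text = llm.generate([prompt], first_pass)[0].outputs[0].text
    if END_THINK not in text:
        # Budget exhausted without closing the reasoning block: force the
        # marker and let the model finish the answer in 512 more tokens.
        text += END_THINK
        text += llm.generate([prompt + text], second_pass)[0].outputs[0].text
    return text
```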
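Finally, the custom parser is not published; below is a generic sketch of the kind of `\boxed{...}` extraction math evaluation typically relies on, and one place where the parsing errors acknowledged above can creep in.

```python
# Sketch: extract the content of the last \boxed{...} span from a model
# answer, handling nested braces. Illustrative only; not the actual parser.
def extract_boxed(text: str):
    start = text.rfind(r"\boxed{")
    if start == -1:
        return None
    i = start + len(r"\boxed{")
    depth, chars = 1, []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
        i += 1
    return None  # unbalanced braces: treat as a parse failure

# Example: nested braces are kept intact.
assert extract_boxed(r"so the answer is \boxed{\frac{1}{2}}.") == r"\frac{1}{2}"
```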