---
library_name: transformers
license: apache-2.0
datasets:
- werty1248/s1k-1.1-Ko-ReGenerated-Formatted
language:
- ko
base_model:
- Qwen/Qwen2.5-32B-Instruct
pipeline_tag: text-generation
---

### Korean S1K-1.1

- Generation result: [werty1248/qwen-32b-s1.1-Ko-Native-result](https://huggingface.co/datasets/werty1248/qwen-32b-s1.1-Ko-Native-result)
- AIME 2024 (Korean): 30% (9/30) with ```max_tokens=8192```
  - Chinese sometimes appears in the think tokens. This is not hallucination or failure behavior: the model solves the problem correctly in Chinese, then answers in Korean.

### Training Details

- Trained with the [s1k official code](https://github.com/simplescaling/s1)
  - ```block_size = 20000```
  - ```gradient_checkpointing=True```
  - (A sketch of where these two settings apply appears under Code Sketches below.)
- Training dataset: [werty1248/s1k-1.1-Ko-ReGenerated-Formatted](https://huggingface.co/datasets/werty1248/s1k-1.1-Ko-ReGenerated-Formatted)
  - Questions from [s1k-1.1](https://huggingface.co/datasets/simplescaling/s1K-1.1) were translated into Korean, and the reasoning traces were then regenerated with DeepSeek-R1.
  - The "text" column needs to be converted to the Qwen chat format (see the conversion sketch below).
- Hardware and cost: 8x H200 SXM, 2 hours; $63.84, not counting the cost of trial and error :(
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6629154d55d7c289634b8c5d/1jtNULRLJOhc7O8062_FN.png)

### Evaluation

- [HRM8K](https://arxiv.org/abs/2501.02448): a Korean math benchmark

**Generated and evaluated with my own code; accuracy may differ from officially reported numbers.**

| Model | GSM8K | KSM | MATH | OMNI_MATH |
|-------|-------|-------|-------|-------|
| **Qwen2.5-32B-s1.1-Ko-Native** | 89.92 | 39.85 | 87.73 | 42.06 |
| *GPT-4o | 91.21 | 22.83 | 74.45 | 30.75 |
| *GPT-4o-mini | 87.57 | 19.40 | 70.68 | 26.45 |
| **EXAONE-3.5-7.8B-Stratos-Ko** | 83.02 | 15.97 | 67.49 | 24.62 |
| **Qwen2.5-7B-s1.1-Ko-Native** | 76.27 | 15.48 | 66.45 | 23.57 |
| EXAONE-3.5-7.8B-Instruct | 81.58 | 14.71 | 63.50 | 21.69 |
| *Qwen2.5-14B-Instruct | 66.34 | 15.55 | 53.38 | 20.64 |
| *Llama-3.1-8B-Instruct | 77.79 | 7.21 | 49.01 | 15.92 |
| *Qwen2.5-7B-Instruct | 58.38 | 13.10 | 48.04 | 16.55 |
| *EXAONE-3.0-7.8B-Instruct | 72.33 | 7.98 | 46.79 | 15.35 |
| *Ko-R1-1.5B-preview | 43.3 | ? | 73.1 | 29.8 |

\* Reported by the [HRM8K authors](https://www.onelineai.com/blog/hrm8k)

- Generation (the loop is sketched under Code Sketches below)
  - ```temperature = 0.7```
  - ```top_p = 0.95```
  - ```max_tokens = 8192```
  - If a sample exceeded the token budget without emitting the `````` token, the `````` token was appended and up to 512 additional tokens were generated.
- Evaluation
  - Custom parser and evaluation code; some answers may have been parsed incorrectly (see the parser sketch below).

---

Why Qwen? Why not EXAONE?
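### Code Sketches

The snippets below are illustrative sketches, not the scripts used to train or evaluate this model; any column name, template string, or identifier not stated above is an assumption.

First, the "text" column conversion mentioned under Training Details. The source column names and the exact Qwen template string are assumptions about the dataset layout; verify them against the actual schema before use.

```python
# Sketch: rewrite the "text" column into Qwen's chat format.
# ASSUMPTIONS: the source columns ("question", "thinking", "answer") and the
# template string are illustrative, not the dataset's confirmed schema.
from datasets import load_dataset

QWEN_TEMPLATE = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\n{question}<|im_end|>\n"
    "<|im_start|>assistant\n{thinking}\n{answer}<|im_end|>"
)

def to_qwen_text(example):
    example["text"] = QWEN_TEMPLATE.format(
        question=example["question"],
        thinking=example["thinking"],
        answer=example["answer"],
    )
    return example

ds = load_dataset("werty1248/s1k-1.1-Ko-ReGenerated-Formatted", split="train")
ds = ds.map(to_qwen_text)
```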
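Next, where ```block_size``` and ```gradient_checkpointing``` plug in. This is a generic Hugging Face Trainer setup, not the s1 training script itself; only the two settings named in the card come from this model's actual run.

```python
# Sketch: how block_size and gradient_checkpointing fit a Trainer-based SFT
# setup. NOT the s1 script; everything except those two settings is a
# generic placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments

model_id = "Qwen/Qwen2.5-32B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

block_size = 20000  # from the card: long enough to keep full reasoning traces

def tokenize(batch):
    # Truncate each pre-formatted "text" sample to the block size.
    return tokenizer(batch["text"], truncation=True, max_length=block_size)

args = TrainingArguments(
    output_dir="qwen2.5-32b-s1.1-ko-native",  # placeholder
    per_device_train_batch_size=1,            # placeholder
    gradient_checkpointing=True,              # from the card: saves memory
    bf16=True,                                # placeholder precision choice
)
```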
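The generation loop from the Evaluation section, sketched with vLLM. The card elides the exact marker token, so ```END_THINK``` is a hypothetical placeholder, and the model repository id is likewise an assumption.

```python
# Sketch: generation with the continuation rule described in the card.
# END_THINK stands in for the elided marker token; MODEL_ID is an assumed
# repository id.
from vllm import LLM, SamplingParams

END_THINK = "<END_THINK>"  # placeholder: substitute the real marker token
MODEL_ID = "werty1248/Qwen2.5-32B-s1.1-Ko-Native"  # assumed repo id

llm = LLM(model=MODEL_ID)
first_pass = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=8192)
second_pass = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=512)

def generate(prompt: str) -> str:
    text = llm.generate([prompt], first_pass)[0].outputs[0].text
    if END_THINK not in text:
        # Budget exhausted without closing the reasoning block: force the
        # marker and let the model finish the answer in 512 more tokens.
        text += END_THINK
        text += llm.generate([prompt + text], second_pass)[0].outputs[0].text
    return text
```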
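Finally, the custom parser is not published; below is a generic sketch of the kind of `\boxed{...}` extraction math evaluation typically relies on, and one place where the parsing errors acknowledged above can creep in.

```python
# Sketch: extract the content of the last \boxed{...} span from a model
# answer, handling nested braces. Illustrative only; not the actual parser.
def extract_boxed(text: str):
    start = text.rfind(r"\boxed{")
    if start == -1:
        return None
    i = start + len(r"\boxed{")
    depth, chars = 1, []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
        i += 1
    return None  # unbalanced braces: treat as a parse failure

# Example: nested braces are kept intact.
assert extract_boxed(r"so the answer is \boxed{\frac{1}{2}}.") == r"\frac{1}{2}"
```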