metadata
language:
  - ko
  - en
pipeline_tag: text-generation
inference: false
tags:
  - solar
  - mistral
  - pytorch
  - solar-ko
library_name: transformers
license: apache-2.0
base_model: upstage/SOLAR-10.7B-v1.0

Update Log

  • 2024.07.01: Released Solar-Ko-Recovery & Uploaded Benchmark scores
  • 2024.05.16: Preview Released Solar-Ko-Recovery

Solar-Ko-Recovery-11B 🌟❤️‍🩹

Solar-Ko-Recovery-11B aims to recover Solar's Korean capability by rearranging the embeddings and LM head. It features an expanded vocabulary and training on a Korean+English corpus for enhanced representation.

Model Details

Model Developers: Junbum Lee (Beomi)

Variations: Solar-Ko-Recovery is available in one parameter size: 11B (10.99B🤣).

Input: The model accepts only text input.

Output: The model produces text output exclusively.

Model Architecture:

Solar-Ko-Recovery is an auto-regressive language model that leverages an optimized transformer architecture derived from Llama-2.

| Model | Training Data | Parameters | Content Length | GQA | Tokens | Learning Rate |
|---|---|---|---|---|---|---|
| Solar-Ko-Recovery | A curated mix of Korean+English Corpora | 11B(10.99B) | 4k | O | >100B* | 5e-5 |

NOTE: Training proceeded in two steps:

  1. Only the embedding layer and the LM head layer are trained.
  2. All parameters are trained.
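
The two-step schedule above can be sketched as a small parameter-freezing helper. This is a minimal sketch, not the author's training code. It assumes the checkpoint follows the Llama-style parameter naming used in `transformers`, where the input embedding parameters contain `embed_tokens` and the output projection contains `lm_head`. The names `configure_step`, `_Param`, and `_ToyModel` are hypothetical and exist only for illustration.

```python
# Sketch of the two-step schedule: step 1 trains only the embedding and
# LM head parameters, step 2 trains every parameter.
# Assumption: parameter names follow the Llama-style convention
# ("embed_tokens", "lm_head"); adjust the markers for other layouts.

STEP1_MARKERS = ("embed_tokens", "lm_head")


def configure_step(model, step):
    """Set requires_grad for the given step; return names left trainable."""
    if step not in (1, 2):
        raise ValueError("step must be 1 or 2")
    trainable = []
    for name, param in model.named_parameters():
        if step == 1:
            param.requires_grad = any(m in name for m in STEP1_MARKERS)
        else:
            param.requires_grad = True
        if param.requires_grad:
            trainable.append(name)
    return trainable


# Hypothetical stand-ins so the sketch runs without loading real weights.
class _Param:
    requires_grad = True


class _ToyModel:
    def __init__(self):
        self._params = {
            "model.embed_tokens.weight": _Param(),
            "model.layers.0.self_attn.q_proj.weight": _Param(),
            "lm_head.weight": _Param(),
        }

    def named_parameters(self):
        return self._params.items()


toy = _ToyModel()
step1_names = configure_step(toy, step=1)  # embeddings and LM head only
step2_names = configure_step(toy, step=2)  # everything unfrozen
print(step1_names)
print(len(step2_names))
```

Any object exposing `named_parameters()`, such as a PyTorch module, could be passed in place of the toy model.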

Vocab Expansion

Vocab expansion was conducted on an edited version of upstage/solar-1-mini-tokenizer, which is a superset of the Solar tokenizer.

| Model Name | Vocabulary Size | Description |
|---|---|---|
| Original Solar | 32000 | Sentencepiece BPE |
| solar-1-mini-tokenizer | 64000 | Sentencepiece BPE. Added Ko/JP vocabs |

Tokenizing "안녕하세요, 오늘은 날씨가 좋네요."

  • SOLAR-10.7B: 26 tokens
  • Solar-Ko-Recovery: 7 tokens

| Model | Tokens |
|---|---|
| SOLAR-10.7B | `['▁', '안', '<0xEB>', '<0x85>', '<0x95>', '하', '세', '요', ',', '▁', '오', '<0xEB>', '<0x8A>', '<0x98>', '은', '▁', '날', '<0xEC>', '<0x94>', '<0xA8>', '가', '▁', '좋', '네', '요', '.']` |
| Solar-Ko-Recovery | `['▁안녕하세요', ',', '▁오늘은', '▁날씨가', '▁좋', '네요', '.']` |
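
The `<0x..>` entries in the SOLAR-10.7B row are byte-fallback tokens: characters missing from the original vocabulary are emitted as their raw UTF-8 bytes, three tokens per Hangul syllable here. The sketch below reassembles both token lists from the table into text to show that they encode the same sentence. It is a simplified illustration rather than the tokenizer's actual decoding code, and `detokenize` is a hypothetical helper name.

```python
import re

# Token lists copied from the comparison table above.
SOLAR_TOKENS = ['▁', '안', '<0xEB>', '<0x85>', '<0x95>', '하', '세', '요', ',',
                '▁', '오', '<0xEB>', '<0x8A>', '<0x98>', '은', '▁', '날',
                '<0xEC>', '<0x94>', '<0xA8>', '가', '▁', '좋', '네', '요', '.']
KO_TOKENS = ['▁안녕하세요', ',', '▁오늘은', '▁날씨가', '▁좋', '네요', '.']

BYTE_TOKEN = re.compile(r"^<0x([0-9A-Fa-f]{2})>$")


def detokenize(tokens):
    """Join SentencePiece-style tokens, merging byte-fallback pieces."""
    buf = bytearray()
    for tok in tokens:
        match = BYTE_TOKEN.match(tok)
        if match:
            # Byte-fallback token: append the raw byte value.
            buf.append(int(match.group(1), 16))
        else:
            buf.extend(tok.encode("utf-8"))
    # '▁' marks a word boundary; drop the single leading one.
    return buf.decode("utf-8").replace("▁", " ").lstrip(" ")


solar_text = detokenize(SOLAR_TOKENS)
ko_text = detokenize(KO_TOKENS)
print(len(SOLAR_TOKENS), len(KO_TOKENS))
print(solar_text == ko_text)
```

For example, the bytes `0xEB 0x85 0x95` are the UTF-8 encoding of the syllable 녕, which the expanded vocabulary covers inside the single token `▁안녕하세요`.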

Tokenizing "Meet 10.7B Solar: Elevating Performance with Upstage Depth UP Scaling!"

  • SOLAR-10.7B: 22 tokens
  • Solar-Ko-Recovery: 22 tokens

| Model | Tokens |
|---|---|
| SOLAR-10.7B | `['▁Meet', '▁', '1', '0', '.', '7', 'B', '▁Solar', ':', '▁E', 'lev', 'ating', '▁Performance', '▁with', '▁Up', 'stage', '▁Dep', 'th', '▁UP', '▁Scal', 'ing', '!']` |
| Solar-Ko-Recovery | `['▁Meet', '▁', '1', '0', '.', '7', 'B', '▁Solar', ':', '▁E', 'lev', 'ating', '▁Performance', '▁with', '▁Up', 'stage', '▁Dep', 'th', '▁UP', '▁Scal', 'ing', '!']` |

LICENSE

Apache 2.0

Model Benchmark

LM Eval Harness - Korean

| Tasks | Metric | Value | Stderr |
|---|---|---|---|
| haerae | acc_norm | 0.7874 | ± 0.0118 |
| - haerae_general_knowledge | acc | 0.5000 | ± 0.0378 |
| - haerae_history | acc | 0.8723 | ± 0.0244 |
| - haerae_loan_word | acc | 0.8402 | ± 0.0283 |
| - haerae_rare_word | acc | 0.8346 | ± 0.0185 |
| - haerae_standard_nomenclature | acc | 0.8301 | ± 0.0305 |
| kmmlu_direct | exact_match | 0.4205 | ± 0.0026 |
| - kmmlu_direct_accounting | exact_match | 0.3700 | ± 0.0485 |
| - kmmlu_direct_agricultural_sciences | exact_match | 0.3140 | ± 0.0147 |
| - kmmlu_direct_aviation_engineering_and_maintenance | exact_match | 0.3870 | ± 0.0154 |
| - kmmlu_direct_biology | exact_match | 0.3510 | ± 0.0151 |
| - kmmlu_direct_chemical_engineering | exact_match | 0.3910 | ± 0.0154 |
| - kmmlu_direct_chemistry | exact_match | 0.4000 | ± 0.0200 |
| - kmmlu_direct_civil_engineering | exact_match | 0.4010 | ± 0.0155 |
| - kmmlu_direct_computer_science | exact_match | 0.6520 | ± 0.0151 |
| - kmmlu_direct_construction | exact_match | 0.3080 | ± 0.0146 |
| - kmmlu_direct_criminal_law | exact_match | 0.3100 | ± 0.0328 |
| - kmmlu_direct_ecology | exact_match | 0.4660 | ± 0.0158 |
| - kmmlu_direct_economics | exact_match | 0.5385 | ± 0.0439 |
| - kmmlu_direct_education | exact_match | 0.6200 | ± 0.0488 |
| - kmmlu_direct_electrical_engineering | exact_match | 0.3000 | ± 0.0145 |
| - kmmlu_direct_electronics_engineering | exact_match | 0.4740 | ± 0.0158 |
| - kmmlu_direct_energy_management | exact_match | 0.3560 | ± 0.0151 |
| - kmmlu_direct_environmental_science | exact_match | 0.2980 | ± 0.0145 |
| - kmmlu_direct_fashion | exact_match | 0.4470 | ± 0.0157 |
| - kmmlu_direct_food_processing | exact_match | 0.3690 | ± 0.0153 |
| - kmmlu_direct_gas_technology_and_engineering | exact_match | 0.3000 | ± 0.0145 |
| - kmmlu_direct_geomatics | exact_match | 0.3820 | ± 0.0154 |
| - kmmlu_direct_health | exact_match | 0.5700 | ± 0.0498 |
| - kmmlu_direct_industrial_engineer | exact_match | 0.3830 | ± 0.0154 |
| - kmmlu_direct_information_technology | exact_match | 0.6090 | ± 0.0154 |
| - kmmlu_direct_interior_architecture_and_design | exact_match | 0.5440 | ± 0.0158 |
| - kmmlu_direct_korean_history | exact_match | 0.3800 | ± 0.0488 |
| - kmmlu_direct_law | exact_match | 0.4670 | ± 0.0158 |
| - kmmlu_direct_machine_design_and_manufacturing | exact_match | 0.3960 | ± 0.0155 |
| - kmmlu_direct_management | exact_match | 0.5030 | ± 0.0158 |
| - kmmlu_direct_maritime_engineering | exact_match | 0.4283 | ± 0.0202 |
| - kmmlu_direct_marketing | exact_match | 0.7460 | ± 0.0138 |
| - kmmlu_direct_materials_engineering | exact_match | 0.4020 | ± 0.0155 |
| - kmmlu_direct_math | exact_match | 0.2867 | ± 0.0262 |
| - kmmlu_direct_mechanical_engineering | exact_match | 0.3490 | ± 0.0151 |
| - kmmlu_direct_nondestructive_testing | exact_match | 0.3760 | ± 0.0153 |
| - kmmlu_direct_patent | exact_match | 0.3700 | ± 0.0485 |
| - kmmlu_direct_political_science_and_sociology | exact_match | 0.5300 | ± 0.0289 |
| - kmmlu_direct_psychology | exact_match | 0.4470 | ± 0.0157 |
| - kmmlu_direct_public_safety | exact_match | 0.3520 | ± 0.0151 |
| - kmmlu_direct_railway_and_automotive_engineering | exact_match | 0.3220 | ± 0.0148 |
| - kmmlu_direct_real_estate | exact_match | 0.4350 | ± 0.0351 |
| - kmmlu_direct_refrigerating_machinery | exact_match | 0.3240 | ± 0.0148 |
| - kmmlu_direct_social_welfare | exact_match | 0.4970 | ± 0.0158 |
| - kmmlu_direct_taxation | exact_match | 0.3800 | ± 0.0344 |
| - kmmlu_direct_telecommunications_and_wireless_technology | exact_match | 0.5480 | ± 0.0157 |
| kobest_boolq | acc | 0.9202 | ± 0.0072 |
| | f1 | 0.9202 | ± N/A |
| kobest_copa | acc | 0.8680 | ± 0.0107 |
| | f1 | 0.8678 | ± N/A |
| kobest_hellaswag | acc | 0.5560 | ± 0.0222 |
| | f1 | 0.5520 | ± N/A |
| | acc_norm | 0.6540 | ± 0.0213 |
| kobest_sentineg | acc | 0.9824 | ± 0.0066 |
| | f1 | 0.9824 | ± N/A |
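
As a reading aid for the Stderr column, a reported value and its standard error can be turned into an approximate 95% interval. This is a minimal sketch under a normal-approximation assumption. The multiplier `z=1.96` and the helper name `confidence_interval` are illustrative choices, not part of the evaluation setup. Rows whose Stderr is N/A cannot be treated this way.

```python
def confidence_interval(value, stderr, z=1.96):
    """Approximate interval value ± z * stderr (normal approximation)."""
    half_width = z * stderr
    return value - half_width, value + half_width


# haerae acc_norm row from the table above: 0.7874 ± 0.0118
low, high = confidence_interval(0.7874, 0.0118)
print(f"haerae acc_norm: {low:.4f} to {high:.4f}")
```

Under these assumptions, the `haerae` score lies roughly between 0.7643 and 0.8105.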

Citation

TBD

Acknowledgements