Model Card for KEByT5-large (1.23B #params)

KEByT5: Korean-Enhanced/Enriched Byte-level Text-to-Text Transfer Transformer (T5)

A cross-modal, multilingual-friendly, Korean-centric token-free encoder-decoder pretrained language model for language understanding and generation.

  • ๋ณธ ์‚ฌ์ „ํ•™์Šต ์–ธ์–ด๋ชจ๋ธ์€ ์‹œ๊ฐ, ์ฒญ๊ฐ๊ณผ ๊ฐ™์€ ํ…์ŠคํŠธ ์ด์™ธ์˜ ๋ชจ๋‹ฌ๋ฆฌํ‹ฐ์™€ ๊ต์ฐจ์–ธ์–ด ์ง€์‹ ๊ตํ™˜์— ์šฉ์ดํ•œ ํ† ํฐ-ํ”„๋ฆฌ ์‚ฌ์ „ํ•™์Šต ์–ธ์–ด๋ชจ๋ธ์„ ๋ชฉํ‘œ๋กœ ํ•ฉ๋‹ˆ๋‹ค.
  • ๋ณ„๋„์˜ tokenizer๊ฐ€ ํ•„์š”์—†์ง€๋งŒ, ํŽธ์˜๋ฅผ ์œ„ํ•ด AutoTokenizer.from_pretrained()๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋‹ค๋ฅธ ํ† ํฌ๋‚˜์ด์ € ๊ธฐ๋ฐ˜ ์ธ์ฝ”๋”-๋””์ฝ”๋” ๋ชจ๋ธ๊ณผ ๋™์ผํ•˜๊ฒŒ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ํ† ํฌ๋‚˜์ด์ €๋ฅผ ์ƒ๋žตํ•˜๊ณ  ์‹ถ์€ ๊ฒฝ์šฐ, UTF-8 ์ž…๋ ฅ์„ ๋ฐ”์ดํŠธ ๋‹จ์œ„๋กœ ์ชผ๊ฐœ์–ด, ๊ฐ ๋ฐ”์ดํŠธ์— +3์„ ํ•˜์—ฌ Token ID๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค. (์ฆ‰, ASCII value 0 == Token ID 3, ASCII value 255 == Token ID 258)
  • ํ˜„์žฌ Preview ์Šคํ…Œ์ด์ง€์— ์žˆ๋Š” ๋ชจ๋ธ์ด๋ฉฐ, ํ™œ์šฉ์—๋Š” fine-tuning์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.
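
A minimal sketch of this byte-to-ID mapping without a tokenizer (the sample sentence is illustrative; the EOS id of 1 is assumed from the standard ByT5 convention):

# Map UTF-8 bytes to token IDs with the +3 offset described above.
text = "ํ•œ๊ตญ์–ด ํ† ํฐ-ํ”„๋ฆฌ ๋ชจ๋ธ"
token_ids = [b + 3 for b in text.encode("utf-8")] + [1]  # bytes + 3, then EOS (id 1)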

Acknowledgements

  • This pretrained language model was supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS-2022-00187238, Development of Large Korean Language Model Technology for Efficient Pre-training).

Model Details

๋ณธ ์‚ฌ์ „ํ•™์Šต ์–ธ์–ด๋ชจ๋ธ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๊ทœ๋ชจ๋ฅผ ๊ฐ€์ง‘๋‹ˆ๋‹ค:

  • kebyt5-small : 330M link
  • kebyt5-base : 580M link
  • kebyt5-large : 1.23B (this model)

์ด๋“ค ๋ชจ๋ธ์€ google/byt5-small, google/byt5-base, google/byt5-large ๋ชจ๋ธ๊ณผ ๋™์ผํ•œ ์‹ ๊ฒฝ๋ง ๊ตฌ์กฐ์™€ ํฌ๊ธฐ๋ฅผ ๊ฐ€์ง€๋ฉฐ, ํ† ํฌ๋‚˜์ด์ €(ByT5Tokenizer)์™€ ๊ตฌํ˜„ ์ƒ ๋‘ ๋ชจ๋ธ์€ ๋ณ„๋„์˜ ์ˆ˜์ •์—†์ด ๋ฐ”๋กœ ๊ตํ™˜ํ•˜์—ฌ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. huggingface transformers์—์„œ์˜ ์‚ฌ์šฉ๋ฒ• ์—ญ์‹œ, T5ForConditionalGeneration์„ ๋™์ผํ•˜๊ฒŒ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Model Description

  • Developed by: Language Intelligence Research Section, Electronics and Telecommunications Research Institute (ETRI)
  • Model type: Encoder-Decoder Transformer, specifically, ByT5.
  • Language(s) (NLP): Korean; English, Chinese, and Japanese (partially, for translation tasks).
  • License: Apache 2.0 License
  • Finetuned from model: kebyt5-small/-base/-large model weights were initialized from google/byt5-* for warm-start pretraining.

Model Sources

  • Repository: https://github.com/etri-crossmodal/llm-downstream-s2s (for downstream task training)
  • Paper: Shin et al., "Towards Korean-Centric Token-free Pretrained Language Model", in Proceedings of the 35th Annual Conference on Human and Cognitive Language Technology, pp. 711-715, 2023 (in Korean).

Uses

Use of this pretrained language model is restricted to research and educational purposes.

Direct Use

ํ˜„์žฌ ๊ณต๊ฐœ๋˜๋Š” ๋ชจ๋ธ์€ T5 ๋ชจ๋ธ ํ•™์Šต์— ์‚ฌ์šฉ๋œ Corrupted span denoising ๋งŒ์œผ๋กœ ํ•™์Šต๋˜์–ด ์žˆ์–ด, ์‹ค์ œ ์‘์šฉ ํƒœ์Šคํฌ์— ์ ์šฉํ•˜๊ธฐ ์œ„ํ•ด์„œ๋Š” fine-tuning ๊ณผ์ •์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.

Masked token prediction can be performed using the sentinel tokens (token IDs 258, 257, 256, ...), but the predicted content may be inappropriate.
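
A minimal sketch of masked-span prediction following the sentinel convention above (the input sentence and generation settings are illustrative; outputs are not guaranteed to be appropriate):

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("etri-lirs/kebyt5-large-preview")
model = AutoModelForSeq2SeqLM.from_pretrained("etri-lirs/kebyt5-large-preview")

# Build the input as UTF-8 bytes (+3 offset), then the first sentinel (258) and EOS (1).
text = "๋Œ€ํ•œ๋ฏผ๊ตญ์˜ ์ˆ˜๋„๋Š” "
input_ids = torch.tensor([[b + 3 for b in text.encode("utf-8")] + [258, 1]])

outputs = model.generate(input_ids, max_new_tokens=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))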

Downstream Use

Token-free ๋ชจ๋ธ์˜ ํŠน์„ฑ ์ƒ, ๋ณต์žกํ•˜๊ฑฐ๋‚˜ Noisyํ•œ ์ž…๋ ฅ์— ๊ฐ•๊ฑดํ•˜๋ฉฐ, ์งง์€ ์‹œํ€€์Šค ๊ธธ์ด์˜ ์ƒ์„ฑ์— ์ ํ•ฉํ•ฉ๋‹ˆ๋‹ค. (์˜ˆ: ์–ธ์–ด ์ดํ•ด, ๋Œ€ํ™” ์‘๋‹ต ์ƒ์„ฑ)

์‚ฌ์ „ํ•™์Šต์€ 1024 bytes ๊ธธ์ด์˜ ๋ฐ์ดํ„ฐ๋ฅผ ํ•™์Šตํ–ˆ๊ธฐ ๋•Œ๋ฌธ์—, ์ด๋ฅผ ์ดˆ๊ณผํ•˜๋Š” ๊ธด ์‹œํ€€์Šค๋ฅผ ๋‹ค๋ฃจ๋Š” ๋ฌธ์ œ์— ์ ํ•ฉํ•˜์ง€ ์•Š์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

For problems that require handling longer sequences, we recommend using a GBST-based token-free language model.
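
A minimal fine-tuning sketch for a downstream seq2seq task (the summarization-style input/target pair and the learning rate are placeholders; real use would iterate over a dataset, e.g. with Seq2SeqTrainer):

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("etri-lirs/kebyt5-large-preview")
model = AutoModelForSeq2SeqLM.from_pretrained("etri-lirs/kebyt5-large-preview")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Keep inputs within the 1024-byte pretraining length noted above.
batch = tokenizer(
    ["๋‹ค์Œ ๋ฌธ์„œ๋ฅผ ์š”์•ฝํ•˜์‹œ์˜ค: ..."],   # illustrative source text
    text_target=["์š”์•ฝ ๊ฒฐ๊ณผ"],           # illustrative target text
    max_length=1024, truncation=True, return_tensors="pt",
)
loss = model(**batch).loss   # standard seq2seq cross-entropy loss
loss.backward()
optimizer.step()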

Bias, Risks, Limitations, and Recommendations

Masked Token Prediction์„ ํ†ตํ•ด ํš๋“๋  ์ˆ˜ ์žˆ๋Š” ์ •๋ณด์—๋Š” ๋‹ค๋ฅธ ์ƒ์„ฑํ˜• ์–ธ์–ด๋ชจ๋ธ๊ณผ ๊ฐ™์€ ์œ„ํ—˜์„ ๊ฐ€์ง€๊ณ  ์žˆ์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ํ•™์Šต์— ์‚ฌ์šฉ๋œ ๋ฐ์ดํ„ฐ๋Š” ์š•์„ค, ์Œ๋ž€, ์ •์น˜์  ๋‚ด์šฉ ๋ฐ ๊ธฐํƒ€ ๊ฑฐ์นœ ์–ธ์–ด๋“ค์— ๋Œ€ํ•œ ๋ณ„๋„์˜ ์ฒ˜๋ฆฌ๊ฐ€ ์ด๋ฃจ์–ด์ง€์ง€ ์•Š์•˜์Šต๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ, ์‚ฌํšŒ์ ์œผ๋กœ ์šฉ์ธ๋˜์ง€ ์•Š์€ ํ† ํฐ์ด๋‚˜ ํ…์ŠคํŠธ๋ฅผ ์ƒ์„ฑํ•  ์ˆ˜ ์žˆ์œผ๋ฉฐ, ์ฃผ๋ณ€ ๋ฌธ๋งฅ์— ๋”ฐ๋ผ์„œ ๊ณต๊ฒฉ์ ์ธ ์ž…๋ ฅ์— ์–ด๋– ํ•œ ๊ฒฐ๊ณผ๋ฅผ ์ƒ์„ฑํ•  ์ˆ˜ ์žˆ์„์ง€ ์‰ฝ๊ฒŒ ์˜ˆ์ƒํ•  ์ˆ˜ ์—†์Šต๋‹ˆ๋‹ค.

ํ•œํŽธ, ๋ณธ ์–ธ์–ด๋ชจ๋ธ์€ ์ฃผ๋กœ ํ•œ๊ตญ์–ด ํ…์ŠคํŠธ๋กœ ํ•™์Šต๋˜์—ˆ์œผ๋ฉฐ, ์ด๋“ค์˜ ํŠน์„ฑ์„ ์ „์ดํ•  ์ˆ˜ ์žˆ๋Š” ๋‹ค์šด์ŠคํŠธ๋ฆผ ํƒœ์Šคํฌ, ๊ทธ ์ค‘์—์„œ๋„ ๋ถ„๋ฅ˜, ์š”์•ฝ, ์งง์€ ๋ฌธ์žฅ ์ƒ์„ฑ์— ์ ํ•ฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ž…์ถœ๋ ฅ ์ˆ˜์ค€์—์„œ ๋ฏธ๋“ฑ๋ก์–ด(Out-of-Vocabulary)๊ฐ€ ์กด์žฌํ•  ์ˆ˜ ์—†์œผ๋‚˜, ์‚ฌ์ „ํ•™์Šต๋˜์ง€ ์•Š์€ ํ…์ŠคํŠธ ์‹œํ€€์Šค์— ๋Œ€ํ•ด์„œ๋Š” ์ถ”๊ฐ€์˜ ๋„๋ฉ”์ธ ์ ์‘ ํ•™์Šต ๋ฐ ๋‹ค์šด์ŠคํŠธ๋ฆผ ํƒœ์Šคํฌ์˜ ๋ฏธ์„ธ์กฐ์ •์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.

How to Get Started with the Model

With Transformers version 4.27.0 or later, the model and tokenizer can be loaded with the following Python code:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("etri-lirs/kebyt5-large-preview")
model = AutoModelForSeq2SeqLM.from_pretrained("etri-lirs/kebyt5-large-preview")
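
A quick illustrative check of what the tokenizer produces (the sample string is arbitrary):

enc = tokenizer("ํ•œ๊ตญ์–ด", return_tensors="pt")
print(enc.input_ids)   # UTF-8 byte values + 3, followed by the EOS id (1)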

Training Details

Training Data

๋ณธ ์‚ฌ์ „ํ•™์Šต์—๋Š” ์•„๋ž˜์˜ ๊ณต๊ฐœ ๋ฐ์ดํ„ฐ๊ฐ€ ์‚ฌ์šฉ๋˜์—ˆ์Šต๋‹ˆ๋‹ค:

  • ๊ตญ๋ฆฝ๊ตญ์–ด์›, ๋ชจ๋‘์˜ ๋ง๋ญ‰์น˜. ์‹ ๋ฌธ v2.0
  • ๊ตญ๋ฆฝ๊ตญ์–ด์›, ๋ชจ๋‘์˜ ๋ง๋ญ‰์น˜. ๊ตฌ์–ด ๋ง๋ญ‰์น˜ v1.2
  • ๊ตญ๋ฆฝ๊ตญ์–ด์›, ๋ชจ๋‘์˜ ๋ง๋ญ‰์น˜. ๋ฌธ์–ด ๋ง๋ญ‰์น˜ v1.0
  • ๊ตญ๋ฆฝ๊ตญ์–ด์›, ๋ชจ๋‘์˜ ๋ง๋ญ‰์น˜. ์‹ ๋ฌธ 2020 v1.0
  • ๊ตญ๋ฆฝ๊ตญ์–ด์›, ๋ชจ๋‘์˜ ๋ง๋ญ‰์น˜. ์‹ ๋ฌธ 2021 v1.0
  • ํ•œ๊ตญ์–ด ์œ„ํ‚คํ”ผ๋””์–ด ๋คํ”„, v2020.09.20
  • ๋‚˜๋ฌด์œ„ํ‚ค ๋คํ”„
  • ํ•œ๊ตญ์ •๋ณดํ™”์ง„ํฅ์›, AIHub. ์ „๋ฌธ๋ถ„์•ผ ๋ง๋ญ‰์น˜, ๋ฒ•๋ฅ /ํŠนํ—ˆ ์ง€์‹๋ฒ ์ด์Šค, ๋…ผ๋ฌธ/๋„์„œ/๋Œ€ํ™”/๋Œ€๋ณธ ์š”์•ฝ, ํ•œ์˜/ํ•œ์ผ/ํ•œ์ค‘ ๋ฒˆ์—ญ ๋ง๋ญ‰์น˜, ์ฝœ์„ผํ„ฐ/์ฃผ๋ฌธ/๋‰ด์Šค๊ธฐ์‚ฌ/์‹œ๊ฐ์ •๋ณด ์งˆ์˜์‘๋‹ต, ๋ฐฉ์†ก/ํšŒ์˜/์ƒ๋‹ด ์Œ์„ฑ์ธ์‹ ๋ฐ์ดํ„ฐ.
  • ํ•œ๊ตญ์ •๋ณดํ™”์ง„ํฅ์›, AIHub. ๋Œ€๊ทœ๋ชจ ์›น๋ฐ์ดํ„ฐ ๊ธฐ๋ฐ˜ ํ•œ๊ตญ์–ด ๋ง๋ญ‰์น˜ ๋ฐ์ดํ„ฐ
  • ํ•œ๊ตญ์ •๋ณดํ™”์ง„ํฅ์›, AIHub. ์˜จ๋ผ์ธ ๊ตฌ์–ด์ฒด ๋ง๋ญ‰์น˜ ๋ฐ์ดํ„ฐ.
  • KcBERT ๋ง๋ญ‰์น˜, v2022.3Q

๋˜ํ•œ, ์†Œ๋Ÿ‰์˜ ์ž์ฒด ๊ตฌ์ถ•๋œ ๋ฐ์ดํ„ฐ ๋ฐ ํ•ฉ์„ฑ ๋ฐ์ดํ„ฐ ์ผ๋ถ€๋ฅผ ์‚ฌ์šฉ, ์ „์ฒด ์•ฝ ~220GB ๊ฐ€๋Ÿ‰์˜ ๋ฐ์ดํ„ฐ๋กœ ํ•™์Šต๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

Evaluation

Testing Data, Factors, Metrics & Results

ํ•œ๊ตญ์–ด ์–ธ์–ด ์ดํ•ด ํƒœ์Šคํฌ์— ์‚ฌ์šฉ๋˜๋Š” KLUE dataset, v1.1์˜ dev set์„ ์‚ฌ์šฉํ•˜์—ฌ ํ‰๊ฐ€๋˜์—ˆ์Šต๋‹ˆ๋‹ค. ์ƒ์„ฑ์€ ๋ชจ๋‘ seq2seq์„ ์ด์šฉํ•œ ์ถœ๋ ฅ ๋ ˆ์ด๋ธ” ์ง์ ‘ ์ƒ์„ฑ ๋ฐฉ๋ฒ•์„ ์‚ฌ์šฉํ–ˆ์Šต๋‹ˆ๋‹ค.

| models | KLUE-TC (YNAT) (F1) | KLUE-NER (Entity, Char F1) | KLUE-DP (UAS, LAS) | KLUE-MRC (EM, ROUGE-W) |
|---|---|---|---|---|
| google/byt5-large (1.23B) | 78.52 | 48.81, 63.95 | 44.26, 7.805 | NOT TESTED |
| KEByT5-Base (580M) | 84.99 | 86.75, 91.05 | 88.70, 85.90 | 62.28, 68.38 |
| KEByT5-Large (1.23B) | 85.68 | 88.09, 92.40 | 87.18, 85.52 | 70.07, 75.81 |
| GBST-KEByT5-Base (584M) | 85.29 | 87.35, 92.09 | 88.33, 85.00 | 59.69, 66.44 |

๋Œ€ํ™” ์ƒํƒœ ์ถ”์ (DST; Dialogue State Tracking) ํƒœ์Šคํฌ์ธ KLUE-WOS-v1.1 ๊ฒฐ๊ณผ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค. ํ‰๊ฐ€๋Š” ๋ชจ๋‘ seq2seq์„ ์ด์šฉํ•œ ๋‹ค์ด์–ผ๋กœ๊ทธ ์ƒํƒœ ์ง์ ‘ ์ƒ์„ฑ์„ ์‚ฌ์šฉํ–ˆ์Šต๋‹ˆ๋‹ค:

| models | WOS (JGA, %) | WOS (F1, %) |
|---|---|---|
| klue/klue-roberta-large | 50.22 | 92.23 |
| KEByT5-Base (580M) | 77.15 | 96.92 |
| KEByT5-Large (1.23B) | 78.54 | 97.28 |

Results on KLUE-RE-v1.1, a relation extraction (RE) task, are as follows. Micro F1 is reported over the 29 relation classes, excluding no_relation:

| models | KLUE-RE (F1, %) |
|---|---|
| klue/klue-roberta-base | 65.90 |
| KEByT5-Base (580M) | 65.48 |
| KEByT5-Large (1.23B) | 68.95 |

Compute Infrastructure

  • Trained on 4× NVIDIA A100 80GB GPUs

Citation

  • Heo et al., "Relation Extraction Using a Generative Language Model", in Proceedings of the 35th Annual Conference on Human and Cognitive Language Technology, pp. 708-710, 2023 (in Korean).
  • Lee et al., "Korean Generation-based Dialogue State Tracking Using the Korean Token-free Pretrained Language Model KeByT5", in Proceedings of the 35th Annual Conference on Human and Cognitive Language Technology, pp. 644-647, 2023 (in Korean).

Model Card Authors/Contacts

Jong-hun Shin(ETRI), e-mail=jhshin82 AT etri DOT re DOT kr.
