JEJUMA-001

Preserving Our Disappearing Dialects with LLMs, Project 1: The Jeju Dialect

Why did we start this project?

A regional dialect that is disappearing fast: Jeju

  • Many regional dialects, and the Jeju dialect in particular, are disappearing rapidly.
  • UNESCO has classified Jejueo (the Jeju dialect) as a critically endangered language.
  • Only 36.1% of Jeju residents know Jejueo.
  • In particular, as exchange with other regions has grown, younger people have come to prefer standard Korean over Jejueo.

Language models are weak on regional dialects

  • Online sources are written largely in standard Korean, so language models trained on them know little about regional dialects, for which data is scarce.
  • Jejueo in particular differs so much from standard Korean that models cannot understand anything beyond well-known words and sentences.

How did we address this?

  • A language model converts hard-to-understand Jejueo into standard Korean, so that Jejueo is not forgotten.
  • A language model generates the Jejueo version of a standard Korean sentence, so that users can look it up.
  • We chose a language model so that the broad knowledge it has already learned carries over as-is.

About the model

  • We fine-tuned Llama3.1 on Jeju dialect data so that it can perform a range of tasks related to the Jeju dialect.
  • JEJUMA-001 can currently convert between the dialect and standard Korean and detect the dialect, among other tasks.
  • To train JEJUMA-001, we collected about 1.05 million Jeju dialect / Seoul Korean sentence pairs and selected the 170,000 in which Jejueo features are most clearly evident.
  • From these pairs we generated training data for four tasks, about 340,000 examples in total.
  • Training used LoRA via LlamaFactory, for one epoch over all the data.
  • On difficult Jeju speech, JEJUMA-001 translates more accurately than GPT-4o and the Korean-built models Upstage Solar and Naver HCX.
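
The card does not say how the selected pairs were turned into task examples, so the following is only a sketch of one plausible construction. It assumes the prompt wording from the usage section of this card and LlamaFactory's Alpaca-style records (`instruction`, `input`, `output`). The function name `make_record`, the detection labels, and the layout of the detect-and-convert answer are hypothetical.

```python
import json

# Prompt prefixes copied from the usage section of this card.
PROMPTS = {
    "dialect_to_standard": "Convert the following sentence or word which is Jeju island dialect to standard Korean: ",
    "standard_to_dialect": "Convert the following sentence or word which is standard Korean to Jeju island dialect: ",
    "detect_dialect": "Detect the following sentence or word is Jeju island dialect or standard Korean: ",
    "detect_dialect_and_convert": "Detect the following sentence or word is Jeju island dialect or standard Korean and convert the following sentence or word to Jeju island dialect or standard Korean: ",
}

# ASSUMPTION: hypothetical answer labels; the card does not state the real ones.
LABELS = {True: "Jeju island dialect", False: "standard Korean"}


def make_record(task, dialect, standard, input_is_dialect=True):
    """Build one Alpaca-style record for `task` from a (dialect, standard) pair."""
    if task not in PROMPTS:
        raise ValueError(f"unknown task: {task}")
    # For the detection tasks, either side of the pair can serve as the input.
    source, target = (dialect, standard) if input_is_dialect else (standard, dialect)
    if task == "dialect_to_standard":
        source, answer = dialect, standard
    elif task == "standard_to_dialect":
        source, answer = standard, dialect
    elif task == "detect_dialect":
        answer = LABELS[input_is_dialect]
    else:  # detect_dialect_and_convert
        # ASSUMPTION: hypothetical answer layout, "<label>: <converted text>".
        answer = f"{LABELS[input_is_dialect]}: {target}"
    return {"instruction": PROMPTS[task] + source, "input": "", "output": answer}


# One (Jejueo, standard Korean) pair taken from the comparison examples in this card.
pair = ("미깡낭 경 가심 너네 아방 좀 데령", "귤나무에 그냥 가서 너네 아버지좀 찾아와라.")
records = [make_record(task, *pair) for task in PROMPTS]
print(json.dumps(records, ensure_ascii=False, indent=2))
```

Going from 170,000 pairs to about 340,000 examples works out to roughly two examples per pair. The card does not state how the four tasks were distributed across pairs, so the sketch leaves that choice to the caller.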

Jejueo -> Standard Korean

Input: 자이 폴에 독솔 막 난 거 보난 언 생이우다
Reference: 재 팔에 닭살이 막 난 거 보니, 추운 모양이다. ("That kid's arm is covered in goosebumps, so they must be cold.")
Upstage Solar output: 그 바위에 뱀이 나타나는 걸 보니까 정말 놀랐다. ("I was really startled to see a snake appear on that rock.")
Naver HCX output: 재의 풀에 독초가 마구 난 것을 보니 어린 소나무입니다. ("Judging by the poisonous weeds growing wild in Jae's grass, it is a young pine tree.")
GPT-4o output: 저기 바위에 독사가 막 나타난 걸 보니까 정말 놀랐다. ("I was really startled to see a viper suddenly appear on that rock over there.")
JEJUMA-001 output:

Standard Korean -> Jejueo

Input: 귤나무에 그냥 가서 너네 아버지좀 찾아와라. ("Just go over to the tangerine tree and fetch your father.")
Reference: 미깡낭 경 가심 너네 아방 좀 데령
Upstage Solar output: 귤 나무에 가서 네 아버지를 좀 찾아와.
Naver HCX output: 귤낭에 강 느네 아방 좀 데령오라.
GPT-4o output: 귤나무에 걍 가서 햄신 아방 좀 찾아와라.
JEJUMA-001 output: 미깡낭에 그냥 강 너네 아방좀 촞아오라

How do I use it?

  • The template class defined in the code provides four prompts; use one of dialect_to_standard, standard_to_dialect, detect_dialect, or detect_dialect_and_convert.
  • dialect_to_standard: convert Jejueo to standard Korean
  • standard_to_dialect: convert standard Korean to Jejueo
  • detect_dialect: detect whether the input is Jejueo or standard Korean
  • detect_dialect_and_convert: automatically detect whether the input is Jejueo or standard Korean, then convert it to the other

import transformers
import torch

model_id = "JEJUMA/JEJUMA-001"

# Load the model as a text-generation pipeline in bfloat16,
# letting device_map="auto" place the weights on the available devices.
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

# Stop generation at the tokenizer's EOS token or at the end-of-turn token.
terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]


class JejuPromptTemplate:
    """Builds the single-turn chat message for each supported task."""

    @staticmethod
    def dialect_to_standard(text):
        return [{"role": "user", "content": "Convert the following sentence or word which is Jeju island dialect to standard Korean: " + text}]

    @staticmethod
    def standard_to_dialect(text):
        return [{"role": "user", "content": "Convert the following sentence or word which is standard Korean to Jeju island dialect: " + text}]

    @staticmethod
    def detect_dialect(text):
        return [{"role": "user", "content": "Detect the following sentence or word is Jeju island dialect or standard Korean: " + text}]

    @staticmethod
    def detect_dialect_and_convert(text):
        return [{"role": "user", "content": "Detect the following sentence or word is Jeju island dialect or standard Korean and convert the following sentence or word to Jeju island dialect or standard Korean: " + text}]


# The input below is a Jejueo sentence, so use the dialect-to-standard template.
outputs = pipeline(
    JejuPromptTemplate.dialect_to_standard("자이 폴에 독솔 막 난 거 보난 언 생이우다"),
    max_new_tokens=512,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.1,
    top_p=0.9,
)

# The last message of the returned conversation is the assistant's reply
# (a dict with "role" and "content").
print(outputs[0]["generated_text"][-1])
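
Since the four templates share one calling pattern, a small dispatcher can pick the prompt by task name and pull the reply text out of the result. The sketch below is illustrative only: `TASK_PROMPTS`, `build_messages`, and `extract_reply` are hypothetical helper names, not part of the model repository. The prompt strings are copied from `JejuPromptTemplate`, and the result shape assumed is the one the usage code above indexes, a list whose first item holds the conversation under `"generated_text"`. A stand-in result is used so the example runs without downloading the model.

```python
# Hypothetical helpers; prompt prefixes are copied from JejuPromptTemplate.
TASK_PROMPTS = {
    "dialect_to_standard": "Convert the following sentence or word which is Jeju island dialect to standard Korean: ",
    "standard_to_dialect": "Convert the following sentence or word which is standard Korean to Jeju island dialect: ",
    "detect_dialect": "Detect the following sentence or word is Jeju island dialect or standard Korean: ",
    "detect_dialect_and_convert": "Detect the following sentence or word is Jeju island dialect or standard Korean and convert the following sentence or word to Jeju island dialect or standard Korean: ",
}


def build_messages(task, text):
    """Return the single-turn chat message list for the named task."""
    if task not in TASK_PROMPTS:
        raise ValueError(f"unknown task: {task!r}; expected one of {sorted(TASK_PROMPTS)}")
    return [{"role": "user", "content": TASK_PROMPTS[task] + text}]


def extract_reply(outputs):
    """Return only the assistant's text from a chat-style pipeline result."""
    last_message = outputs[0]["generated_text"][-1]
    return last_message["content"]


# Offline check: a stand-in result shaped like the pipeline's chat output.
messages = build_messages("detect_dialect", "미깡낭")
stand_in = [{"generated_text": messages + [{"role": "assistant", "content": "(model reply)"}]}]
print(extract_reply(stand_in))
```

With the real model loaded, `pipeline(build_messages(task, text), ...)` would take the place of `stand_in`, using the same generation arguments as in the usage code.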

Future plans

  • JEJUMA-002 will be trained on data from all of Korea's dialects so that it can perform the same tasks for each of them.
  • JEJUMA-003 will be trained on data from all of Korea's dialects together with tasks that explain them, with the aim of partially translating varieties absent from the data (the Yanbian dialect, North Korean, or a third language).
  • JEJUMA-003 is the final stage of this research, which is why we use an 8B language model rather than a dedicated translation model or a smaller model.