Mixnu

Mixnu is a Korean Large Language Model (LLM) built on a Mixture of Experts (MoE) architecture.

Model Usage

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("seemdog/Mixnu")
model = AutoModelForCausalLM.from_pretrained("seemdog/Mixnu")
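
A minimal generation sketch follows, continuing from the snippet above; the Korean prompt and the decoding parameters are illustrative assumptions, not settings prescribed by the model card.

# Continuing from the snippet above: tokenize a Korean prompt and generate a continuation.
inputs = tokenizer("이번에 완전 불수능이었어요.", return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=128,  # illustrative decoding settings
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))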

Tokenizer

Mixnu uses an extended version of the Llama-2 tokenizer.
The table below compares the vocabulary size, the percentage of unknown (UNK) tokens (i.e., tokens segmented at the byte level), and the average number of tokens per test sentence. The test sentences consist of carefully curated Korean text covering various domains (e.g., law, medicine, finance, and social studies).

| Model   | Vocab size | % of UNK tokens | Avg. num. of tokens |
|---------|------------|-----------------|---------------------|
| Llama-2 | 32,000     | 48.61           | 325.27              |
| Mixnu   | 46,262     | 0.96            | 125.91              |
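
The statistics above can be approximated with a simple counting loop of the following form; the sentence list and the byte-fallback check are illustrative assumptions, not the exact evaluation script.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("seemdog/Mixnu")

# Replace with the curated Korean test sentences (law, medicine, finance, social studies, ...).
test_sentences = ["예시 문장입니다."]

total_tokens, byte_tokens = 0, 0
for sent in test_sentences:
    tokens = tokenizer.tokenize(sent)
    total_tokens += len(tokens)
    # SentencePiece byte fallback produces tokens such as "<0xEA>" for uncovered characters.
    byte_tokens += sum(tok.startswith("<0x") for tok in tokens)

print("Avg. tokens per sentence:", total_tokens / len(test_sentences))
print("% of byte-level (UNK) tokens:", 100 * byte_tokens / total_tokens)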

Model Architecture

Employing Llama-2 7b with the extended tokenizer as the foundation model, we develop four different expert models on four separate training datasets.
As the figure shows, the four expert models are then combined into a single model using the Mixtral architecture, resulting in a 19B MoE model.

[Figure: Mixnu model architecture]
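
A rough sketch of how such an upcycled configuration can be expressed with the Mixtral classes in transformers; the backbone dimensions below are assumptions taken from Llama-2 7b, Mixnu's exact configuration may differ, and this builds a randomly initialized model of the same shape rather than loading the released weights.

from transformers import MixtralConfig, MixtralForCausalLM

config = MixtralConfig(
    vocab_size=46262,        # extended tokenizer vocabulary
    hidden_size=4096,        # Llama-2-7b-like backbone (assumed)
    intermediate_size=11008,
    num_hidden_layers=32,
    num_attention_heads=32,
    num_key_value_heads=32,  # Llama-2 7b uses full multi-head attention
    num_local_experts=4,     # one expert per training dataset
    num_experts_per_tok=2,   # Mixtral-style top-2 routing
)
model = MixtralForCausalLM(config)
print(sum(p.numel() for p in model.parameters()) / 1e9, "B parameters")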

Qualitative Evaluation

Pipeline

Using ChatGPT (GPT-4o) as the primary evaluator and verifying the results with human annotators, we evaluate the model in five categories:

Language: When adapting an existing model to a new language, we do not want the model to forget what it has learned about the previously acquired language. We therefore evaluate whether the model responds well in the original language when given a prompt in that language. The attribute used in the language category is 'English'; Korean is not listed separately because the prompts for all the other categories below are already in Korean.

Style: A sentence can be written in a literary or colloquial style. Generally, unless the discourse context changes, maintaining consistency in style across multiple sentences is considered natural. Therefore, we evaluate whether the model continues to generate sentences in the same style as the prompt when given a literary or colloquial prompt. The attributes used in the style category are 'literary' and 'colloquial'.

Domain: Models can learn knowledge from various fields simultaneously, so it is essential to examine specifically which knowledge the model possesses. We focus on evaluating knowledge in specialized fields, with an eye to the potential application of LLMs in those domains. The attributes used in the domain category are 'daily life', 'medicine', 'law', 'technology', and 'finance'. For every domain except 'daily life', the prompts are constructed from excerpts of actual newspaper articles in that domain.

Localization: When adapting a language model to another language, the linguistic transition is crucial, but the true measure of the model's generation ability is whether its sentences reflect the culture of the country where that language is spoken. In this qualitative assessment, we evaluate the degree of 'Koreanization': how well the trained model understands the uniquely Korean expressions and cultural references in the prompts. The attributes used in the localization category are 'grammar', 'proverbs', 'idioms', and 'culture'.

Hallucination: Models can generate sentences that seem plausible on the surface but are actually inaccurate or unrelated to the input. Because this occurs in subtle ways, it can be difficult to capture with quantitative evaluation, making qualitative assessment particularly important. We include words in the prompt that can easily be confused with other words because of syllabic overlap, and observe whether the model grasps the accurate meaning of these words and generates relevant sentences. The attribute used here is simply 'hallucination' itself.

[Figure: evaluation categories and attributes]

Prompt

{
"role": "system", "content": "You are a strict evaluator of text generation models."

            
Please strictly evaluate the performance of five generative models using qualitative evaluation. Provide the fluency, relevance, and accuracy scores of models A, B, C, D, and E.


Fluency Rubric:
    5 points: Sentences are fluent without noise.
    4 points: Sentences contain some noise that does not significantly degrade overall fluency, or relevant noise such as PII (e.g., phone numbers, addresses).
    3 points: Sentences are generally coherent but contain unnecessary and irrelevant noise.
    2 points: Severe noise is present (e.g., word repetition, symbols). The output makes sense at the phrase level but not at the sentence level.
    1 point: The output makes no sense at all.

Relevance Rubric:
    5 points: Sentences fully answer the prompt's question.
    4 points: Sentences include a direct answer to the prompt's question, though not entirely consistent or coherent.
    3 points: Sentences provide an indirect and low-quality answer to the prompt's question.
    2 points: Sentences do not answer the prompt's question but are relevant to the topic.
    1 point: Sentences are not relevant to the prompt.

Accuracy Rubric:
    5 points: All information and details are completely accurate with no errors or distortions.
    4 points: Most information is accurate with only minor errors that do not significantly affect understanding.
    3 points: The main information is accurate, but there are some significant errors.
    2 points: There are numerous errors, and only some of the information is accurate.
    1 point: The information is mostly incorrect or misleading, with very few accurate details.

Prompt: {Prompt}

A: {Response from Model A}

B: {Response from Model B}

C: {Response from Model C}

D: {Response from Model D}

E: {Response from Model E}

Deduct fluency points if the model generates output in a different language from the prompt.
}
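
A hedged sketch of how this template could be sent to the judge model through the OpenAI chat completions API; the client setup, model name, and placeholder string are illustrative assumptions, not the published evaluation harness.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

system_msg = "You are a strict evaluator of text generation models."
# Substitute the filled-in rubric template shown above (prompt plus the five model responses).
user_msg = "<filled-in rubric template>"

completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": system_msg},
        {"role": "user", "content": user_msg},
    ],
    temperature=0,  # deterministic scoring
)
print(completion.choices[0].message.content)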

Evaluation Results

[Figure: qualitative evaluation results]

Response Examples

Prompt

์ด๋ฒˆ์— ์™„์ „ ๋ถˆ์ˆ˜๋Šฅ์ด์—ˆ์–ด์š”.

Responses

Mixnu
๊ตญ์–ด๋Š” ์ž‘๋…„๋ณด๋‹ค ์–ด๋ ค์›Œ์กŒ๊ณ , ์˜์–ด๋Š” ์ž‘๋…„ ์ˆ˜๋Šฅ๋ณด๋‹ค ์‰ฌ์›Œ์กŒ๋‹ค๊ณ  ํ•˜์ง€๋งŒ, ์ˆ˜ํ—˜์ƒ๋“ค์ด ์ฒด๊ฐํ•˜๋Š” ๋‚œ๋„๋Š” ํ›จ์”ฌ ๋” ์–ด๋ ค์› ์„ ๊ฒ๋‹ˆ๋‹ค.

Llama-2 7b
๋ถˆ์ˆ˜๋Šฅ์ด๋ž€ ๋ถˆ์ˆ˜ ์—†์–ด์„œ ํ•  ์ˆ˜ ์—†์—ˆ๋˜ ์ž‘์—…์ด๋ผ๋Š” ๊ฒƒ์„ ๋œปํ•˜๋Š”๋ฐ์š”.\n์™œ ๋ถˆ์ˆ˜๋Šฅ์ด ๋˜๋Š”์ง€ ์•„๋ฌด๋„ ๋ชจ๋ฅด๊ฒ ์Šต๋‹ˆ๋‹ค.\n๋ถˆ์ˆ˜๋Šฅ์ด ๋˜๋Š” ์ˆœ๊ฐ„ ์™ธ๊ณ„์ธ๋“ค์ด ๋‚˜ํƒ€๋‚˜๋Š” ๊ฑธ ์•Œ๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค

Llama-3 8b
 2์‹œ๊ฐ„๋™์•ˆ 1๋ฒˆ 2๋ฒˆ 3๋ฒˆ 4๋ฒˆ 5๋ฒˆ 6๋ฒˆ ๋ฌธ์ œ๋ฅผ ํ’€์—ˆ์Šต๋‹ˆ๋‹ค.\n์ด๋ฒˆ์— 1๋ฒˆ๋ฌธ์ œ๋Š” ์ˆ˜๋Šฅ์—์„œ ๋‚˜์™”๋˜ ๋ฌธ์ œ์˜€์–ด์š”. 2๋ฒˆ์€ ์ˆ˜๋Šฅ์—์„œ ๋‚˜์˜จ ๋ฌธ์ œ๊ฐ€ ์•„๋‹™๋‹ˆ๋‹ค.\n๊ทธ๋ฆฌ๊ณ  3๋ฒˆ๋ฌธ์ œ๋„ ์ˆ˜๋Šฅ์—์„œ ๋‚˜์˜จ ๋ฌธ์ œ์˜€์–ด์š”.\n4๋ฒˆ๋ฌธ์ œ๋Š” ์ˆ˜๋Šฅ์—์„œ ๋‚˜์˜จ ๋ฌธ์ œ๊ฐ€ ์•„๋‹ˆ์˜€์–ด์š”.\n5๋ฒˆ๋ฌธ์ œ๋Š” ์ˆ˜๋Šฅ์—์„œ ๋‚˜์˜จ ๋ฌธ์ œ๊ฐ€ ์•„๋‹ˆ์˜€์•„์š”.\n6๋ฒˆ๋ฌธ์ œ๋Š” ์ˆ˜๋Šฅ์—์„œ ๋‚˜์˜จ๋ฌธ์ œ๊ฐ€ ์•„๋‹ˆ์˜€์–ด์š”.\n๊ทธ๋ฆฌ๊ณ  7๋ฒˆ๋ฌธ์ œ๋„ ์ˆ˜๋Šฅ์—์„œ ๋‚˜์˜จ๋ฌธ์ œ๊ฐ€ ์•„๋‹ˆ์˜€์–ด์š”.\n๊ทธ๋ฆฌ๊ณ  8๋ฒˆ๋ฌธ์ œ๋„ ์ˆ˜๋Šฅ์—์„œ ๋‚˜์˜จ๋ฌธ์ œ๊ฐ€ ์•„๋‹ˆ์˜€์–ด์š”.\n9๋ฒˆ๋ฌธ์ œ๋„ ์ˆ˜๋Šฅ์—์„œ ๋‚˜์˜จ ๋ฌธ์ œ๊ฐ€ ์•„๋‹ˆ์˜€์–ด์š”.\n10๋ฒˆ๋ฌธ์ œ๋„ ์ˆ˜๋Šฅ์—์„œ ๋‚˜์˜จ ๋ฌธ์ œ๊ฐ€ ์•„๋‹ˆ์˜€์–ด์š”.\n11๋ฒˆ๋ฌธ์ œ๋„ ์ˆ˜๋Šฅ์—์„œ ๋‚˜์˜จ ๋ฌธ์ œ๊ฐ€ ์•„๋‹ˆ์˜€์–ด์š”.\n12๋ฒˆ๋ฌธ์ œ๋„ ์ˆ˜๋Šฅ์—์„œ ๋‚˜์˜จ ๋ฌธ์ œ๊ฐ€ ์•„๋‹ˆ์˜€์–ด์š”.\n13๋ฒˆ๋ฌธ์ œ๋„ ์ˆ˜๋Šฅ์—์„œ ๋‚˜์˜จ ๋ฌธ์ œ๊ฐ€ ์•„๋‹ˆ์˜€์–ด์š”.\n์ด๋ฒˆ์— 14๋ฒˆ๋ฌธ์ œ๋„ ์ˆ˜๋Šฅ์—์„œ ๋‚˜์˜จ ๋ฌธ์ œ๊ฐ€ ์•„๋‹ˆ์˜€์–ด์š”.\n15๋ฒˆ๋ฌธ์ œ๋„

open-llama-2-ko-7b
ไปŠๅ›žใฏๅฎŒๅ…จใซๆต…ๆฐดใงใ—ใŸใ€‚\nไปŠๅ›žใฏๅฎŒๅ…จใซไธ้›ฃใงใ—ใŸใ€‚\n\nไปŠๅ›žใฏๅฎŒๅ…จใซไธ้›ฃใงใ—ใŸใ€‚

polyglot-ko-12.8b
๊ทธ๋ž˜์„œ ๋” ์–ด๋ ค์› ์„๊ฑฐ์—์š”.\u200b๊ทธ๋ž˜์„œ ๋”๋”๋” ๊ธด์žฅ๋˜๊ณ , ๋”๋”๋” ํž˜๋“ค์—ˆ์„๊ฑฐ์—์š”.\u200b\u200b\u200bํ•˜์ง€๋งŒ, \u200b\u200b\u200b์šฐ๋ฆฌ๋Š” ์ˆ˜๋Šฅ์ด๋ผ๋Š” ํฐ ์‚ฐ์„ ๋„˜์—ˆ์œผ๋‹ˆ,\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b \u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b \u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b

Ko-Mixtral-v1.3-MoE-7Bx2
๊ทธ๋ž˜์„œ ์ €๋Š” ์•…์ˆ˜ ์—†์„ ๊ฑฐ์˜ˆ์š”, ๊ทธ๋ฆฌ๊ณ  ๋‚˜๋Š” ์ด ์‚ฌ๋žŒ์—๊ฒŒ ์•…์ˆ˜ํ•˜์ง€ ์•Š์„ ๊ฑฐ์˜ˆ์š”. ๊ทธ๋Ÿฐ ๋‹ค์Œ ๋‚˜๋Š” ์•…์ˆ˜๋ฅผ ํ•˜์ง€ ์•Š์„ ๊ฒƒ์ด๋ผ๊ณ  ๋งํ–ˆ๊ณ , ๋‚˜๋Š” ์•…์ˆ˜ํ•˜์ง€ ์•Š์„ ๊ฒƒ์ด๋ผ๊ณ  ๋งํ–ˆ๊ณ , ๋‚˜๋Š” ์•…์ˆ˜ํ•˜์ง€ ์•Š์„ ๊ฒƒ์ด๋ผ๊ณ  ๋งํ–ˆ์–ด์š”.\n\n\n\n23:32\n\n๊ทธ๋ฆฌ๊ณ  ์—ฌํ˜ธ์™€๋Š” ๊ทธ์˜ ๋ง์„ ๋“ค์—ˆ๊ณ , ๊ทธ์˜ ๋ง์„ ๋“ค์—ˆ๊ณ , ๊ทธ์˜ ๋ง์„ ๋“ค์—ˆ๊ณ , ๊ทธ์˜ ๋ง์„ ๋“ค์—ˆ๊ณ , ๊ทธ์˜ ๋ง์„ ๋“ค์—ˆ๊ณ , ๊ทธ์˜ ๋ง์„ ๋“ค์—ˆ๊ณ , ๊ทธ์˜ ๋ง์„ ๋“ค์—ˆ๊ณ , ๊ทธ์˜ ๋ง์„ ๋“ค์—ˆ๊ณ , ๊ทธ์˜ ๋ง์„ ๋“ค์—ˆ๊ณ , ๊ทธ์˜ ๋ง์„ ๋“ค์—ˆ

Authors

Jean Seo(seemdog@snu.ac.kr)
Sungjoo Byun(byunsj@snu.ac.kr)
Jaeyoon Kim(toscour345@snu.ac.kr)
Hyemi Jo(huimei6361@snu.ac.kr)
Chaeyoung Oh(nyong10@snu.ac.kr)
Dongjun Jang(qwer4107@snu.ac.kr)
Corresponding author: Hyopil Shin(hpshin@snu.ac.kr)

Acknowledgement

Mixnu was developed by the SNU CL_NLP Lab with the support of BIGDATABUB UNIVERSITY.

Citation

@misc{jean_seo_2024,
    author       = { {Jean Seo} },
    title        = { Mixnu (Revision 0bf07fb) },
    year         = 2024,
    url          = { https://huggingface.co/seemdog/Mixnu },
    doi          = { 10.57967/hf/2809 },
    publisher    = { Hugging Face }
}