Update on the Evaluation Method for Ko Common Gen V2

#11
by Limerobot - opened

Previously, we evaluated models with datasets in the MMLU style (which considers only the letter/number of the answer) and the ARC style (which considers only the answer sentence), but not with the widely adopted AI Harness style (which considers both the answer's letter and the full answer sentence). To address this gap, we have modified the KoCommonGen dataset so that models are evaluated with the AI Harness method. More information can be found in the AI Harness write-up (https://huggingface.co/blog/evaluating-mmlu-leaderboard).

AI Harness (MMLU+ARC) Method

concept set: {I, moral, content, topic, lecture, do}
1. I give a lecture with a moral theme.
2. I did not give a lecture on a moral topic.
3. I gave a lecture because of a moral topic.
4. Moral content makes me lecture.
Answer: 2. I did not give a lecture on a moral topic.

MMLU Method

concept set: {I, moral, content, topic, lecture, do}
1. I give a lecture with a moral theme.
2. I did not give a lecture on a moral topic.
3. I gave a lecture because of a moral topic.
4. Moral content makes me lecture.
Answer: 2

ARC Method

concept set: {I, moral, content, topic, lecture, do}
1. I give a lecture with a moral theme.
2. I did not give a lecture on a moral topic.
3. I gave a lecture because of a moral topic.
4. Moral content makes me lecture.
Answer: I did not give a lecture on a moral topic.
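The three framings above differ only in which continuation string the model is scored on for each choice. A minimal sketch of that difference, using a hypothetical helper (`build_targets` is our illustration, not the actual harness code):

```python
def build_targets(choices, method):
    """Return the candidate continuations a model would be scored on.

    choices: answer sentences in order (numbered 1..N in the prompt).
    method:  'harness' -> number plus sentence (e.g. "2. <sentence>"),
             'mmlu'    -> number only (e.g. "2"),
             'arc'     -> sentence only.
    """
    if method == "harness":
        return [f"{i}. {c}" for i, c in enumerate(choices, start=1)]
    if method == "mmlu":
        return [str(i) for i in range(1, len(choices) + 1)]
    if method == "arc":
        return list(choices)
    raise ValueError(f"unknown method: {method}")


choices = [
    "I give a lecture with a moral theme.",
    "I did not give a lecture on a moral topic.",
    "I gave a lecture because of a moral topic.",
    "Moral content makes me lecture.",
]
print(build_targets(choices, "mmlu"))     # ['1', '2', '3', '4']
print(build_targets(choices, "harness")[1])
```

Under any of the three methods, the predicted answer is the candidate to which the model assigns the highest log-likelihood; only the target strings change.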

Additionally, in the previous dataset the distribution of correct answer positions was skewed to one side. We adjusted this to match MMLU and ARC, so that each answer position is equally likely: the correct answers in the test set were shuffled so that positions 1, 2, 3, and 4 each appear with 25% probability.
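The rebalancing step can be sketched as follows. This is a hypothetical implementation, assuming each item stores its choice list plus a gold-answer index; the leaderboard's actual script may differ:

```python
import random


def rebalance(items, seed=42):
    """Shuffle each item's answer choices so the gold answer lands in a
    uniformly random position (~25% per slot with four choices)."""
    rng = random.Random(seed)
    out = []
    for choices, gold in items:
        order = list(range(len(choices)))
        rng.shuffle(order)
        shuffled = [choices[i] for i in order]
        # the gold sentence is unchanged; only its position moves
        out.append((shuffled, order.index(gold)))
    return out


item = (["sentence A", "sentence B", "sentence C", "sentence D"], 1)
new_choices, new_gold = rebalance([item])[0]
print(new_choices[new_gold])  # always "sentence B", whatever its new slot
```

A fixed seed keeps the shuffled test set reproducible across evaluation runs.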

Models submitted will be re-evaluated sequentially. During the re-evaluation period, they may temporarily disappear from the leaderboard or show unstable scores.

We plan to continuously evaluate models with a wider variety of methods and to add more datasets in the future. Because of data additions or changes, some models' scores may change; we kindly ask for your understanding in advance. We are committed to ensuring that more models are evaluated on a more comprehensive set of data.

NLP & AI Lab at Korea University / Upstage

hunkim pinned discussion

Hello. Starting now, we will re-evaluate ko-commongenv2 scores. To prevent confusion in the results, all scores currently on the leaderboard will be deleted. To avoid inconvenience when comparing scores during re-evaluation, and to allow past scores to be checked, we have created a copy of the current leaderboard at the link below:

https://huggingface.co/spaces/choco9966/open-ko-llm-leaderboard

Please note that models submitted to choco9966/open-ko-llm-leaderboard will not be evaluated; use it only as a reference.

μ•ˆλ…•ν•˜μ„Έμš”. ν˜„ μ‹œμ λΆ€ν„° ko-commongenv2 점수의 μž¬μΈ‘μ •μ„ μ§„ν–‰ν•˜λ €κ³  ν•©λ‹ˆλ‹€. 결과물의 ν˜Όλ™μ„ 막기 μœ„ν•΄ λ¦¬λ”λ³΄λ“œμ— 올라온 μ μˆ˜λŠ” λͺ¨λ‘ μ‚­μ œν•˜κ³  μ§„ν–‰ν•˜λ €ν•©λ‹ˆλ‹€. μž¬ν‰κ°€λ˜λŠ” λ™μ•ˆμ˜ 점수 λΉ„κ΅μ˜ λΆˆνŽΈν•¨μ„ λ°©μ§€ν•˜κ³  과거의 점수λ₯Ό 확인할 수 μžˆλ„λ‘ μ•„λž˜μ˜ 링크둜 ν˜„μž¬ λ¦¬λ”λ³΄λ“œμ˜ 볡사본을 λ§Œλ“€μ–΄λ‘μ—ˆμŠ΅λ‹ˆλ‹€.

https://huggingface.co/spaces/choco9966/open-ko-llm-leaderboard

참고둜, μœ„μ˜ λ¦¬λ”λ³΄λ“œμ— 제좜된 λͺ¨λΈμ€ 채점은 λ˜μ§€λŠ” μ•ŠμœΌλ‹ˆ μ°Έκ³ ν•˜μ‹œκΈ° λ°”λžλ‹ˆλ‹€.

What evaluation metrics are used to evaluate Common Gen V2? @choco9966 @Limerobot
