import os
import base64
from src.display.utils import ModelType

# Directory containing this file; the logo images are stored alongside it.
current_dir = os.path.dirname(os.path.realpath(__file__))

# Read each PNG logo and base64-encode it so it can be embedded inline as a data URI.
with open(os.path.join(current_dir, "main_logo.png"), "rb") as image_file:
    main_logo = base64.b64encode(image_file.read()).decode('utf-8')
with open(os.path.join(current_dir, "host_sponsor.png"), "rb") as image_file:
    host_sponsor = base64.b64encode(image_file.read()).decode('utf-8')

# The files are PNGs, so the data URIs declare image/png.
TITLE = f"""<img src="data:image/png;base64,{main_logo}" style="width:30%;display:block;margin-left:auto;margin-right:auto">"""
BOTTOM_LOGO = f"""<img src="data:image/png;base64,{host_sponsor}" style="width:75%;display:block;margin-left:auto;margin-right:auto">"""

INTRODUCTION_TEXT = f"""
The previous Leaderboard version is live [here](https://huggingface.co/spaces/choco9966/open-ko-llm-leaderboard-old) 📊 

🚀 The Open Ko-LLM Leaderboard2 🇰🇷 objectively evaluates the performance of Korean Large Language Models (LLMs). When you submit a model on the "Submit here!" page, it is automatically evaluated. 

This leaderboard is co-hosted by [Upstage](https://www.upstage.ai/) and [NIA](https://www.nia.or.kr/site/nia_kor/main.do), which provides various Korean datasets through [AI-Hub](https://aihub.or.kr/), and is operated by [Upstage](https://www.upstage.ai/). The GPUs used for evaluation are operated with the support of [KT](https://cloud.kt.com/) and [AICA](https://aica-gj.kr/main.php). Whereas Season 1 focused on evaluating LLM capabilities in reasoning, language understanding, hallucination, and commonsense through academic benchmarks, Season 2 focuses on assessing LLMs' practical abilities and reliability. The datasets for this season are sponsored by [Flitto](https://www.flitto.com/portal/en), [SELECTSTAR](https://selectstar.ai/ko/), and [KAIST AI](https://gsai.kaist.ac.kr/?lang=ko&ckattempt=1). The evaluation datasets are kept private and are available only for the evaluation process. More detailed information about the benchmark datasets is provided on the "About" page.

On that page you will find explanations of the evaluations we use, reproducibility guidelines, best practices for submitting a model, and our FAQ.
"""

LLM_BENCHMARKS_TEXT = f"""
# Motivation

While outstanding LLMs are being released at a rapid pace, most of them are centered on English and the English-speaking cultural sphere. We operate the Korean leaderboard, 🚀 Open Ko-LLM, to evaluate models that reflect the characteristics of the Korean language and Korean culture. We hope that users will find the leaderboard convenient to use, participate in it, and contribute to the advancement of Korean-language research.

## How it works

📈 We evaluate models on 9 key benchmarks using the [Eleuther AI Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness), a unified framework to test generative language models on a large number of different evaluation tasks.

- Ko-GPQA (provided by [Flitto](https://www.flitto.com/portal/en))
- Ko-WinoGrande (provided by [Flitto](https://www.flitto.com/portal/en))
- Ko-GSM8K (provided by [Flitto](https://www.flitto.com/portal/en))
- Ko-EQ-Bench (provided by [Flitto](https://www.flitto.com/portal/en))
- Ko-IFEval (provided by [Flitto](https://www.flitto.com/portal/en))
- KorNAT-Knowledge (provided by [SELECTSTAR](https://selectstar.ai/ko/) and [KAIST AI](https://gsai.kaist.ac.kr/?lang=ko&ckattempt=1))
- KorNAT-Social-Value (provided by [SELECTSTAR](https://selectstar.ai/ko/) and [KAIST AI](https://gsai.kaist.ac.kr/?lang=ko&ckattempt=1))
- Ko-Harmlessness (provided by [SELECTSTAR](https://selectstar.ai/ko/) and [KAIST AI](https://gsai.kaist.ac.kr/?lang=ko&ckattempt=1))
- Ko-Helpfulness (provided by [SELECTSTAR](https://selectstar.ai/ko/) and [KAIST AI](https://gsai.kaist.ac.kr/?lang=ko&ckattempt=1))

For all these evaluations, a higher score is a better score. We chose these benchmarks because they test reasoning, harmlessness, helpfulness, and general knowledge across a wide variety of fields in 0-shot and few-shot settings.

The final score is the average of the scores on the individual evaluation datasets.
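
As a rough sketch of this averaging step (not the leaderboard's actual code), the snippet below takes one made-up score per benchmark and returns their unweighted mean. The score values and the `final_score` helper are hypothetical, and equal weighting across benchmarks is an assumption.

```python
from statistics import mean

# Benchmark names come from the list above; the numbers are invented for illustration.
scores = [
    ("Ko-GPQA", 20.0),
    ("Ko-WinoGrande", 60.0),
    ("Ko-GSM8K", 40.0),
    ("Ko-EQ-Bench", 50.0),
    ("Ko-IFEval", 30.0),
    ("KorNAT-Knowledge", 45.0),
    ("KorNAT-Social-Value", 55.0),
    ("Ko-Harmlessness", 70.0),
    ("Ko-Helpfulness", 80.0),
]

def final_score(benchmark_scores):
    # Unweighted mean over all benchmarks (assumption: every benchmark counts equally).
    return mean(score for _, score in benchmark_scores)

print(final_score(scores))
```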

GPUs are provided by [KT](https://cloud.kt.com/) and [AICA](https://aica-gj.kr/main.php) for the evaluations.

## Results

- Detailed numerical results in the `results` Upstage dataset: https://huggingface.co/datasets/open-ko-llm-leaderboard/results
- Community queries and running status in the `requests` Upstage dataset: https://huggingface.co/datasets/open-ko-llm-leaderboard/requests
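
If you want to inspect the raw results locally, one possible approach is to download a snapshot of the dataset repository with `huggingface_hub`. This is a minimal sketch: the `download_results` helper is hypothetical, it assumes the dataset is publicly readable, and the layout of the files inside the snapshot is not described here.

```python
RESULTS_REPO = "open-ko-llm-leaderboard/results"

def download_results(local_dir=None):
    # Imported lazily so the snippet can be read without huggingface_hub installed.
    from huggingface_hub import snapshot_download
    # repo_type="dataset" because the results live in a dataset repo, not a model repo.
    # Returns the local folder that holds the downloaded files.
    return snapshot_download(repo_id=RESULTS_REPO, repo_type="dataset", local_dir=local_dir)
```

Calling `download_results()` needs network access; swapping `RESULTS_REPO` for the `requests` dataset id works the same way.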

## More resources

If you still have questions, you can check our FAQ [here](https://huggingface.co/spaces/upstage/open-ko-llm-leaderboard/discussions/1)!
"""


FAQ_TEXT = """
"""


EVALUATION_QUEUE_TEXT = f"""
# Evaluation Queue for the 🤗 Open Ko-LLM Leaderboard

Models added here will be automatically evaluated on the 🤗 cluster.

## Submission Disclaimer

**By submitting a model, you acknowledge that:**

- We store information about who submitted each model in the [Requests dataset](https://huggingface.co/datasets/open-ko-llm-leaderboard/requests).
- This practice helps maintain the integrity of our leaderboard, prevent spam, and ensure responsible submissions.
- Your submission will be visible to the community and you may be contacted regarding your model.
- Please submit carefully and responsibly 💛

## First Steps Before Submitting a Model

### 1. Ensure Your Model Loads with AutoClasses

Verify that you can load your model and tokenizer using AutoClasses:

```python
from transformers import AutoConfig, AutoModel, AutoTokenizer
# Replace with your Hub repo id and the branch, tag, or commit you plan to submit.
model_name = "your model name"
revision = "main"
config = AutoConfig.from_pretrained(model_name, revision=revision)
model = AutoModel.from_pretrained(model_name, revision=revision)
tokenizer = AutoTokenizer.from_pretrained(model_name, revision=revision)
```

Note:

- If this step fails, debug your model before submitting.
- Ensure your model is public.
- We are working on adding support for models requiring `trust_remote_code=True`.

### 2. Convert Weights to Safetensors

[Safetensors](https://huggingface.co/docs/safetensors/index) is a new format for storing weights which is safer and faster to load and use. It will also allow us to add the number of parameters of your model to the `Extended Viewer`!
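
One way to do the conversion, sketched under a few assumptions: the model is a causal LM that loads with `AutoModelForCausalLM`, and `convert_to_safetensors` and `has_safetensors` are hypothetical helper names, not part of any library.

```python
def has_safetensors(filenames):
    # True if at least one file in the listing uses the safetensors format.
    return any(name.endswith(".safetensors") for name in filenames)

def convert_to_safetensors(model_name, output_dir):
    # Lazy import: transformers is only needed when the conversion actually runs.
    from transformers import AutoModelForCausalLM
    # Assumption: the checkpoint is a causal LM, so the LM head is loaded and saved too.
    model = AutoModelForCausalLM.from_pretrained(model_name)
    # safe_serialization=True writes *.safetensors files instead of pickle-based *.bin files.
    model.save_pretrained(output_dir, safe_serialization=True)

print(has_safetensors(["config.json", "model.safetensors"]))
```

After converting, upload the contents of `output_dir` to your model repository and check the file listing with `has_safetensors`.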

### 3. Verify Your Model Has an Open License

This is a leaderboard for Open LLMs, and we'd love for as many people as possible to know they can use your model 🤗

### 4. Complete Your Model Card

When we add extra information about models to the leaderboard, it will be automatically taken from the model card.

### 5. Select Correct Precision

Choose the right precision to avoid evaluation errors:

- Not all models convert properly from float16 to bfloat16.
- Incorrect precision can cause issues (e.g., loading a bf16 model in fp16 may generate NaNs).
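
To see why the bf16-to-fp16 case can go wrong, the sketch below compares the numeric ranges of the two formats in plain Python. The `overflows_fp16` helper and the sample values are hypothetical; the point is only that float16 has a much smaller range than bfloat16, so large values turn into `inf` when cast, and later operations such as `inf - inf` yield NaN.

```python
FP16_MAX = 65504.0                  # largest finite float16 value
BF16_MAX = (2 - 2**-7) * 2.0**127   # largest finite bfloat16 value, about 3.39e38

def overflows_fp16(values):
    # Values whose magnitude exceeds the float16 range become inf when cast to float16.
    return [v for v in values if abs(v) > FP16_MAX]

# Made-up numbers; all of them lie within the bfloat16 range.
weights = [0.5, -3.2, 70000.0, -1e5]
print(overflows_fp16(weights))
```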

> Important: When submitting, git branches and tags will be strictly tied to the specific commit present at the time of submission to ensure revision consistency.
> 
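
If you want to know in advance which commit a branch or tag currently points to, a lookup through `huggingface_hub` is one option. This is a sketch, not the leaderboard's submission code: `resolve_commit` and `is_full_commit_sha` are hypothetical helpers, and the lookup needs network access and a public model.

```python
def is_full_commit_sha(revision):
    # A full git commit hash is 40 lowercase hexadecimal characters.
    return len(revision) == 40 and all(c in "0123456789abcdef" for c in revision)

def resolve_commit(model_name, revision="main"):
    # Lazy import: huggingface_hub is only needed when the lookup actually runs.
    from huggingface_hub import HfApi
    # model_info(...).sha is the commit the given branch or tag points to right now.
    return HfApi().model_info(model_name, revision=revision).sha

print(is_full_commit_sha("main"))
```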

## Model types

- 🟢 pretrained model: new base models, trained on a given text corpus using masked modelling
- 🟩 continuously pretrained model: new base models, further trained on an additional corpus (which may include IFT/chat data) using masked modelling
- 🔶 fine-tuned on domain-specific datasets: pretrained models fine-tuned on more data
- 💬 chat models (RLHF, DPO, IFT, ...): chat-like fine-tunes, using IFT (datasets of task instructions), RLHF, or DPO (which changes the model loss slightly with an added policy), etc.
- 🤝 base merges and MoErges: models that have been merged or fused without additional fine-tuning

Please provide information about the model through an issue! 🤩

🏴‍☠️ : This icon indicates that the community has flagged the model as a subject of caution, meaning that users should exercise caution when using it. Clicking on the icon will take you to a discussion about that model. (Models that used the evaluation set for training to achieve a high leaderboard ranking, among others, are flagged this way.)
"""

CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results. Authors of open-ko-llm-leaderboard are ordered alphabetically."
CITATION_BUTTON_TEXT = r"""
@inproceedings{park2024open,
      title={Open Ko-LLM Leaderboard: Evaluating Large Language Models in Korean with Ko-H5 Benchmark},
      author={Chanjun Park and Hyeonwoo Kim and Dahyun Kim and Seonghwan Cho and Sanghoon Kim and Sukyung Lee and Yungi Kim and Hwalsuk Lee},
      year={2024},
      booktitle={The 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024)}
}


@software{eval-harness,
  author       = {Gao, Leo and
                  Tow, Jonathan and
                  Biderman, Stella and
                  Black, Sid and
                  DiPofi, Anthony and
                  Foster, Charles and
                  Golding, Laurence and
                  Hsu, Jeffrey and
                  McDonell, Kyle and
                  Muennighoff, Niklas and
                  Phang, Jason and
                  Reynolds, Laria and
                  Tang, Eric and
                  Thite, Anish and
                  Wang, Ben and
                  Wang, Kevin and
                  Zou, Andy},
  title        = {A framework for few-shot language model evaluation},
  month        = sep,
  year         = 2021,
  publisher    = {Zenodo},
  version      = {v0.0.1},
  doi          = {10.5281/zenodo.5371628},
  url          = {https://doi.org/10.5281/zenodo.5371628},
}

@misc{rein2023gpqagraduatelevelgoogleproofqa,
  title={GPQA: A Graduate-Level Google-Proof Q&A Benchmark},
  author={David Rein and Betty Li Hou and Asa Cooper Stickland and Jackson Petty and Richard Yuanzhe Pang and Julien Dirani and Julian Michael and Samuel R. Bowman},
  year={2023},
  eprint={2311.12022},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2311.12022},
}

@article{sakaguchi2021winogrande,
  title={Winogrande: An adversarial winograd schema challenge at scale},
  author={Sakaguchi, Keisuke and Bras, Ronan Le and Bhagavatula, Chandra and Choi, Yejin},
  journal={Communications of the ACM},
  volume={64},
  number={9},
  pages={99--106},
  year={2021},
  publisher={ACM New York, NY, USA}
}

@article{cobbe2021training,
  title={Training verifiers to solve math word problems},
  author={Cobbe, Karl and Kosaraju, Vineet and Bavarian, Mohammad and Chen, Mark and Jun, Heewoo and Kaiser, Lukasz and Plappert, Matthias and Tworek, Jerry and Hilton, Jacob and Nakano, Reiichiro and others},
  journal={arXiv preprint arXiv:2110.14168},
  year={2021}
}

@article{paech2023eq,
  title={Eq-bench: An emotional intelligence benchmark for large language models},
  author={Paech, Samuel J},
  journal={arXiv preprint arXiv:2312.06281},
  year={2023}
}


@misc{zhou2023instructionfollowingevaluationlargelanguage,
  title={Instruction-Following Evaluation for Large Language Models},
  author={Jeffrey Zhou and Tianjian Lu and Swaroop Mishra and Siddhartha Brahma and Sujoy Basu and Yi Luan and Denny Zhou and Le Hou},
  year={2023},
  eprint={2311.07911},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2311.07911},
}

@article{lee2024kornat,
  title={KorNAT: LLM Alignment Benchmark for Korean Social Values and Common Knowledge},
  author={Lee, Jiyoung and Kim, Minwoo and Kim, Seungho and Kim, Junghwan and Won, Seunghyun and Lee, Hwaran and Choi, Edward},
  journal={arXiv preprint arXiv:2402.13605},
  year={2024}
}
"""