from datetime import datetime
import pytz
ABOUT_TEXT = """
## Overview
HREF is an evaluation benchmark that measures how well language models follow human instructions. It consists of 4,258 instructions covering 11 distinct categories, including Brainstorm, Open QA, Closed QA, Extract, Generation, Rewrite, Summarize, Coding, Classify, Fact Checking or Attributed QA, Multi-Document Synthesis, and Reasoning Over Numerical Data.
![image/png](https://cdn-uploads.huggingface.co/production/uploads/64dff1ddb5cc372803af964d/dSv3U11h936t_q-aiqbkV.png)
## Generation Configuration
For reproducibility, we use greedy decoding for all model generations by default. We apply chat templates to the instructions if they are implemented in the model's tokenizer or explicitly recommended by the model's creators. Please contact us if you would like to change this default configuration.
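As a rough sketch, this default looks like the following with a Hugging Face `transformers` model (the model name and `max_new_tokens` below are illustrative placeholders, not HREF's exact settings):
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # example model, not a fixed HREF choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

instruction = "Summarize the following paragraph ..."
if tokenizer.chat_template is not None:
    # Use the model's own chat template when one is implemented
    input_ids = tokenizer.apply_chat_template(
        [{"role": "user", "content": instruction}],
        add_generation_prompt=True,
        return_tensors="pt",
    )
else:
    input_ids = tokenizer(instruction, return_tensors="pt").input_ids

# Greedy decoding: no sampling, a single beam
output = model.generate(input_ids, do_sample=False, num_beams=1, max_new_tokens=512)
print(tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True))
```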
## Why HREF
| Benchmark | Size | Evaluation Method | Baseline Model | Judge Model | Task Oriented | Contamination Resistant | Contains Human Reference|
|--------------------|-------|------------|----------------|----------------|----------|------------|-----------|
| MT-Bench | 80 | Score | --- | gpt4 | βœ“ | βœ— | βœ— |
| AlpacaEval 2.0 | 805 | PWC | gpt4-turbo | gpt4-turbo | βœ— | βœ— | βœ— |
| Chatbot Arena | --- | PWC | --- | Human | βœ— | βœ“ | βœ— |
| Arena-Hard | 500 | PWC | gpt4-0314 | gpt4-turbo | βœ— | βœ— | βœ— |
| WildBench | 1,024 | Score/PWC | gpt4-turbo | three models | βœ— | βœ— | βœ— |
| **HREF** | 4,258 | PWC | Llama-3.1-405B-Instruct | Llama-3.1-70B-Instruct | βœ“ | βœ“ | βœ“ |
- **Human Reference**: HREF leverages human-written answers as references to provide more reliable evaluation than previous methods (see the sketch after this list).
- **Large**: HREF has the largest evaluation size among similar benchmarks, making its evaluation more reliable.
- **Contamination-resistant**: HREF's evaluation set is kept hidden, and it uses public models for both the baseline model and the judge model, which makes it resistant to contamination.
- **Task Oriented**: Instead of instructions naturally collected from users, HREF contains instructions written specifically to target 8 distinct categories used in instruction tuning, which allows it to provide more insight into how to improve language models.
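The snippet below is only a schematic illustration of human-reference-guided pairwise comparison (PWC); the prompt wording is a placeholder, not HREF's exact judge template, which lives in the code release.
```python
def build_judge_prompt(instruction, human_reference, response_a, response_b):
    # Placeholder judge prompt: the instruction, the human-written reference,
    # and the two responses being compared (target model vs. baseline model).
    return (
        "You are comparing two responses to an instruction.\n\n"
        f"Instruction:\n{instruction}\n\n"
        f"Human-written reference answer:\n{human_reference}\n\n"
        f"Response A:\n{response_a}\n\n"
        f"Response B:\n{response_b}\n\n"
        "Using the reference as a guide, answer with 'A' or 'B' to indicate "
        "which response follows the instruction better."
    )

# The judge model (e.g. Llama-3.1-70B-Instruct) answers 'A' or 'B'; the target
# model's win rate against the baseline is then aggregated over all instructions.
```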
## Contact Us
TODO
"""
# Get Pacific time zone (handles PST/PDT automatically)
pacific_tz = pytz.timezone('America/Los_Angeles')
current_time = datetime.now(pacific_tz).strftime("%H:%M %Z, %d %b %Y")
TOP_TEXT = f"""# HREF: Human Reference Guided Evaluation for Instruction Following
[Code]() | [Validation Set]() | [Human Agreement Set]() | [Results]() | [Paper]() | Total models: {{}} | Last restart: {current_time}
"""