from dataclasses import dataclass
from enum import Enum
@dataclass
class Task:
    benchmark: str
    metric: str
    col_name: str
# Select your tasks here
# ---------------------------------------------------
class Tasks(Enum):
    # task_key in the json file, metric_key in the json file, name to display in the leaderboard
    # task0 = Task("trickme", "acc", "Accuracy")
    task1 = Task("trickme", "avg_confidence", "Buzz Confidence")
NUM_FEWSHOT = 0  # Change to match your few-shot setting
# ---------------------------------------------------
# Your leaderboard name
TITLE = """
Adversarial Calibration QA Leaderboard
"""
# What does your leaderboard evaluate?
INTRODUCTION_TEXT = """
Build an open-domain QA system that can answer any question posed by humans! For more information, see https://sites.google.com/view/qanta/home
"""
# Which evaluations are you running? how can people reproduce what you have?
LLM_BENCHMARKS_TEXT = """
## QA variants
### Generative QA
This type of QA system aims to generate an answer to a given question directly.
#### Input
(1) `question` string
```
E.g. qa_pipe(question)
```
#### Output
Return a JSON object with (1) a `guess` string and (2) a `confidence` score: a float between 0 and 1 representing the probability that your guess is correct.
```
E.g. {'guess': 'Apple', 'confidence': 0.02}
```
Reminder: feel free to check the provided tutorial to see how to calculate the probability of the generated tokens!
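For instance, here is a minimal sketch of one way to do this with `transformers`. The checkpoint name and the choice of averaging the generated tokens' probabilities are placeholders/assumptions, not the tutorial's exact recipe:
```
# Illustrative sketch: derive a confidence from generated-token probabilities.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder checkpoint; substitute your own model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def qa_pipe(question):
    inputs = tokenizer(question, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_new_tokens=10,
        do_sample=False,
        return_dict_in_generate=True,
        output_scores=True,
    )
    # Log-probabilities of each generated token under the model
    transition_scores = model.compute_transition_scores(
        outputs.sequences, outputs.scores, normalize_logits=True
    )
    # One simple choice: average token probability as the confidence
    confidence = transition_scores[0].exp().mean().item()
    guess = tokenizer.decode(
        outputs.sequences[0, inputs["input_ids"].shape[1]:],
        skip_special_tokens=True,
    ).strip()
    return {"guess": guess, "confidence": confidence}
```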
### Extractive QA
This type of QA system aims to extract an answer span from a context passage for a given question.
#### Input
(1) `question` string, and (2) `context` string
```
E.g. qa_pipe(question=question, context=context)
```
#### Output
Return a JSON object with (1) a `guess` string and (2) a `confidence` score: a float between 0 and 1 representing the probability that your guess is correct.
```
E.g. {'guess': 'Apple', 'confidence': 0.02}
```
Reminder: if you are already using an extractive QA model, HF QA pipelines return a `score` out of the box, so you only need to expose that `score` as `confidence`.
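For example, a minimal sketch of such a wrapper (the checkpoint name is a placeholder):
```
# Illustrative sketch: expose the HF question-answering `score` as `confidence`.
from transformers import pipeline

extractive_pipe = pipeline(
    "question-answering",
    model="distilbert-base-cased-distilled-squad",  # placeholder checkpoint
)

def qa_pipe(question, context):
    # HF question-answering pipelines return {'score', 'start', 'end', 'answer'}
    out = extractive_pipe(question=question, context=context)
    return {"guess": out["answer"], "confidence": out["score"]}
```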
## Evaluation Metric
In our Adversarial Calibration QA task, we evaluate how reliable a QA model's performance is by measuring its calibration, i.e., how well the confidence attached to each guess reflects the probability that the guess is correct. To make this concrete, we adopt the concept of a "buzz" from trivia quizzes: a buzz happens whenever a player is confident enough to commit to a guess partway through a question. The same idea applies to model calibration, since we focus on whether the model's prediction probability matches its prediction accuracy. Our evaluation metric, `Average Expected Buzz`, quantifies the expected buzz confidence estimate.
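As a rough intuition only (this toy score illustrates why calibration matters; it is not the exact formula used by the leaderboard), you can think of each buzz as a wager: a correct confident guess earns its confidence, and a wrong confident guess loses it.
```
# Toy illustration only. NOT the leaderboard's exact Average Expected Buzz formula.
# A wager-style payoff rewards confident correct guesses and penalizes confident wrong ones.
def toy_expected_buzz(predictions):
    # predictions: list of dicts with a 'confidence' in [0, 1] and a boolean 'correct'
    payoffs = [p["confidence"] if p["correct"] else -p["confidence"] for p in predictions]
    return sum(payoffs) / len(payoffs)

# Same accuracy, but calibrated confidences score higher than overconfident ones.
overconfident = [{"confidence": 0.9, "correct": False}, {"confidence": 0.9, "correct": True}]
calibrated = [{"confidence": 0.1, "correct": False}, {"confidence": 0.9, "correct": True}]
print(toy_expected_buzz(overconfident))  # 0.0
print(toy_expected_buzz(calibrated))     # 0.4
```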
## FAQ
What if my system type is not specified here or not supported yet?
- Please send us an email so we can check how to adapt the leaderboard for your use case. Thanks!
I don't know where to start building a QA system for submission.
- Please check our submission tutorials. From there, you can fine-tune or build anything on top of the base models.
I want to use API-based QA systems for submission, like GPT-4. What should I do?
- We don't support API-based models yet, but you can train your model with the GPT cache we provide: https://github.com/Pinafore/nlp-hw/tree/master/models.
I have no idea why my model is not working. Could you help me?
- Yes! After your model submission is evaluated, you can check the first few examples and how their scores are calculated [here](https://huggingface.co/datasets/umdclip/qanta_leaderboard_logs)!
"""
EVALUATION_QUEUE_TEXT = """
**Step 1: Make sure it works locally**
After you have a QA system uploaded to HuggingFace (with a license specified), please run the following example code to check that your pipe returns the guess and confidence score in **JSON** format.
```
from transformers import pipeline
qa_pipe = pipeline(model="...", trust_remote_code=True)
# If it is a Generative QA pipeline
qa_pipe("Where is UMD?")
# If it is an Extractive QA pipeline
qa_pipe(question="Where is UMD?", context="UMD is in Maryland.")
```
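Optionally, a quick sanity check along these lines (illustrative, not required) can confirm the output format before you submit:
```
# Optional sanity check: confirm the pipe returns the required fields.
result = qa_pipe("Where is UMD?")  # or qa_pipe(question=..., context=...) for extractive QA
assert isinstance(result, dict), "output should be a JSON-style dict"
assert "guess" in result and "confidence" in result, "missing 'guess' or 'confidence'"
assert 0.0 <= float(result["confidence"]) <= 1.0, "'confidence' must be in [0, 1]"
```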
**Step 2: Fill in the submission form**
(1) Fill in the `QA model name`
(2) Fill in the `Revision commit`: if left empty, it defaults to `main`.
(3) Fill in the `Model type`
(4) `Precision` defaults to `float16`. You can update it as needed.
(5) You can leave the `Retrieved dataset name` and `Retriever model` fields empty, as we provide context for your extractive QA model. Let us know via email if you want to use your own context or retriever!
Here is a tutorial on how to make pipe wrappers for submissions: [Colab](https://colab.research.google.com/drive/1bCt2870SdY6tI4uE3JPG8_3nLmNJXX6_?usp=sharing)
"""
CITATION_BUTTON_LABEL = "Copy the following link for more details"
CITATION_BUTTON_TEXT = r"""
https://sites.google.com/view/qanta/home
"""