Can't reproduce HellaSwag result - getting 42.3% vs. 71.4% reported

#67
by robgarct - opened

Hi! Hope all is well.

I'm trying to reproduce the HellaSwag result using lm-evaluation-harness. Following the discussion at https://huggingface.co/google/gemma-2b/discussions/18, I:

  1. Pulled lm-evaluation-harness from commit b281b0921b636bc36ad05c0b0b0763bd6dd43463 and set it up in a fresh conda environment (exact commands sketched after the results below)
  2. Ran:
$ python main.py --model hf-causal-experimental  --model_args pretrained=google/gemma-2b,dtype=float32  --tasks hellaswag  --device cuda:0  --batch_size 1 
  3. Got the following results:
{
  "results": {
    "hellaswag": {
      "acc": 0.34116709818761204,
      "acc_stderr": 0.0047313244091332675,
      "acc_norm": 0.42342162915753834,
      "acc_norm_stderr": 0.004930911515084784
    }
  },
  "versions": {
    "hellaswag": 0
  },
  "config": {
    "model": "hf-causal-experimental",
    "model_args": "pretrained=google/gemma-2b,dtype=float32",
    "num_fewshot": 0,
    "batch_size": "1",
    "batch_sizes": [],
    "device": "cuda:0",
    "no_cache": false,
    "limit": null,
    "bootstrap_iters": 100000,
    "description_dict": {}
  }
}
hf-causal-experimental (pretrained=google/gemma-2b,dtype=float32), limit: None, provide_description: False, num_fewshot: 0, batch_size: 1
|  Task   |Version| Metric |Value |   |Stderr|
|---------|------:|--------|-----:|---|-----:|
|hellaswag|      0|acc     |0.3412|±  |0.0047|
|         |       |acc_norm|0.4234|±  |0.0049|
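For completeness, the setup in step 1 was roughly the following (a sketch; the repo URL, Python version, and editable install are my assumptions about the standard EleutherAI setup):

# fresh environment, then the harness pinned to the commit from the linked discussion
$ conda create -n lm-eval python=3.10 -y
$ conda activate lm-eval
$ git clone https://github.com/EleutherAI/lm-evaluation-harness.git
$ cd lm-evaluation-harness
$ git checkout b281b0921b636bc36ad05c0b0b0763bd6dd43463
$ pip install -e .   # editable install of the harness and its dependencies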

Am I doing something obviously wrong? As the output shows, I'm getting an acc_norm of 42.3% (acc of 34.1%). However, the paper reports 71.4% on HellaSwag (similar to the 71.77% listed on the Open LLM Leaderboard: https://huggingface.co/spaces/open-llm-leaderboard-old/open_llm_leaderboard).

Thanks in advance!

Taking a deeper look at the Open LLM Leaderboard details (https://huggingface.co/spaces/open-llm-leaderboard-old/open_llm_leaderboard, https://huggingface.co/datasets/open-llm-leaderboard-old/details_google__gemma-2b), if I'm reading them correctly, it seems the 71.77% HellaSwag accuracy listed on the leaderboard was obtained using 10 few-shot examples, per:

...
    "harness|hellaswag|10": {
      "hashes": {
        "hash_examples": "e1768ecb99d7ecf0",
        "hash_full_prompts": "0b4c16983130f84f",
        "hash_input_tokens": "11490eb47260730b",
        "hash_cont_tokens": "6a8516a792e1673e"
      },
      "truncated": 0,
      "non_truncated": 10042,
      "padded": 40055,
      "non_padded": 113,
      "effective_few_shots": 10.0,
      "num_truncated_few_shots": 0
    },
...

Is this reading correct?
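If it is, I'd expect a 10-shot rerun to land much closer to the 71.77% figure, something like the following (same command as above, just adding the harness's --num_fewshot flag; I haven't run this yet, so treat it as a guess at how the leaderboard number was produced):

$ python main.py --model hf-causal-experimental --model_args pretrained=google/gemma-2b,dtype=float32 --tasks hellaswag --num_fewshot 10 --device cuda:0 --batch_size 1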

Also, as noted above, the paper (https://arxiv.org/pdf/2403.08295) and its Hugging Face page (https://huggingface.co/google/gemma-2b) list a similar accuracy (71.4%) for HellaSwag, but state that it was obtained 0-shot. Is there any way to replicate the 0-shot results listed there through lm-eval-harness or lighteval?

Thanks in advance
