Can't reproduce HellaSwag result - getting 42.3% vs. 71.4% reported

#67
by robgarct - opened

Hi! Hope all is well.

I'm trying to reproduce the HellaSwag result using lm-evaluation-harness. Following the discussion at https://huggingface.co/google/gemma-2b/discussions/18, I:

  1. Pulled lm-evaluation-harness from commit b281b0921b636bc36ad05c0b0b0763bd6dd43463 and set it up in a fresh conda environment (exact commands sketched after the results below)
  2. Ran:
$ python main.py --model hf-causal-experimental  --model_args pretrained=google/gemma-2b,dtype=float32  --tasks hellaswag  --device cuda:0  --batch_size 1 
  3. Got the following results:
{
  "results": {
    "hellaswag": {
      "acc": 0.34116709818761204,
      "acc_stderr": 0.0047313244091332675,
      "acc_norm": 0.42342162915753834,
      "acc_norm_stderr": 0.004930911515084784
    }
  },
  "versions": {
    "hellaswag": 0
  },
  "config": {
    "model": "hf-causal-experimental",
    "model_args": "pretrained=google/gemma-2b,dtype=float32",
    "num_fewshot": 0,
    "batch_size": "1",
    "batch_sizes": [],
    "device": "cuda:0",
    "no_cache": false,
    "limit": null,
    "bootstrap_iters": 100000,
    "description_dict": {}
  }
}
hf-causal-experimental (pretrained=google/gemma-2b,dtype=float32), limit: None, provide_description: False, num_fewshot: 0, batch_size: 1
|  Task   |Version| Metric |Value |   |Stderr|
|---------|------:|--------|-----:|---|-----:|
|hellaswag|      0|acc     |0.3412|±  |0.0047|
|         |       |acc_norm|0.4234|±  |0.0049|
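For completeness, the setup in step 1 was roughly the following (a sketch; the repo URL, Python version, and editable install are my assumptions about the standard EleutherAI setup):

# fresh environment, then the harness pinned to the commit from the linked discussion
$ conda create -n lm-eval python=3.10 -y
$ conda activate lm-eval
$ git clone https://github.com/EleutherAI/lm-evaluation-harness.git
$ cd lm-evaluation-harness
$ git checkout b281b0921b636bc36ad05c0b0b0763bd6dd43463
$ pip install -e .   # editable install of the harness and its dependencies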

Am I doing something obviously wrong? As the output shows, I'm getting an acc_norm of 42.3% (acc of 34.1%). However, the paper reports 71.4% on HellaSwag (similar to the 71.77% listed on the Open LLM Leaderboard: https://huggingface.co/spaces/open-llm-leaderboard-old/open_llm_leaderboard).

Thanks in advance!

Taking a deeper look at the Open LLM Leaderboard details (https://huggingface.co/spaces/open-llm-leaderboard-old/open_llm_leaderboard, https://huggingface.co/datasets/open-llm-leaderboard-old/details_google__gemma-2b), if I'm reading them correctly, it seems the 71.77% HellaSwag accuracy listed on the leaderboard was obtained using 10 few-shot examples, per:

...
    "harness|hellaswag|10": {
      "hashes": {
        "hash_examples": "e1768ecb99d7ecf0",
        "hash_full_prompts": "0b4c16983130f84f",
        "hash_input_tokens": "11490eb47260730b",
        "hash_cont_tokens": "6a8516a792e1673e"
      },
      "truncated": 0,
      "non_truncated": 10042,
      "padded": 40055,
      "non_padded": 113,
      "effective_few_shots": 10.0,
      "num_truncated_few_shots": 0
    },
...

Is this reading correct?
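If it is, I'd expect a 10-shot rerun to land much closer to the 71.77% figure, something like the following (same command as above, just adding the harness's --num_fewshot flag; I haven't run this yet, so treat it as a guess at how the leaderboard number was produced):

$ python main.py --model hf-causal-experimental --model_args pretrained=google/gemma-2b,dtype=float32 --tasks hellaswag --num_fewshot 10 --device cuda:0 --batch_size 1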

Also, as noted above, the paper (https://arxiv.org/pdf/2403.08295) and its Hugging Face page (https://huggingface.co/google/gemma-2b) list a similar accuracy (71.4%) for HellaSwag, but state that it was obtained 0-shot. Is there any way to replicate the 0-shot results listed there through lm-eval-harness or lighteval?

Thanks in advance
