aryopg committed on
Commit
d820a9b
1 Parent(s): 310805a

Xuanli's update: Add reproducibility section

Files changed (1)
  1. src/display/about.py +37 -3
src/display/about.py CHANGED
@@ -42,9 +42,43 @@ For all these evaluations, a higher score is a better score.
42
  - You can find details on the input/outputs for the models in the `details` of each model, that you can access by clicking the 📄 emoji after the model name
43
 
44
  # Reproducibility
45
- Hyperparameters: XXX
46
- Device(s): XXX
47
- Metrics: XXX
48
  """
49
 
50
  FAQ_TEXT = """
 
42
  - You can find details on the input/outputs for the models in the `details` of each model, that you can access by clicking the 📄 emoji after the model name
43
 
44
  # Reproducibility
45
+ To reproduce our results, run `python backend-cli.py`, using [this script](https://huggingface.co/spaces/hallucinations-leaderboard/leaderboard/blob/main/backend-cli.py).
46
+
47
+ Alternatively, to evaluate a specific task with a particular model, you can use [this version](https://github.com/EleutherAI/lm-evaluation-harness/tree/b281b0921b636bc36ad05c0b0b0763bd6dd43463) of the EleutherAI LM Evaluation Harness:
48
+ `python main.py --model=hf-causal-experimental --model_args="pretrained=<your_model>,revision=<your_model_revision>"`
49
+ ` --tasks=<task_list> --num_fewshot=<n_few_shot> --batch_size=1 --output_path=<output_path>` (Note that you may need to copy the task definitions from [here](https://huggingface.co/spaces/hallucinations-leaderboard/leaderboard/tree/main/src/backend/tasks) into [this folder](https://github.com/EleutherAI/lm-evaluation-harness/tree/b281b0921b636bc36ad05c0b0b0763bd6dd43463/lm_eval/tasks).)
50
+
51
+ For models that fit on one A100 node, the total batch size is 8 (8 GPUs × a batch size of 1 per GPU). If you don't use parallelism, adapt your batch size to fit. Expect results to vary slightly across batch sizes because of padding.
52
+
53
+
54
+ The tasks and few-shot settings are:
55
+ - NQ Open: 64-shot, *nq_open* (`exact_match`)
56
+ - NQ Open 8: 8-shot, *nq8* (`exact_match`)
57
+ - TriviaQA: 64-shot, *triviaqa* (`exact_match`)
58
+ - TriviaQA 8: 8-shot, *tqa8* (`exact_match`)
59
+ - TruthfulQA MC1: 0-shot, *truthfulqa_mc1* (`acc`)
60
+ - TruthfulQA MC2: 0-shot, *truthfulqa_mc2* (`acc`)
61
+ - HaluEval QA: 0-shot, *halueval_qa* (`em`)
62
+ - HaluEval Summ: 0-shot, *halueval_summarization* (`em`)
63
+ - HaluEval Dial: 0-shot, *halueval_dialogue* (`em`)
64
+ - XSum: 2-shot, *xsum* (`rougeLsum`)
65
+ - CNN/DM: 2-shot, *cnndm* (`rougeLsum`)
66
+ - MemoTrap: 0-shot, *memo-trap* (`acc`)
67
+ - IFEval: 0-shot, *ifeval* (`prompt_level_strict_acc`)
68
+ - SelfCheckGPT: 0-shot, *selfcheckgpt* (``)
69
+ - FEVER: 16-shot, *fever10* (`acc`)
70
+ - SQuADv2: 4-shot, *squadv2* (`squad_v2`)
71
+ - TrueFalse: 8-shot, *truefalse_cieacf* (`acc`)
72
+ - FaithDial: 8-shot, *faithdial_hallu* (`acc`)
73
+ - RACE: 0-shot, *race* (`acc`)
74
+
75
+ ## Icons
76
+ - {ModelType.PT.to_str(" : ")} model: new, base models, trained on a given corpus
77
+ - {ModelType.FT.to_str(" : ")} model: pretrained models finetuned on more data
78
+ Specific fine-tune subcategories (more adapted to chat):
79
+ - {ModelType.IFT.to_str(" : ")} model: instruction fine-tunes, which are models fine-tuned specifically on datasets of task instructions
80
+ - {ModelType.RL.to_str(" : ")} model: reinforcement fine-tunes, which usually change the model loss a bit with an added policy.
81
+ If there is no icon, we have not uploaded the information on the model yet; feel free to open an issue with the model information!
82
  """
83
 
84
  FAQ_TEXT = """
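
Review note: the per-task few-shot settings added in this commit pair naturally with the harness command shown in the same section. The sketch below is one hedged way to assemble that command for a single task. The mapping is copied from the list in the diff; the helper name `build_harness_command` and its parameters are hypothetical and not part of the leaderboard code, and the flags are only those quoted in the added text.

```python
import shlex

# Task name -> number of few-shot examples, copied from the list in the diff.
FEWSHOT_BY_TASK = {
    "nq_open": 64,
    "nq8": 8,
    "triviaqa": 64,
    "tqa8": 8,
    "truthfulqa_mc1": 0,
    "truthfulqa_mc2": 0,
    "halueval_qa": 0,
    "halueval_summarization": 0,
    "halueval_dialogue": 0,
    "xsum": 2,
    "cnndm": 2,
    "memo-trap": 0,
    "ifeval": 0,
    "selfcheckgpt": 0,
    "fever10": 16,
    "squadv2": 4,
    "truefalse_cieacf": 8,
    "faithdial_hallu": 8,
    "race": 0,
}


def build_harness_command(task, model, revision, output_path):
    """Hypothetical helper: return the argv list for one harness run.

    Uses only the flags quoted in the reproducibility text; raises
    KeyError if the task is not in FEWSHOT_BY_TASK.
    """
    num_fewshot = FEWSHOT_BY_TASK[task]
    return [
        "python",
        "main.py",
        "--model=hf-causal-experimental",
        f"--model_args=pretrained={model},revision={revision}",
        f"--tasks={task}",
        f"--num_fewshot={num_fewshot}",
        "--batch_size=1",
        f"--output_path={output_path}",
    ]


if __name__ == "__main__":
    # Placeholder values; substitute a real model, revision, and path.
    cmd = build_harness_command(
        "nq_open", "<your_model>", "<your_model_revision>", "<output_path>"
    )
    # shlex.join quotes each argument so the line can be pasted into a shell.
    print(shlex.join(cmd))
```

Returning an argv list rather than a single string avoids shell-quoting problems if the command is later passed to `subprocess.run`.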