# Evaluation Guidelines
We provide detailed instructions for evaluation.
To run our evaluation scripts, please make sure the structure of your model outputs matches ours.

We provide two options:
1. Evaluation only: parse the responses yourself and provide a single file with all the final predictions.
2. Parse and evaluation: leave the raw responses to us, formatted as shown below.

## Evaluation Only
If you want to use your own parsing logic and *only provide the final answer*, you can use `main_eval_only.py`.

You can provide all the outputs in *one file* in the following format:

```
{
    "validation_Accounting_1": "D", # strictly "A", "B", "C", "D" for multi-choice question
    "validation_Architecture_and_Engineering_14": "0.0", # any string response for open question.
    ...
}
```
Then run `main_eval_only.py` with:
```
python main_eval_only.py --output_path ./example_outputs/llava1.5_13b/total_val_output.json
```

Please refer to [example output](https://github.com/MMMU-Benchmark/MMMU/blob/main/eval/example_outputs/llava1.5_13b/total_val_output.json) for the exact prediction file format.
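
If you build this file yourself, a minimal sketch along the following lines (here `my_predictions` is a placeholder for however your own parsing pipeline stores its final answers, not part of our scripts) produces the expected structure:

```python
import json

# Placeholder: replace with your own parsed predictions.
# Keys are MMMU question ids; values are the final answer strings.
my_predictions = {
    "validation_Accounting_1": "D",                       # multiple-choice: strictly "A"-"D"
    "validation_Architecture_and_Engineering_14": "0.0",  # open question: any string
}

# Write all predictions into a single JSON file for main_eval_only.py.
with open("total_val_output.json", "w") as f:
    json.dump(my_predictions, f, indent=4)
```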


## Parse and Evaluation
You can also provide the raw responses and run `main_parse_and_eval.py` to use our answer parsing and evaluation pipeline, as follows:

### Output folder structure

```
└── model_name
    ├── category_name (e.g., Accounting)
    │   └── output.json
    ├── category_name (e.g., Electronics)
    │   └── output.json
    ...
```

### Output file
Each `output.json` contains a list of dicts, one per instance to be evaluated.
```
[
    {
        "id": "validation_Electronics_28",
        "question_type": "multiple-choice",
        "answer": "A", # given answer
        "all_choices": [ # create using `get_multi_choice_info` in 
            "A",
            "B",
            "C",
            "D"
        ],
        "index2ans": { # create using `get_multi_choice_info` in 
            "A": "75 + 13.3 cos(250t - 57.7Β°)V",
            "B": "75 + 23.3 cos(250t - 57.7Β°)V",
            "C": "45 + 3.3 cos(250t - 57.7Β°)V",
            "D": "95 + 13.3 cos(250t - 57.7Β°)V"
        },
        "response": "B" # model response
    },
    {
        "id": "validation_Electronics_29",
        "question_type": "short-answer",
        "answer": "30", # given answer
        "response": "36 watts" # model response
    },
    ...
]
```
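
As a rough sketch of how such files could be produced (assuming you keep your instances grouped by category in memory; `MODEL_NAME` and `responses_by_category` below are illustrative placeholders, not part of our scripts), you could write one `output.json` per category like this:

```python
import json
import os

MODEL_NAME = "my_model"  # hypothetical output folder name

# Hypothetical in-memory structure: category -> list of instances in the
# format shown above (id, question_type, answer, all_choices, index2ans, response).
responses_by_category = {
    "Electronics": [
        {
            "id": "validation_Electronics_28",
            "question_type": "multiple-choice",
            "answer": "A",
            "all_choices": ["A", "B", "C", "D"],
            "index2ans": {"A": "...", "B": "...", "C": "...", "D": "..."},
            "response": "B",
        },
    ],
}

# Write one output.json per category, matching the folder structure above.
for category, instances in responses_by_category.items():
    out_dir = os.path.join(MODEL_NAME, category)
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, "output.json"), "w") as f:
        json.dump(instances, f, indent=4)
```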

### Evaluation
```
python main_parse_and_eval.py --path ./example_outputs/llava1.5_13b --subject ALL # all subjects

# OR you can specify one subject for the evaluation

python main_parse_and_eval.py --path ./example_outputs/llava1.5_13b --subject elec # short name for Electronics; use --help for all short names

```

`main_parse_and_eval.py` will generate `parsed_output.json` and `result.json` in each category subfolder, next to the corresponding `output.json`:

```
├── Accounting
│   ├── output.json
│   ├── parsed_output.json
│   └── result.json
└── Electronics
    ├── output.json
    ├── parsed_output.json
    └── result.json
...
```

### Print Results
You can also print the results locally (run `pip install tabulate` first if you haven't).
```
python print_results.py --path ./example_outputs/llava1.5_13b
# Results may differ slightly due to the random selection applied to failed responses
```



## Run LLaVA
If you want to reproduce the results of some models, see `run_llava.py` as an example.

Set up the environment for LLaVA with the following steps:

Step 1:
```
git clone https://github.com/haotian-liu/LLaVA.git
cd LLaVA
```
Step 2:
```
conda create -n llava python=3.10 -y
conda activate llava
pip install --upgrade pip  # enable PEP 660 support
git fetch --tags  
git checkout tags/v1.1.3  # the version used when running MMMU
pip install -e .
```

The above installs LLaVA (1.5 only) together with the `transformers` version that is correct for running MMMU.
Then, after installing the `datasets` package from Hugging Face (i.e., `pip install datasets`), you can run LLaVA with the following command:

```
CUDA_VISIBLE_DEVICES=0 nohup python run_llava.py \
--output_path example_outputs/llava1.5_13b_val.json \
--model_path liuhaotian/llava-v1.5-13b \
--config_path configs/llava1.5.yaml
```

Then you can evaluate the results with the Evaluation Only pipeline described above.