cchristophe committed
Commit: fbca64b
Parent(s): e234470
Update README.md
README.md
CHANGED
@@ -163,6 +163,7 @@ Which response is of higher overall quality in a medical context? Consider:
 
 [Include Image]
 
+
 ### MCQA Evaluation
 
 Med42-v2 improves performance on every clinical benchmark compared to our previous version, including MedQA, MedMCQA, USMLE, MMLU clinical topics and MMLU Pro clinical subset. For all evaluations reported so far, we use [EleutherAI's evaluation harness library](https://github.com/EleutherAI/lm-evaluation-harness) and report zero-shot accuracies (except otherwise stated). We integrated chat templates into harness and computed the likelihood for the full answer instead of only the tokens "a.", "b.", "c." or "d.".
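
Note on the scoring choice mentioned in the README paragraph: ranking options by the likelihood of the full answer text, rather than only the option letter, can be sketched roughly as below. This is a minimal illustration and not the Med42 or lm-evaluation-harness code. `Scorer` is a hypothetical stand-in for a model call that returns the summed log-probability of a continuation given a prompt, and `toy_score`, `PROMPT` and `OPTIONS` are invented purely so the example runs without a model.

```python
from typing import Callable, Sequence

# Hypothetical interface: log P(continuation | prompt), summed over tokens.
Scorer = Callable[[str, str], float]

LETTERS = "abcd"


def pick_answer(prompt: str, options: Sequence[str], score: Scorer,
                full_answer: bool = True) -> int:
    """Return the index of the option whose continuation scores highest."""
    continuations = []
    for i, text in enumerate(options):
        if full_answer:
            # Score the letter plus the complete answer text.
            continuations.append(f" {LETTERS[i]}. {text}")
        else:
            # Score only the option letter, e.g. " b."
            continuations.append(f" {LETTERS[i]}.")
    scores = [score(prompt, c) for c in continuations]
    # On ties, the first highest-scoring option wins.
    return max(range(len(scores)), key=scores.__getitem__)


def toy_score(prompt: str, continuation: str) -> float:
    """Made-up scorer for the demo: penalises length, rewards one keyword."""
    bonus = 100.0 if "metformin" in continuation else 0.0
    return bonus - len(continuation)


PROMPT = "First-line oral drug for type 2 diabetes?\nAnswer:"
OPTIONS = ["insulin", "metformin", "aspirin", "warfarin"]

full_choice = pick_answer(PROMPT, OPTIONS, toy_score, full_answer=True)
letter_choice = pick_answer(PROMPT, OPTIONS, toy_score, full_answer=False)
```

With the toy scorer, the letter-only continuations are indistinguishable, while the full-answer continuations expose the answer text to the scorer. A real setup would replace `toy_score` with model log-likelihoods computed on the chat-templated prompt.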