cchristophe committed • Commit 0b9cc1b • Parent(s): 89bd376

Update README.md

README.md CHANGED
@@ -22,14 +22,14 @@ Med42-v2 is a suite of open-access clinical large language models (LLM) instruct

|Models|Elo Score|
|:---:|:---:|
-|Med42-v2-70B| |
-|Llama3-70B-Instruct| |
-|GPT4-o| |
-| | |
-| | |
-| | |
-|OpenBioLLM-70B| |
-|JSL-MedLlama-3-8B-v2.0| |
+|Med42-v2-70B| 1764 |
+|Llama3-70B-Instruct| 1643 |
+|GPT4-o| 1426 |
+|Llama3-8B-Instruct| 1352 |
+|Mixtral-8x7b-Instruct| 970 |
+|Med42-v2-8B| 924 |
+|OpenBioLLM-70B| 657 |
+|JSL-MedLlama-3-8B-v2.0| 447 |

## Limitations & Safe Use
@@ -129,26 +129,6 @@ The training was conducted on the NVIDIA DGX cluster with H100 GPUs, utilizing P

## Evaluation Results

-### MCQA Evaluation
-
-Med42-v2 improves performance on every clinical benchmark compared to our previous version, including MedQA, MedMCQA, USMLE, MMLU clinical topics, and the MMLU Pro clinical subset. For all evaluations reported here, we use [EleutherAI's evaluation harness library](https://github.com/EleutherAI/lm-evaluation-harness) and report zero-shot accuracies (unless otherwise stated). We integrated chat templates into the harness and computed the likelihood of the full answer rather than of only the tokens "a.", "b.", "c.", or "d.".
-
-|Model|MMLU Pro|MMLU|MedMCQA|MedQA|USMLE|
-|---:|:---:|:---:|:---:|:---:|:---:|
-|Med42v2-70B|64.36|87.12|73.20|79.10|83.80|
-|Med42v2-8B|54.30|75.76|61.34|62.84|67.04|
-|OpenBioLLM|64.24|90.40|73.18|76.90|79.01|
-|GPT-4.0<sup>†</sup>|-|87.00|69.50|78.90|84.05|
-|MedGemini*|-|-|-|84.00|-|
-|Med-PaLM-2 (5-shot)*|-|87.77|71.30|79.70|-|
-|Med42|-|76.72|60.90|61.50|71.85|
-|ClinicalCamel-70B|-|69.75|47.00|53.40|54.30|
-|GPT-3.5<sup>†</sup>|-|66.63|50.10|50.80|53.00|
-
-**For MedGemini, results are reported for MedQA without self-training and without search. We note that 0-shot performance is not reported for Med-PaLM 2. Further details can be found at [https://github.com/m42health/med42](https://github.com/m42health/med42)*.
-
-<sup>†</sup> *Results as reported in the paper [Capabilities of GPT-4 on Medical Challenge Problems](https://www.microsoft.com/en-us/research/uploads/prod/2023/03/GPT-4_medical_benchmarks.pdf)*.
-
### Open-ended question generation

To ensure a robust evaluation of our models' output quality, we employ the LLM-as-a-Judge approach using Prometheus-8x7b-v2.0. Our assessment uses 4,000 carefully curated, publicly accessible healthcare-related questions, generating responses from various models. We then use Prometheus to conduct pairwise comparisons of the answers. Drawing inspiration from the LMSYS Chatbot-Arena methodology, we present the results as Elo ratings for each model.
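As a rough illustration of the last step, the sketch below converts pairwise judge verdicts into Elo scores using the standard Elo update rule. The K-factor, model names, and win records are hypothetical placeholders, not Med42's actual evaluation data or pipeline; Chatbot-Arena-style leaderboards also typically use a more robust estimator (e.g., a Bradley-Terry fit) rather than sequential updates.

```python
# Illustrative sketch only: turning pairwise judge verdicts into Elo scores.
# The K-factor and the win records below are made-up placeholders.

def expected_score(r_a: float, r_b: float) -> float:
    """Expected win probability of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(ratings: dict, a: str, b: str, winner: str, k: float = 32.0) -> None:
    """Update two models' ratings in place after one pairwise comparison."""
    e_a = expected_score(ratings[a], ratings[b])
    s_a = 1.0 if winner == a else 0.0
    ratings[a] += k * (s_a - e_a)
    ratings[b] += k * ((1.0 - s_a) - (1.0 - e_a))

ratings = {"model-x": 1000.0, "model-y": 1000.0}
# Each tuple is (model_a, model_b, judge-preferred model).
for a, b, winner in [("model-x", "model-y", "model-x"),
                     ("model-x", "model-y", "model-x"),
                     ("model-x", "model-y", "model-y")]:
    update_elo(ratings, a, b, winner)
print(ratings)
```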
@@ -170,19 +150,41 @@ Which response is of higher overall quality in a medical context? Consider:

#### Elo Ratings
|Models|Elo Score|
|:---:|:---:|
-|Med42-v2-70B| |
-|Llama3-70B-Instruct| |
-|GPT4-o| |
-| | |
-| | |
-| | |
-|OpenBioLLM-70B| |
-|JSL-MedLlama-3-8B-v2.0| |
+|Med42-v2-70B| 1764 |
+|Llama3-70B-Instruct| 1643 |
+|GPT4-o| 1426 |
+|Llama3-8B-Instruct| 1352 |
+|Mixtral-8x7b-Instruct| 970 |
+|Med42-v2-8B| 924 |
+|OpenBioLLM-70B| 657 |
+|JSL-MedLlama-3-8B-v2.0| 447 |

#### Win-rate

[Include Image]
+
+### MCQA Evaluation
+
+Med42-v2 improves performance on every clinical benchmark compared to our previous version, including MedQA, MedMCQA, USMLE, MMLU clinical topics, and the MMLU Pro clinical subset. For all evaluations reported here, we use [EleutherAI's evaluation harness library](https://github.com/EleutherAI/lm-evaluation-harness) and report zero-shot accuracies (unless otherwise stated). We integrated chat templates into the harness and computed the likelihood of the full answer rather than of only the tokens "a.", "b.", "c.", or "d.".
+
+|Model|MMLU Pro|MMLU|MedMCQA|MedQA|USMLE|
+|---:|:---:|:---:|:---:|:---:|:---:|
+|Med42v2-70B|64.36|87.12|73.20|79.10|83.80|
+|Med42v2-8B|54.30|75.76|61.34|62.84|67.04|
+|OpenBioLLM|64.24|90.40|73.18|76.90|79.01|
+|GPT-4.0<sup>†</sup>|-|87.00|69.50|78.90|84.05|
+|MedGemini*|-|-|-|84.00|-|
+|Med-PaLM-2 (5-shot)*|-|87.77|71.30|79.70|-|
+|Med42|-|76.72|60.90|61.50|71.85|
+|ClinicalCamel-70B|-|69.75|47.00|53.40|54.30|
+|GPT-3.5<sup>†</sup>|-|66.63|50.10|50.80|53.00|
+|Llama3-8B-Instruct|-|-|-|-|-|
+|Llama3-70B-Instruct|-|-|-|-|-|
+
+**For MedGemini, results are reported for MedQA without self-training and without search. We note that 0-shot performance is not reported for Med-PaLM 2. Further details can be found at [https://github.com/m42health/med42](https://github.com/m42health/med42)*.
+
+<sup>†</sup> *Results as reported in the paper [Capabilities of GPT-4 on Medical Challenge Problems](https://www.microsoft.com/en-us/research/uploads/prod/2023/03/GPT-4_medical_benchmarks.pdf)*.
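To make the full-answer scoring scheme concrete, here is a minimal sketch using Hugging Face `transformers`. It is an assumption-laden illustration, not the harness's actual implementation: the model name, chat-template usage, and sample question are placeholders, and `lm-evaluation-harness` handles tokenization boundaries, special tokens, and batching far more carefully.

```python
# Sketch: rank each MCQA option by the log-likelihood of the complete answer
# text (not just the letter "a."/"b."), conditioned on a chat-templated prompt.
# Model name and question are placeholders; tokenizer edge cases are ignored.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "m42-health/Llama3-Med42-8B"  # assumption: any causal chat LLM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def answer_loglikelihood(question: str, answer: str) -> float:
    """Sum of log-probs of the answer tokens given the chat-formatted prompt."""
    prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": question}],
        tokenize=False, add_generation_prompt=True,
    )
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + answer, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # The token at position i is predicted by the logits at position i - 1.
    log_probs = torch.log_softmax(logits[0, :-1].float(), dim=-1)
    return sum(
        log_probs[i - 1, full_ids[0, i]].item()
        for i in range(prompt_len, full_ids.shape[1])
    )

question = "Which vitamin deficiency causes scurvy?"
options = ["a. Vitamin A", "b. Vitamin B12", "c. Vitamin C", "d. Vitamin D"]
print(max(options, key=lambda o: answer_loglikelihood(question, o)))
```

Scoring the full option text rather than the bare letter avoids penalizing models that are well calibrated on content but not on single-letter continuations.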
## Accessing Med42 and Reporting Issues