cchristophe committed on
Commit
d0d3192
1 Parent(s): 9091600

Update README.md

Files changed (1):
- README.md (+33, -8)
README.md CHANGED
@@ -16,9 +16,20 @@ Med42-v2 is a suite of open-access clinical large language models (LLM) instruct

 ## Key performance metrics:

- - Med42-v2 outperforms GPT-4.0 in all clinically relevant tasks.
- - Med42-v2 achieves a MedQA zero-shot performance of 80.68, surpassing the prior state-of-the-art among all openly available medical LLMs.
- - Med42-v2 attains an 84.61% score on the USMLE (self-assessment and sample exam combined), marking the highest score achieved so far.
+ - Med42-v2-70B outperforms GPT-4.0 in most MCQA tasks.
+ - Med42-v2-70B achieves a MedQA zero-shot performance of 79.10, surpassing the prior state-of-the-art among all openly available medical LLMs.
+ - Med42-v2-70B sits at the top of the Clinical Elo Rating Leaderboard.
+
+ |Models|Elo Score|
+ |Med42-v2-70B| --- |
+ |Llama3-70B-Instruct| --- |
+ |GPT-4o| --- |
+ |Med42-v2-8B| --- |
+ |Llama3-8B-Instruct| --- |
+ |Mixtral-8x7b-Instruct| --- |
+ |OpenBioLLM-70B| --- |
+ |JSL-MedLlama-3-8B-v2.0| --- |
+

 ## Limitations & Safe Use

@@ -65,7 +76,7 @@ You can use the 🤗 Transformers library `text-generation` pipeline to do infer
 import transformers
 import torch

- model_name_or_path = "m42-health/Llama3-Med42-70B"
+ model_name_or_path = "m42-health/Llama3-Med42-DPO-70B"

 pipeline = transformers.pipeline(
     "text-generation",
@@ -129,9 +140,9 @@ Med42-v2 improves performance on every clinical benchmark compared to our previo
 |GPT-4.0<sup>&dagger;</sup>|-|87.00|69.50|78.90|84.05|
 |MedGemini*|-|-|-|84.00|-|
 |Med-PaLM-2(5-shot)*|-|87.77|71.30|79.70|-|
- |Med42|76.72|76.72|60.90|61.50|71.85|
- |ClinicalCamel-70B|69.75|69.75|47.00|53.40|54.30|
- |GPT-3.5<sup>&dagger;</sup>|66.63|66.63|50.10|50.80|53.00|
+ |Med42|-|76.72|60.90|61.50|71.85|
+ |ClinicalCamel-70B|-|69.75|47.00|53.40|54.30|
+ |GPT-3.5<sup>&dagger;</sup>|-|66.63|50.10|50.80|53.00|

 **For MedGemini, results are reported for MedQA without self-training and without search. We note that 0-shot performance is not reported for Med-PaLM 2. Further details can be found at [https://github.com/m42health/med42](https://github.com/m42health/med42)*.

@@ -139,7 +150,7 @@ Med42-v2 improves performance on every clinical benchmark compared to our previo

 ### Open-ended question generation

- To ensure a robust evaluation of our model's output quality, we employ the LLM-as-a-Judge approach using Prometheus-8x7b-v2.0. Our assessment uses 4,000 publicly accessible healthcare-related questions, generating responses from various models. We then use Prometheus to conduct pairwise comparisons of the answers. Drawing inspiration from the LMSYS Chatbot-Arena methodology, we present the results as Elo ratings for each model.
+ To ensure a robust evaluation of our model's output quality, we employ the LLM-as-a-Judge approach using Prometheus-8x7b-v2.0. Our assessment uses a carefully curated set of 4,000 publicly accessible healthcare-related questions, generating responses from various models. We then use Prometheus to conduct pairwise comparisons of the answers. Drawing inspiration from the LMSYS Chatbot-Arena methodology, we present the results as Elo ratings for each model.

 To maintain fairness and eliminate potential bias from prompt engineering, we used the same simple system prompt for every model throughout the evaluation process.
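
As background for the Elo Ratings table added below, here is a minimal sketch of how Elo scores can be derived from pairwise judge verdicts. The update rule is the standard sequential Elo formula with an assumed K-factor and starting score; the battle tuples are hypothetical examples, not actual Prometheus outputs.

```python
# Sketch: sequential Elo updates from pairwise judge verdicts (assumed constants).
from collections import defaultdict

def elo_ratings(battles, k=4.0, base=1000.0):
    """battles: iterable of (model_a, model_b, winner), winner in {"a", "b", "tie"}."""
    ratings = defaultdict(lambda: base)
    for model_a, model_b, winner in battles:
        ra, rb = ratings[model_a], ratings[model_b]
        # Expected score of model_a under the Elo logistic model.
        expected_a = 1.0 / (1.0 + 10.0 ** ((rb - ra) / 400.0))
        score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
        # Symmetric update: what one model gains, the other loses.
        ratings[model_a] = ra + k * (score_a - expected_a)
        ratings[model_b] = rb + k * ((1.0 - score_a) - (1.0 - expected_a))
    return dict(ratings)

# Hypothetical verdicts from pairwise Prometheus comparisons (illustrative only).
battles = [
    ("Med42-v2-70B", "Llama3-70B-Instruct", "a"),
    ("Med42-v2-70B", "GPT-4o", "tie"),
    ("Med42-v2-8B", "Llama3-8B-Instruct", "a"),
]
print(elo_ratings(battles))
```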
 
@@ -155,6 +166,20 @@ Which response is of higher overall quality in a medical context? Consider:
 * Clarity: Is it professional, clear and easy to understand?
 ```

+ #### Elo Ratings
+ |Models|Elo Score|
+ |Med42-v2-70B| --- |
+ |Llama3-70B-Instruct| --- |
+ |GPT-4o| --- |
+ |Med42-v2-8B| --- |
+ |Llama3-8B-Instruct| --- |
+ |Mixtral-8x7b-Instruct| --- |
+ |OpenBioLLM-70B| --- |
+ |JSL-MedLlama-3-8B-v2.0| --- |
+
+ #### Win-rate
+
+ [Include Image]


 ## Accessing Med42 and Reporting Issues