cchristophe
committed
Update README.md
README.md CHANGED
@@ -16,9 +16,20 @@ Med42-v2 is a suite of open-access clinical large language models (LLM) instruct

## Key performance metrics:

- - Med42-v2 outperforms GPT-4.0 in …
- - Med42-v2 achieves a MedQA zero-shot performance of …
- - Med42-v2 …

## Limitations & Safe Use

@@ -65,7 +76,7 @@ You can use the 🤗 Transformers library `text-generation` pipeline to do infer

import transformers
import torch

- model_name_or_path = "m42-health/Llama3-Med42-70B"

pipeline = transformers.pipeline(
    "text-generation",
@@ -129,9 +140,9 @@ Med42-v2 improves performance on every clinical benchmark compared to our previo

|GPT-4.0<sup>†</sup>|-|87.00|69.50|78.90|84.05|
|MedGemini*|-|-|-|84.00|-|
|Med-PaLM-2(5-shot)*|-|87.77|71.30|79.70|-|
- |Med42 …
- |ClinicalCamel-70B …
- |GPT-3.5<sup>†</sup …

**For MedGemini, results are reported for MedQA without self-training and without search. We note that 0-shot performance is not reported for Med-PaLM 2. Further details can be found at [https://github.com/m42health/med42](https://github.com/m42health/med42)*.

@@ -139,7 +150,7 @@ Med42-v2 improves performance on every clinical benchmark compared to our previo

### Open-ended question generation

- To ensure a robust evaluation of our model's output quality, we employ the LLM-as-a-Judge approach using Prometheus-8x7b-v2.0. Our assessment uses 4,000 publicly accessible healthcare-related questions, generating responses from various models. We then use Prometheus to conduct pairwise comparisons of the answers. Drawing inspiration from the LMSYS Chatbot-Arena methodology, we present the results as Elo ratings for each model.

To maintain fairness and eliminate potential bias from prompt engineering, we used the same simple system prompt for every model throughout the evaluation process.

@@ -155,6 +166,20 @@ Which response is of higher overall quality in a medical context? Consider:

* Clarity: Is it professional, clear and easy to understand?
```

## Accessing Med42 and Reporting Issues


## Key performance metrics:

+ - Med42-v2-70B outperforms GPT-4.0 in most of the MCQA tasks.
+ - Med42-v2-70B achieves a MedQA zero-shot performance of 79.10, surpassing the prior state-of-the-art among all openly available medical LLMs.
+ - Med42-v2-70B sits at the top of the Clinical Elo Rating Leaderboard.
+
+ |Models|Elo Score|
+ |Med42-v2-70B| --- |
+ |Llama3-70B-Instruct| --- |
+ |GPT-4o| --- |
+ |Med42-v2-8B| --- |
+ |Llama3-8B-Instruct| --- |
+ |Mixtral-8x7b-Instruct| --- |
+ |OpenBioLLM-70B| --- |
+ |JSL-MedLlama-3-8B-v2.0| --- |
+

## Limitations & Safe Use


import transformers
import torch

+ model_name_or_path = "m42-health/Llama3-Med42-DPO-70B"

pipeline = transformers.pipeline(
    "text-generation",
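
The hunk above shows only the opening lines of the inference snippet. For orientation, a minimal self-contained sketch of the 🤗 Transformers `text-generation` pipeline call with this checkpoint could look like the following; the dtype, device placement, example messages, and generation settings are illustrative assumptions, not the exact values from the model card.

```python
import transformers
import torch

model_name_or_path = "m42-health/Llama3-Med42-DPO-70B"

# Build the text-generation pipeline; bfloat16 and automatic device placement
# are assumed defaults for a 70B model on GPU hardware.
pipeline = transformers.pipeline(
    "text-generation",
    model=model_name_or_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Chat-style input; this system prompt is a placeholder, not the one
# recommended in the model card.
messages = [
    {"role": "system", "content": "You are a helpful clinical assistant."},
    {"role": "user", "content": "What are the common symptoms of type 2 diabetes?"},
]

outputs = pipeline(messages, max_new_tokens=512, do_sample=False)
# generated_text holds the full conversation, with the model's reply appended
# as the final turn.
print(outputs[0]["generated_text"][-1]["content"])
```

In recent Transformers releases, passing chat-formatted messages like this lets the pipeline apply the model's chat template automatically; a plain string prompt also works if you format it yourself.
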

|GPT-4.0<sup>†</sup>|-|87.00|69.50|78.90|84.05|
|MedGemini*|-|-|-|84.00|-|
|Med-PaLM-2(5-shot)*|-|87.77|71.30|79.70|-|
+ |Med42|-|76.72|60.90|61.50|71.85|
+ |ClinicalCamel-70B|-|69.75|47.00|53.40|54.30|
+ |GPT-3.5<sup>†</sup>|-|66.63|50.10|50.80|53.00|

**For MedGemini, results are reported for MedQA without self-training and without search. We note that 0-shot performance is not reported for Med-PaLM 2. Further details can be found at [https://github.com/m42health/med42](https://github.com/m42health/med42)*.


### Open-ended question generation

+ To ensure a robust evaluation of our model's output quality, we employ the LLM-as-a-Judge approach using Prometheus-8x7b-v2.0. Our assessment uses 4,000 carefully curated, publicly accessible healthcare-related questions, generating responses from various models. We then use Prometheus to conduct pairwise comparisons of the answers. Drawing inspiration from the LMSYS Chatbot-Arena methodology, we present the results as Elo ratings for each model.

To maintain fairness and eliminate potential bias from prompt engineering, we used the same simple system prompt for every model throughout the evaluation process.


* Clarity: Is it professional, clear and easy to understand?
```

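As a rough illustration of how the pairwise judge verdicts described above can be turned into the Elo ratings reported below, here is a generic sequential Elo update; the K-factor, starting rating, and battle records are placeholders, not the project's actual evaluation script.

```python
# Generic Elo update over pairwise "battles" (one battle = one judge verdict).
K = 32                # assumed K-factor
START_RATING = 1000.0

def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update(ratings: dict, model_a: str, model_b: str, score_a: float) -> None:
    """score_a is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    ratings.setdefault(model_a, START_RATING)
    ratings.setdefault(model_b, START_RATING)
    e_a = expected_score(ratings[model_a], ratings[model_b])
    ratings[model_a] += K * (score_a - e_a)
    ratings[model_b] += K * ((1.0 - score_a) - (1.0 - e_a))

# Placeholder verdicts: (model_a, model_b, score for model_a).
battles = [
    ("model_A", "model_B", 1.0),
    ("model_A", "model_C", 0.5),
    ("model_B", "model_C", 0.0),
]

ratings: dict = {}
for a, b, score_a in battles:
    update(ratings, a, b, score_a)

for model, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {rating:.0f}")
```

Chatbot-Arena-style leaderboards typically average over many random orderings of the battles (or fit a Bradley-Terry model) to reduce order sensitivity; the sequential update above is only the core idea.
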
+ #### Elo Ratings
+ |Models|Elo Score|
+ |Med42-v2-70B| --- |
+ |Llama3-70B-Instruct| --- |
+ |GPT-4o| --- |
+ |Med42-v2-8B| --- |
+ |Llama3-8B-Instruct| --- |
+ |Mixtral-8x7b-Instruct| --- |
+ |OpenBioLLM-70B| --- |
+ |JSL-MedLlama-3-8B-v2.0| --- |
+
+ #### Win-rate
+
+ [Include Image]
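
The win-rate figure itself is not included in this diff. As a companion to the Elo sketch above, a pairwise win-rate table can be computed from the same judge verdicts; the battle records below are again placeholders.

```python
from collections import defaultdict

# Placeholder verdicts: (model_a, model_b, score for model_a), as in the Elo sketch.
battles = [
    ("model_A", "model_B", 1.0),
    ("model_A", "model_C", 0.5),
    ("model_B", "model_C", 0.0),
]

wins = defaultdict(float)   # fractional wins per (model, opponent) pair
games = defaultdict(int)    # number of battles per (model, opponent) pair

for a, b, score_a in battles:
    wins[(a, b)] += score_a
    wins[(b, a)] += 1.0 - score_a
    games[(a, b)] += 1
    games[(b, a)] += 1

# Win-rate of each model against each opponent it was compared with.
for (model, opponent), n in sorted(games.items()):
    print(f"{model} vs {opponent}: {wins[(model, opponent)] / n:.2%}")
```
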

## Accessing Med42 and Reporting Issues