cchristophe committed on
Commit
d0d3192
1 Parent(s): 9091600

Update README.md

Files changed (1):
- README.md (+33, -8)
README.md CHANGED
@@ -16,9 +16,20 @@ Med42-v2 is a suite of open-access clinical large language models (LLM) instruct

 ## Key performance metrics:

- - Med42-v2 outperforms GPT-4.0 in all clinically relevant tasks.
- - Med42-v2 achieves a MedQA zero-shot performance of 80.68, surpassing the prior state-of-the-art among all openly available medical LLMs.
- - Med42-v2 attains an 84.61% score on the USMLE (self-assessment and sample exam combined), marking the highest score achieved so far.
+ - Med42-v2-70B outperforms GPT-4.0 in most MCQA tasks.
+ - Med42-v2-70B achieves a MedQA zero-shot performance of 79.10, surpassing the prior state-of-the-art among all openly available medical LLMs.
+ - Med42-v2-70B sits at the top of the Clinical Elo Rating Leaderboard.
+
+ |Models|Elo Score|
+ |Med42-v2-70B| --- |
+ |Llama3-70B-Instruct| --- |
+ |GPT-4o| --- |
+ |Med42-v2-8B| --- |
+ |Llama3-8B-Instruct| --- |
+ |Mixtral-8x7b-Instruct| --- |
+ |OpenBioLLM-70B| --- |
+ |JSL-MedLlama-3-8B-v2.0| --- |
+

 ## Limitations & Safe Use

@@ -65,7 +76,7 @@ You can use the 🤗 Transformers library `text-generation` pipeline to do infer
 import transformers
 import torch

- model_name_or_path = "m42-health/Llama3-Med42-70B"
+ model_name_or_path = "m42-health/Llama3-Med42-DPO-70B"

 pipeline = transformers.pipeline(
     "text-generation",
@@ -129,9 +140,9 @@ Med42-v2 improves performance on every clinical benchmark compared to our previo
 |GPT-4.0<sup>&dagger;</sup>|-|87.00|69.50|78.90|84.05|
 |MedGemini*|-|-|-|84.00|-|
 |Med-PaLM-2(5-shot)*|-|87.77|71.30|79.70|-|
- |Med42|76.72|76.72|60.90|61.50|71.85|
- |ClinicalCamel-70B|69.75|69.75|47.00|53.40|54.30|
- |GPT-3.5<sup>&dagger;</sup>|66.63|66.63|50.10|50.80|53.00|
+ |Med42|-|76.72|60.90|61.50|71.85|
+ |ClinicalCamel-70B|-|69.75|47.00|53.40|54.30|
+ |GPT-3.5<sup>&dagger;</sup>|-|66.63|50.10|50.80|53.00|

 **For MedGemini, results are reported for MedQA without self-training and without search. We note that 0-shot performance is not reported for Med-PaLM 2. Further details can be found at [https://github.com/m42health/med42](https://github.com/m42health/med42)*.

@@ -139,7 +150,7 @@ Med42-v2 improves performance on every clinical benchmark compared to our previo

 ### Open-ended question generation

- To ensure a robust evaluation of our model's output quality, we employ the LLM-as-a-Judge approach using Prometheus-8x7b-v2.0. Our assessment uses 4,000 publicly accessible healthcare-related questions, generating responses from various models. We then use Prometheus to conduct pairwise comparisons of the answers. Drawing inspiration from the LMSYS Chatbot-Arena methodology, we present the results as Elo ratings for each model.
+ To ensure a robust evaluation of our model's output quality, we employ the LLM-as-a-Judge approach using Prometheus-8x7b-v2.0. Our assessment uses a carefully curated set of 4,000 publicly accessible healthcare-related questions, generating responses from various models. We then use Prometheus to conduct pairwise comparisons of the answers. Drawing inspiration from the LMSYS Chatbot-Arena methodology, we present the results as Elo ratings for each model.

 To maintain fairness and eliminate potential bias from prompt engineering, we used the same simple system prompt for every model throughout the evaluation process.
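
As background for the Elo Ratings table added below, here is a minimal sketch of how Elo scores can be derived from pairwise judge verdicts. The update rule is the standard sequential Elo formula with an assumed K-factor and starting score; the battle tuples are hypothetical examples, not actual Prometheus outputs.

```python
# Sketch: sequential Elo updates from pairwise judge verdicts (assumed constants).
from collections import defaultdict

def elo_ratings(battles, k=4.0, base=1000.0):
    """battles: iterable of (model_a, model_b, winner), winner in {"a", "b", "tie"}."""
    ratings = defaultdict(lambda: base)
    for model_a, model_b, winner in battles:
        ra, rb = ratings[model_a], ratings[model_b]
        # Expected score of model_a under the Elo logistic model.
        expected_a = 1.0 / (1.0 + 10.0 ** ((rb - ra) / 400.0))
        score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
        # Symmetric update: what one model gains, the other loses.
        ratings[model_a] = ra + k * (score_a - expected_a)
        ratings[model_b] = rb + k * ((1.0 - score_a) - (1.0 - expected_a))
    return dict(ratings)

# Hypothetical verdicts from pairwise Prometheus comparisons (illustrative only).
battles = [
    ("Med42-v2-70B", "Llama3-70B-Instruct", "a"),
    ("Med42-v2-70B", "GPT-4o", "tie"),
    ("Med42-v2-8B", "Llama3-8B-Instruct", "a"),
]
print(elo_ratings(battles))
```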
 
@@ -155,6 +166,20 @@ Which response is of higher overall quality in a medical context? Consider:
 * Clarity: Is it professional, clear and easy to understand?
 ```

+ #### Elo Ratings
+ |Models|Elo Score|
+ |Med42-v2-70B| --- |
+ |Llama3-70B-Instruct| --- |
+ |GPT-4o| --- |
+ |Med42-v2-8B| --- |
+ |Llama3-8B-Instruct| --- |
+ |Mixtral-8x7b-Instruct| --- |
+ |OpenBioLLM-70B| --- |
+ |JSL-MedLlama-3-8B-v2.0| --- |
+
+ #### Win-rate
+
+ [Include Image]


 ## Accessing Med42 and Reporting Issues