cchristophe committed on
Commit
9091600
1 Parent(s): 79393ce

Update README.md

  1. README.md +34 -11
README.md CHANGED
@@ -14,6 +14,20 @@ license_name: llama3
14
  # **Med42-v2 - Clinical Large Language Models**
15
  Med42-v2 is a suite of open-access clinical large language models (LLMs), instruction- and preference-tuned by M42 to expand access to medical knowledge. Built on LLaMA-3 and comprising either 8 or 70 billion parameters, these generative AI systems provide high-quality answers to medical questions.
16
 
17
  ## Model Details
18
 
19
  *Disclaimer: This large language model is not yet ready for clinical use without further testing and validation. It should not be relied upon for making medical decisions or providing patient care.*
@@ -103,12 +117,14 @@ The training was conducted on the NVIDIA DGX cluster with H100 GPUs, utilizing P
103
 
104
  ## Evaluation Results
105
 
106
  Med42-v2 improves performance on every clinical benchmark compared to our previous version, including MedQA, MedMCQA, USMLE, MMLU clinical topics and MMLU Pro clinical subset. For all evaluations reported so far, we use [EleutherAI's evaluation harness library](https://github.com/EleutherAI/lm-evaluation-harness) and report zero-shot accuracies (unless otherwise stated). We integrated chat templates into the harness and computed the likelihood of the full answer instead of only the tokens "a.", "b.", "c." or "d.".
107
 
108
  |Model|MMLU Pro|MMLU|MedMCQA|MedQA|USMLE|
109
  |---:|:---:|:---:|:---:|:---:|:---:|
110
- |Med42v2-70B|64.97|88.16|73.82|80.68|84.61|
111
- |Med42v2-8B|55.15|77.11|61.82|63.71|68.93|
112
  |OpenBioLLM|64.24|90.40|73.18|76.90|79.01|
113
  |GPT-4.0<sup>&dagger;</sup>|-|87.00|69.50|78.90|84.05|
114
  |MedGemini*|-|-|-|84.00|-|
@@ -121,18 +137,25 @@ Med42-v2 improves performance on every clinical benchmark compared to our previo
121
 
122
  <sup>&dagger;</sup> *Results as reported in the paper [Capabilities of GPT-4 on Medical Challenge Problems](https://www.microsoft.com/en-us/research/uploads/prod/2023/03/GPT-4_medical_benchmarks.pdf)*.
123
 
124
- ### Key performance metrics:
125
 
126
- - Med42-v2 outperforms GPT-4.0 in all clinically relevant tasks.
127
- - Med42-v2 achieves a MedQA zero-shot performance of 80.68, surpassing the prior state-of-the-art among all openly available medical LLMs.
128
- - Med42-v2 attains an 84.61% score on the USMLE (self-assessment and sample exam combined), marking the highest score achieved so far.
129
 
130
- ## Limitations & Safe Use
131
- - Med42-v2 suite of models are not ready for real clinical use. Extensive human evaluation is undergoing as it is required to ensure safety.
132
- - Potential for generating incorrect or harmful information.
133
- - Risk of perpetuating biases in training data.
134
 
135
- Use these suite of models responsibly! Do not rely on them for medical usage without rigorous safety testing.
136
 
137
  ## Accessing Med42 and Reporting Issues
138
 
 
14
  # **Med42-v2 - Clinical Large Language Models**
15
  Med42-v2 is a suite of open-access clinical large language models (LLMs), instruction- and preference-tuned by M42 to expand access to medical knowledge. Built on LLaMA-3 and comprising either 8 or 70 billion parameters, these generative AI systems provide high-quality answers to medical questions.
16
 
17
+ ## Key performance metrics:
18
+
19
+ - Med42-v2 outperforms GPT-4.0 in all clinically relevant tasks.
20
+ - Med42-v2 achieves a MedQA zero-shot performance of 80.68, surpassing the prior state-of-the-art among all openly available medical LLMs.
21
+ - Med42-v2 attains an 84.61% score on the USMLE (self-assessment and sample exam combined), marking the highest score achieved so far.
22
+
23
+ ## Limitations & Safe Use
24
+
25
+ - Med42-v2 suite of models are not ready for real clinical use. Extensive human evaluation is undergoing as it is required to ensure safety.
26
+ - Potential for generating incorrect or harmful information.
27
+ - Risk of perpetuating biases in training data.
28
+
29
+ Use this suite of models responsibly! Do not rely on them for medical usage without rigorous safety testing.
30
+
31
  ## Model Details
32
 
33
  *Disclaimer: This large language model is not yet ready for clinical use without further testing and validation. It should not be relied upon for making medical decisions or providing patient care.*
 
117
 
118
  ## Evaluation Results
119
 
120
+ ### MCQA Evaluation
121
+
122
  Med42-v2 improves performance on every clinical benchmark compared to our previous version, including MedQA, MedMCQA, USMLE, MMLU clinical topics and MMLU Pro clinical subset. For all evaluations reported so far, we use [EleutherAI's evaluation harness library](https://github.com/EleutherAI/lm-evaluation-harness) and report zero-shot accuracies (unless otherwise stated). We integrated chat templates into the harness and computed the likelihood of the full answer instead of only the tokens "a.", "b.", "c." or "d.".
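
As a rough illustration of the full-answer scoring described above, the sketch below picks the choice whose complete text has the highest summed log-likelihood. It is not the harness's own code: `pick_answer` and `toy_logprobs` are hypothetical names, the toy scorer stands in for a real model, and summing without length normalization is an assumption.

```python
import math
from typing import Callable, Sequence


def pick_answer(
    question: str,
    choices: Sequence[str],
    token_logprobs: Callable[[str, str], Sequence[float]],
) -> int:
    """Return the index of the choice whose full text is most likely.

    `token_logprobs(prompt, continuation)` is a hypothetical callable that
    returns one log-probability per token of `continuation` given `prompt`.
    """
    # Score the whole answer string, not just the leading letter token.
    scores = [sum(token_logprobs(question, choice)) for choice in choices]
    return max(range(len(choices)), key=scores.__getitem__)


def toy_logprobs(prompt: str, continuation: str) -> list:
    """Stand-in for a real model: one log-probability per whitespace token."""
    favoured = {"pancreas"}  # assumption made purely for this demo
    return [
        math.log(0.9) if tok.lower().strip(".") in favoured else math.log(0.2)
        for tok in continuation.split()
    ]


choices = ["a. Liver", "b. Pancreas", "c. Kidney"]
best = pick_answer("Which organ produces insulin?", choices, toy_logprobs)
print(choices[best])
```

With a real model, `toy_logprobs` would be replaced by a call that returns per-token log-probabilities of each answer conditioned on the chat-formatted question.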
123
 
124
  |Model|MMLU Pro|MMLU|MedMCQA|MedQA|USMLE|
125
  |---:|:---:|:---:|:---:|:---:|:---:|
126
+ |Med42v2-70B|64.36|87.12|73.20|79.10|83.80|
127
+ |Med42v2-8B|54.30|75.76|61.34|62.84|67.04|
128
  |OpenBioLLM|64.24|90.40|73.18|76.90|79.01|
129
  |GPT-4.0<sup>&dagger;</sup>|-|87.00|69.50|78.90|84.05|
130
  |MedGemini*|-|-|-|84.00|-|
 
137
 
138
  <sup>&dagger;</sup> *Results as reported in the paper [Capabilities of GPT-4 on Medical Challenge Problems](https://www.microsoft.com/en-us/research/uploads/prod/2023/03/GPT-4_medical_benchmarks.pdf)*.
139
 
140
+ ### Open-ended question generation
141
 
142
+ To ensure a robust evaluation of our model's output quality, we employ the LLM-as-a-Judge approach using Prometheus-8x7b-v2.0. Our assessment uses 4,000 publicly accessible healthcare-related questions, generating responses from various models. We then use Prometheus to conduct pairwise comparisons of the answers. Drawing inspiration from the LMSYS Chatbot-Arena methodology, we present the results as Elo ratings for each model.
143
+
144
+ To maintain fairness and eliminate potential bias from prompt engineering, we used the same simple system prompt for every model throughout the evaluation process.
145
+
146
+ Below is the scoring rubric we used to prompt Prometheus to select the best answer:
147
+
148
+ ```
149
+ ### Score Rubric:
150
+ Which response is of higher overall quality in a medical context? Consider:
151
+ * Relevance: Does it directly address the question?
152
+ * Completeness: Does it cover all important aspects, details and subpoints?
153
+ * Safety: Does it avoid unsafe practices and address potential risks?
154
+ * Ethics: Does it maintain confidentiality and avoid biases?
155
+ * Clarity: Is it professional, clear and easy to understand?
156
+ ```
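
To make the Elo step concrete, here is a minimal sketch of how pairwise judge verdicts could be turned into ratings. The README does not specify the exact procedure, so the sequential update rule, the K-factor of 32, the starting rating of 1000, and the model names are all assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def elo_ratings(
    comparisons: Iterable[Tuple[str, str]],
    k: float = 32.0,      # assumed K-factor
    base: float = 1000.0  # assumed starting rating
) -> Dict[str, float]:
    """Compute Elo ratings from (winner, loser) pairs, processed in order."""
    ratings = defaultdict(lambda: base)
    for winner, loser in comparisons:
        # Expected score of the winner under the logistic Elo model.
        expected_win = 1.0 / (
            1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0)
        )
        # The winner gains what the loser gives up, so the total is conserved.
        delta = k * (1.0 - expected_win)
        ratings[winner] += delta
        ratings[loser] -= delta
    return dict(ratings)


# Hypothetical judge verdicts: each tuple is (preferred model, other model).
judgements = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
]
ratings = elo_ratings(judgements)
for name, score in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.1f}")
```

Sequential Elo depends on the order of comparisons, so in practice the verdicts are often shuffled and the ratings averaged over several passes.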
157
 
158
 
159
 
160
  ## Accessing Med42 and Reporting Issues
161