Update README.md #1
by ronnierajan - opened
README.md CHANGED
@@ -12,7 +12,7 @@ inference: false
 license_name: llama3
 ---
 # **Med42-v2 - A Suite of Clinically-aligned Large Language Models**
-Med42-v2 is a suite of open-access clinical large language models (LLM) instruct and preference-tuned by M42 to expand access to medical knowledge. Built off LLaMA-3 and comprising either 8 or 70 billion parameters, these generative AI
+Med42-v2 is a suite of open-access clinical large language models (LLM) instruct and preference-tuned by M42 to expand access to medical knowledge. Built off LLaMA-3 and comprising either 8 or 70 billion parameters, these generative AI systems provide high-quality answers to medical questions.
 
 ## Key performance metrics:
 
@@ -34,7 +34,7 @@ Med42-v2 is a suite of open-access clinical large language models (LLM) instruct
 
 ## Limitations & Safe Use
 
-- Med42-v2 suite of models
+- The Med42-v2 suite of models is not ready for real clinical use. Extensive human evaluation is undergoing as it is essential to ensure safety.
 - Potential for generating incorrect or harmful information.
 - Risk of perpetuating biases in training data.
 
@@ -56,14 +56,14 @@ Beginning with Llama3 models, Med42-v2 were instruction-tuned using a dataset of
 
 **Output:** Model generates text only
 
-**Status:** This is a static model trained on an offline dataset. Future versions of the tuned models will be released as we enhance model's performance.
+**Status:** This is a static model trained on an offline dataset. Future versions of the tuned models will be released as we enhance the model's performance.
 
 **License:** Llama 3 Community License Agreement
 
-**Research Paper:** *
+**Research Paper:** *Coming soon*
 
 ## Intended Use
-Med42-v2 suite of models
+The Med42-v2 suite of models is being made available for further testing and assessment as AI assistants to enhance clinical decision-making and access to LLMs for healthcare use. Potential use cases include:
 - Medical question answering
 - Patient record summarization
 - Aiding medical diagnosis
@@ -131,7 +131,7 @@ The training was conducted on the NVIDIA DGX cluster with H100 GPUs, utilizing P
 
 ### Open-ended question generation
 
-To ensure a robust evaluation of our model's output quality, we employ the LLM-as-a-Judge approach using Prometheus-8x7b-v2.0. Our assessment uses carefully curated
+To ensure a robust evaluation of our model's output quality, we employ the LLM-as-a-Judge approach using Prometheus-8x7b-v2.0. Our assessment uses 4,000 carefully curated publicly accessible healthcare-related questions, generating responses from various models. We then use Prometheus to conduct pairwise comparisons of the answers. Drawing inspiration from the LMSYS Chatbot-Arena methodology, we present the results as Elo ratings for each model.
 
 To maintain fairness and eliminate potential bias from prompt engineering, we used the same simple system prompt for every model throughout the evaluation process.
 
@@ -166,7 +166,7 @@ Which response is of higher overall quality in a medical context? Consider:
 
 ### MCQA Evaluation
 
-Med42-v2 improves performance on every clinical benchmark compared to our previous version, including MedQA, MedMCQA, USMLE, MMLU clinical topics and MMLU Pro clinical subset. For all evaluations reported so far, we use [EleutherAI's evaluation harness library](https://github.com/EleutherAI/lm-evaluation-harness) and report zero-shot accuracies (except otherwise stated). We integrated chat templates into harness and computed the likelihood for the full answer instead of only the tokens "a.", "b.", "c." or "d.".
+Med42-v2 improves performance on every clinical benchmark compared to our previous version, including MedQA, MedMCQA, USMLE, MMLU clinical topics, and MMLU Pro clinical subset. For all evaluations reported so far, we use [EleutherAI's evaluation harness library](https://github.com/EleutherAI/lm-evaluation-harness) and report zero-shot accuracies (except otherwise stated). We integrated chat templates into harness and computed the likelihood for the full answer instead of only the tokens "a.", "b.", "c." or "d.".
 
 |Model|MMLU Pro|MMLU|MedMCQA|MedQA|USMLE|
 |---:|:---:|:---:|:---:|:---:|:---:|
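
The "Open-ended question generation" paragraph in this diff describes turning pairwise judge verdicts into Elo ratings. A minimal sketch of that step is shown here for illustration. It is not the authors' evaluation code: the K-factor of 32, the starting rating of 1000, the function names, and the sample verdicts are all assumptions. It also uses a simple sequential update, whereas Chatbot-Arena-style leaderboards may use other fitting procedures.

```python
from collections import defaultdict


def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_ratings(comparisons, k=32.0, initial=1000.0):
    """Sequentially update ratings from (winner, loser) pairs.

    `k` and `initial` are illustrative defaults, not values from the README.
    """
    ratings = defaultdict(lambda: initial)
    for winner, loser in comparisons:
        # The winner gains what the loser gives up, so the total is conserved.
        delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
        ratings[winner] += delta
        ratings[loser] -= delta
    return dict(ratings)


# Hypothetical judge verdicts: each tuple is (preferred model, other model).
comparisons = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
]
ratings = elo_ratings(comparisons)
for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.1f}")
```

Because sequential Elo depends on the order of comparisons, a real evaluation over thousands of questions would typically shuffle or bootstrap the comparison list before reporting ratings.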