--- language: - en license: llama3 tags: - m42 - health - healthcare - clinical-llm pipeline_tag: text-generation inference: false license_name: llama3 --- # **Med42-v2 - Clinical Large Language Models** Med42-v2 is a suite of open-access clinical large language models (LLM) instruct-tuned by M42 to expand access to medical knowledge. Built off LLaMA-3 and comprising either 8 or 70 billion parameters, these generative AI system provide high-quality answers to medical questions. ## Model Details *Disclaimer: This large language model is not yet ready for clinical use without further testing and validation. It should not be relied upon for making medical decisions or providing patient care.* Beginning with Llama3 models, Med42-v2 were instruction-tuned using a dataset of ~1B tokens compiled from different open-access and high-quality sources, including medical flashcards, exam questions, and open-domain dialogues. **Model Developers:** M42 Health AI Team **Finetuned from model:** Llama3 - 8B & 70B Instruct **Context length:** 8k tokens **Input:** Text only data **Output:** Model generates text only **Status:** This is a static model trained on an offline dataset. Future versions of the tuned models will be released as we enhance model's performance. **License:** Llama 3 Community License Agreement **Research Paper:** *Comming soon* ## Intended Use Med42-v2 suite of models are being made available for further testing and assessment as AI assistants to enhance clinical decision-making and enhance access to LLMs for healthcare use. Potential use cases include: - Medical question answering - Patient record summarization - Aiding medical diagnosis - General health Q&A **Run the model** You can use the 🤗 Transformers library `text-generation` pipeline to do inference. ```python import transformers import torch model_name_or_path = "m42-health/Llama3-Med42-70B" pipeline = transformers.pipeline( "text-generation", model=model_name_or_path, torch_dtype=torch.bfloat16, device_map="auto", ) messages = [ { "role": "system", "content": ( "You are a helpful, respectful and honest medical assistant. You are a second version of Med42 developed by the AI team at M42, UAE. " "Always answer as helpfully as possible, while being safe. " "Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. " "Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. " "If you don’t know the answer to a question, please don’t share false information." ), }, {"role": "user", "content": "What are the symptoms of diabetes?"}, ] prompt = pipeline.tokenizer.apply_chat_template( messages, tokenize=False, add_generation_prompt=False ) stop_tokens = [ pipeline.tokenizer.eos_token_id, pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>"), ] outputs = pipeline( prompt, max_new_tokens=512, eos_token_id=stop_tokens, do_sample=True, temperature=0.4, top_k=150, top_p=0.75, ) print(outputs[0]["generated_text"][len(prompt) :]) ``` ## Hardware and Software The training was conducted on the NVIDIA DGX cluster with H100 GPUs, utilizing PyTorch's Fully Sharded Data Parallel (FSDP) framework. ## Evaluation Results Med42-v2 improves performance on every clinical benchmark compared to our previous version, including MedQA, MedMCQA, USMLE, MMLU clinical topics and MMLU Pro clinical subset. For all evaluations reported so far, we use [EleutherAI's evaluation harness library](https://github.com/EleutherAI/lm-evaluation-harness) and report zero-shot accuracies (except otherwise stated). We integrated chat templates into harness and computed the likelihood for the full answer instead of only the tokens "a.", "b.", "c." or "d.". |Model|MMLU Pro|MMLU|MedMCQA|MedQA|USMLE| |---:|:---:|:---:|:---:|:---:|:---:| |Med42v2-70B|64.97|88.16|73.82|80.68|84.61| |Med42v2-8B|55.15|77.11|61.82|63.71|68.93| |OpenBioLLM|64.24|90.40|73.18|76.90|79.01| |GPT-4.0|-|87.00|69.50|78.90|84.05| |MedGemini*|-|-|-|84.00|-| |Med-PaLM-2(5-shot)*|-|87.77|71.30|79.70|-| |Med42|76.72|76.72|60.90|61.50|71.85| |ClinicalCamel-70B|69.75|69.75|47.00|53.40|54.30| |GPT-3.5|66.63|66.63|50.10|50.80|53.00| **For MedGemini, results are reported for MedQA without self-training and without search. We note that 0-shot performance is not reported for Med-PaLM 2. Further details can be found at [https://github.com/m42health/med42](https://github.com/m42health/med42)*. *Results as reported in the paper [Capabilities of GPT-4 on Medical Challenge Problems](https://www.microsoft.com/en-us/research/uploads/prod/2023/03/GPT-4_medical_benchmarks.pdf)*. ### Key performance metrics: - Med42-v2 outperforms GPT-4.0 in all clinically relevant tasks. - Med42-v2 achieves a MedQA zero-shot performance of 80.68, surpassing the prior state-of-the-art among all openly available medical LLMs. - Med42-v2 attains an 84.61% score on the USMLE (self-assessment and sample exam combined), marking the highest score achieved so far. ## Limitations & Safe Use - Med42-v2 suite of models are not ready for real clinical use. Extensive human evaluation is undergoing as it is required to ensure safety. - Potential for generating incorrect or harmful information. - Risk of perpetuating biases in training data. Use these suite of models responsibly! Do not rely on them for medical usage without rigorous safety testing. ## Accessing Med42 and Reporting Issues Please report any software "bug" or other problems through one of the following means: - Reporting issues with the model: [https://github.com/m42health/med42](https://github.com/m42health/med42) - Reporting risky content generated by the model, bugs and/or any security concerns: [https://forms.office.com/r/fPY4Ksecgf](https://forms.office.com/r/fPY4Ksecgf) - M42’s privacy policy available at [https://m42.ae/privacy-policy/](https://m42.ae/privacy-policy/) - Reporting violations of the Acceptable Use Policy or unlicensed uses of Med42: ## Citation ``` @article{christophe2023med42, title={Med42v2}, author={Christophe, Cl{\'e}ment and Hayat, Nasir and Kanithi, Praveen and Al-Mahrooqi, Ahmed and Munjal, Prateek and Pimentel, Marco and Raha, Tathagata and Rajan, Ronnie and Khan, Shadab}, journal={M42}, year={2023} } ```