
Me-LLaMA

Model Overview

The Me-LLaMA family consists of two foundation models, Me-LLaMA 13B and Me-LLaMA 70B, along with their chat-enhanced counterparts, Me-LLaMA 13B-chat and Me-LLaMA 70B-chat, which are designed for stronger chat and instruction-following capabilities. Me-LLaMA 13B and 70B were continually pretrained from the base LLaMA 2 13B and 70B models with the addition of biomedical, clinical, and general domain data. The chat versions were further instruction-tuned on comprehensive medical instruction tuning data.

Pretraining and Data

Me-LLaMA was developed through continual pre-training and instruction tuning of LLaMA 2, incorporating 129B pretraining tokens and 214K instruction tuning samples from the general, biomedical, and clinical domains. The pretraining data consists of biomedical literature, clinical notes, and general domain text in a 15:1:4 ratio (a rough mixing sketch follows the list below), sourced from:

  • Biomedical: PubMed Central and PubMed Abstracts (Pile dataset)
  • Clinical: De-identified free-text clinical notes from MIMIC III, MIMIC-IV, and MIMIC-CXR
  • General Domain: Subset from the RedPajama dataset
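
As a rough illustration of the 15:1:4 mixing above (this is not the authors' actual pretraining pipeline), the Hugging Face datasets library can interleave sources with fixed sampling probabilities; the file names below are placeholders:

```python
# Illustrative sketch only: the data files below are placeholders, not the
# actual Me-LLaMA pretraining corpora or preprocessing pipeline.
from datasets import load_dataset, interleave_datasets

biomedical = load_dataset("json", data_files="biomedical.jsonl", split="train")
clinical = load_dataset("json", data_files="clinical_notes.jsonl", split="train")
general = load_dataset("json", data_files="general_domain.jsonl", split="train")

# Sample the three sources in the stated 15:1:4 ratio, normalized to probabilities.
total = 15 + 1 + 4
mixed = interleave_datasets(
    [biomedical, clinical, general],
    probabilities=[15 / total, 1 / total, 4 / total],
    seed=42,
)
```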

The instruction tuning dataset includes:

  • General Domain: Alpaca, Dolly, and ShareGPT datasets
  • Biomedical: HealthCareMagic, Icliniq, MedInstruct, Medical Flash Cards, MEDIQA, MedicationQA, LiveQA, WikiDocPatient, Guideline QA, PubMed Central, PubMed, UMLS Knowledge Graph
  • Clinical: MIMIC-III and MIMIC-IV

Evaluation

Me-LLaMA was evaluated on 12 datasets across different tasks:

  • QA: PubMedQA, MedQA, MedMCQA, emrQA
  • NER: 2010 i2b2
  • Relation Extraction: 2013 DDI
  • Classification: HoC, MTSample
  • Text Summarization: PubMed, MIMIC-CXR
  • NLI: BioNLI, MedNLI

Performance

  • Me-LLaMA 13B: Surpassed PMC-LLaMA 13B on 11/12 datasets and LLaMA2 13B on 10/12 datasets, with competitive performance against larger models like LLaMA2 70B and Meditron 70B on 8/12 datasets.
  • Me-LLaMA 70B: Outperformed LLaMA2 70B and Meditron 70B on 9/12 datasets.
  • Zero-shot setting: Outperformed ChatGPT on 5/8 datasets without privacy concerns, and outperformed GPT-4 on 1/8.
  • Task-specific instruction tuning: Surpassed ChatGPT on 7/8 and GPT-4 on 5/8 datasets.

Despite having significantly fewer parameters (13B/70B vs. 175B+ for ChatGPT and GPT-4), the Me-LLaMA models demonstrated strong performance in both supervised and in-context learning across various medical tasks.

Model Details

Included in this repository are four models:

  1. Me-LLaMA 13B: Continually pretrained from LLaMA 2 13B.
  2. Me-LLaMA 70B: Continually pretrained from LLaMA 2 70B.
  3. Me-LLaMA 13B-chat: Further instruction-tuned from Me-LLaMA 13B using a variety of general, biomedical, and clinical datasets.
  4. Me-LLaMA 70B-chat: Further instruction-tuned from Me-LLaMA 70B using a variety of general, biomedical, and clinical datasets.

Each model directory contains the standard files used by the Transformers library (a loading sketch follows the list):

  • config.json: Information about the model
  • model-x-of-y.safetensors: Model weights
  • generation_config.json: Settings for text generation
  • special_tokens_map.json: Special tokens used in training
  • tokenizer.json: Mapping from indices to tokens
  • tokenizer_config.json: Configuration file for the tokenizer
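
The sketch below shows how these files are consumed when loading a model with the Transformers library. The local directory path is a placeholder, and the loading options (dtype, device placement) are only one reasonable choice:

```python
# Minimal loading sketch; "path/to/me-llama-13b" is a placeholder for wherever
# the downloaded files (config.json, *.safetensors, tokenizer files) are stored.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "path/to/me-llama-13b"

# Reads tokenizer.json, tokenizer_config.json, and special_tokens_map.json.
tokenizer = AutoTokenizer.from_pretrained(model_dir)

# Reads config.json, the model-x-of-y.safetensors shards, and generation_config.json.
# device_map="auto" requires the accelerate package.
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    torch_dtype="auto",
    device_map="auto",
)
```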

Usage

To access the models and for usage details, please visit the Me-LLaMA repository on PhysioNet.

For technical details, please see our paper on arXiv.
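
As a minimal, unofficial sketch of running one of the chat models with the Transformers library (the local path and the plain-string prompt are assumptions; the exact prompt template used in training may differ):

```python
# Generation sketch for a chat variant; paths and prompt format are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "path/to/me-llama-13b-chat"  # placeholder local path
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(
    model_dir, torch_dtype=torch.float16, device_map="auto"
)

prompt = "What are the common symptoms of type 2 diabetes?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)

# Strip the prompt tokens and decode only the newly generated text.
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```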
