Model Card for mirrorbert_MedRoBERTa.nl_meantoken

The model was trained on medical entity triplets (anchor, term, synonym).

Expected input and output

The input should be a string containing a biomedical entity name, e.g., "covid infection" or "Hydroxychloroquine". The mean of the last-layer token embeddings is regarded as the output.
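
The extraction script in this card averages the last-layer token embeddings with `.mean(1)`. The sketch below shows what that operation does on a toy array, and how it differs from picking the first-token ([CLS]) vector. The array `hidden` and its shape are illustrative assumptions, not real model activations.

```python
import numpy as np

# Toy stand-in for a last-layer output: (batch=2, seq_len=3, hidden=4).
# Values are made up for illustration, not real model activations.
hidden = np.arange(24, dtype=np.float32).reshape(2, 3, 4)

cls_vec = hidden[:, 0, :]        # first-token ([CLS]) vector per input
mean_vec = hidden.mean(axis=1)   # average over all token positions

print(cls_vec.shape, mean_vec.shape)
```

Both results have one vector per input, but the mean depends on every token position, including any padding positions present in the batch.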

Extracting embeddings from mirrorbert_MedRoBERTa.nl_meantoken

The following script converts a list of strings (entity names) into embeddings.

import numpy as np
import torch
from tqdm.auto import tqdm
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("UMCU/mirrorbert_MedRoBERTa.nl_meantoken")
model = AutoModel.from_pretrained("UMCU/mirrorbert_MedRoBERTa.nl_meantoken").cuda()

# replace with your own list of entity names
all_names = ["covid-19", "Coronavirus infection", "high fever", "Tumor of posterior wall of oropharynx"]

bs = 128 # batch size during inference
all_embs = []
for i in tqdm(range(0, len(all_names), bs)):
    # tokenize one batch; every name is padded or truncated to 25 tokens
    toks = tokenizer(all_names[i:i+bs],
                     padding="max_length",
                     max_length=25,
                     truncation=True,
                     return_tensors="pt")
    # move the batch to the GPU
    toks_cuda = {k: v.cuda() for k, v in toks.items()}
    # average the last-layer token embeddings (padding positions included)
    with torch.no_grad():
        mean_rep = model(**toks_cuda)[0].mean(1)
    all_embs.append(mean_rep.cpu().numpy())

all_embs = np.concatenate(all_embs, axis=0)
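
Once `all_embs` is available, a common next step is to compare entity names by cosine similarity. The following is a minimal sketch of that idea, not part of the model repository: the helper name `cosine_similarity_matrix` is hypothetical, and a random array stands in for `all_embs` so the example runs without downloading the model. The 768-dimensional width is an assumption based on a base-sized RoBERTa encoder.

```python
import numpy as np

def cosine_similarity_matrix(embs: np.ndarray) -> np.ndarray:
    """Return the pairwise cosine similarity of the rows of `embs`."""
    norms = np.linalg.norm(embs, axis=1, keepdims=True)
    unit = embs / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T

# Stand-in for `all_embs` (4 names, assumed 768 dimensions); replace with real output.
rng = np.random.default_rng(0)
demo_embs = rng.normal(size=(4, 768)).astype(np.float32)

sims = cosine_similarity_matrix(demo_embs)
np.fill_diagonal(sims, -np.inf)      # ignore self-matches
nearest = sims.argmax(axis=1)        # index of the closest other name
print(nearest)
```

With real embeddings, `nearest[i]` gives the position in `all_names` of the name most similar to `all_names[i]`.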

Data description

Hard Dutch UMLS/SNOMED synonym pairs (terms referring to the same CUI/SCUI), including English medication names.

Acknowledgement

This is part of the DT4H project.

DOI and reference

For more details about training and evaluation, see the MirrorBERT GitHub repository.

Citation

@inproceedings{liu-etal-2021-fast,
    title = "Fast, Effective, and Self-Supervised: Transforming Masked Language Models into Universal Lexical and Sentence Encoders",
    author = "Liu, Fangyu  and
      Vuli{\'c}, Ivan  and
      Korhonen, Anna  and
      Collier, Nigel",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.109",
    pages = "1442--1459",
}

For more details about training, evaluation, and other scripts, see the CardioNER GitHub repository. For more information on the background, see the Datatools4Heart Hugging Face page and website.

Model size: 126M params (Safetensors, tensor type F32)