---
license: gpl-3.0
---

Pre-trained word embeddings built from the text of published clinical case reports. The embeddings are 100-dimensional and were trained with the word2vec algorithm on clinical case reports from the [PMC Open Access Subset](https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/). See the paper here: https://pubmed.ncbi.nlm.nih.gov/34920127/

Citation:

```
@article{flamholz2022word,
  title={Word embeddings trained on published case reports are lightweight, effective for clinical tasks, and free of protected health information},
  author={Flamholz, Zachary N and Crane-Droesch, Andrew and Ungar, Lyle H and Weissman, Gary E},
  journal={Journal of Biomedical Informatics},
  volume={125},
  pages={103971},
  year={2022},
  publisher={Elsevier}
}
```

## Quick start

The word embeddings are saved in a format compatible with the [`gensim` Python package](https://radimrehurek.com/gensim/).

First, download the files from this archive; one option is sketched below.
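
If the archive is hosted on the Hugging Face Hub (suggested by the model-card front matter, though not stated explicitly here), `huggingface_hub` can fetch all of the files at once. The `repo_id` below is a placeholder, not the archive's actual identifier:

```python
# Hedged sketch: download every file in this archive, assuming it lives on the
# Hugging Face Hub. Replace the placeholder repo_id with the archive's actual identifier.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id='<user>/<this-archive>')  # placeholder repo_id
print(local_dir)  # contains w2v_oa_cr_100d.bin plus any companion files;
                  # load from here or copy the files into your working directory
```

Then load the embeddings into Python: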


```python
# Only Word2Vec is needed below; KeyedVectors is used to load the GloVe models
from gensim.models import FastText, Word2Vec, KeyedVectors

# Load the word2vec model
model = Word2Vec.load('w2v_oa_cr_100d.bin')

# Return the 100-dimensional vector for a word
# (multi-word clinical concepts appear as underscore-joined tokens)
model.wv['diabetes']
model.wv['cardiac_arrest']
model.wv['lymphangioleiomyomatosis']

# Try out cosine similarity
model.wv.similarity('copd', 'chronic_obstructive_pulmonary_disease')
model.wv.similarity('myocardial_infarction', 'heart_attack')
model.wv.similarity('lymphangioleiomyomatosis', 'lam')
```
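
Continuing from the snippet above (`model` is already loaded), the same `KeyedVectors` interface also supports nearest-neighbour queries. A short sketch; the query terms here are only illustrative and assumed to be in the vocabulary:

```python
# Nearest neighbours in the embedding space via the standard gensim KeyedVectors API.
# The query terms are illustrative; any token in the model's vocabulary works.
for term in ['sepsis', 'myocardial_infarction']:
    if term in model.wv:  # membership check avoids a KeyError for out-of-vocabulary terms
        print(term, model.wv.most_similar(term, topn=5))
```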