---
license: mit
inference: false
tags:
- molecule-generation
- cheminformatics
- targeted-drug-design
- biochemical-language-models
---
## WarmMolGenOne

WarmMolGenOne is a target-specific molecule generator that is warm-started (i.e. initialized) from pretrained biochemical language models and trained on interacting protein-compound pairs, framing targeted molecular generation as a translation task between the protein and molecular languages. It was introduced in the paper ["Exploiting pretrained biochemical language models for targeted drug design"](https://doi.org/10.1093/bioinformatics/btac482), published in *Bioinformatics* (Oxford University Press), and first released in [this repository](https://github.com/boun-tabi/biochemical-lms-for-drug-design).

The model is a Transformer-based encoder-decoder initialized with [Protein RoBERTa](https://github.com/PaccMann/paccmann_proteomics) and [ChemBERTa](https://huggingface.co/seyonec/PubChem10M_SMILES_BPE_450k) checkpoints and trained on interacting protein-compound pairs filtered from [BindingDB](https://www.bindingdb.org/rwd/bind/index.jsp). It takes a protein sequence as input and generates a SMILES sequence.
## How to use
```python
from transformers import EncoderDecoderModel, RobertaTokenizer

# Protein (encoder) tokenizer and SMILES (decoder) tokenizer
protein_tokenizer = RobertaTokenizer.from_pretrained("gokceuludogan/WarmMolGenOne")
mol_tokenizer = RobertaTokenizer.from_pretrained("seyonec/PubChem10M_SMILES_BPE_450k")

model = EncoderDecoderModel.from_pretrained("gokceuludogan/WarmMolGenOne")

# Encode a target protein sequence
inputs = protein_tokenizer("MENTENSVDSKSIKNLEPKIIHGSESMDSGISLDNSYKMDYPEMGLCIIINNKNFHKSTG",
                           return_tensors="pt")

# Sample five candidate SMILES strings for the target
outputs = model.generate(**inputs,
                         decoder_start_token_id=mol_tokenizer.bos_token_id,
                         eos_token_id=mol_tokenizer.eos_token_id,
                         pad_token_id=mol_tokenizer.eos_token_id,
                         max_length=128,
                         num_return_sequences=5,
                         do_sample=True,
                         top_p=0.95)

mol_tokenizer.batch_decode(outputs, skip_special_tokens=True)
# Sample output
# ['Cn1cc(nn1)-c1ccccc1NS(=O)(=O)c1ccc2[nH]ccc2c1',
#  'CC(C)(C)c1[se]nc2sc(cc12)C(O)=O',
#  '[O-][N+](=O)c1ccc(CN2CCC(CC2)NC(=O)c2cccc3ccccc23)cc1',
#  'OC(=O)CNC(=O)CCC\\C=C\\CN1[C@@H](Cc2cn(nn2)-c2ccccc2)C(=O)N[C@@H](CCCN2C(S)=NC(C)(C2=O)c2ccc(F)cc2)C1=O',
#  'OCC1(CCC1)C(=O)NCC1CCN(CC1)c1nc(c(s1)-c1ccc2OCOc2c1)C(O)=O']
```
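Since sampling can occasionally produce malformed strings, it can be useful to filter the generated SMILES for chemical validity. The snippet below is a minimal, optional post-processing sketch that continues from the example above and assumes RDKit is installed (it is not a dependency of the model itself).

```python
from rdkit import Chem

generated = mol_tokenizer.batch_decode(outputs, skip_special_tokens=True)

# Keep only strings that RDKit can parse into a molecule
valid_smiles = [smi for smi in generated if Chem.MolFromSmiles(smi) is not None]
print(f"{len(valid_smiles)}/{len(generated)} generated SMILES are valid")
```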
## Citation
```bibtex
@article{10.1093/bioinformatics/btac482,
author = {Uludoğan, Gökçe and Ozkirimli, Elif and Ulgen, Kutlu O. and Karalı, Nilgün Lütfiye and Özgür, Arzucan},
title = "{Exploiting Pretrained Biochemical Language Models for Targeted Drug Design}",
journal = {Bioinformatics},
year = {2022},
doi = {10.1093/bioinformatics/btac482},
url = {https://doi.org/10.1093/bioinformatics/btac482}
}
```