|
--- |
|
license: mit |
|
inference: false |
|
tags: |
|
- molecule-generation |
|
- cheminformatics |
|
- targeted-drug-design |
|
- biochemical-language-models |
|
--- |
|
|
|
## WarmMolGenOne |
|
|
|
WarmMolGenOne is a target-specific molecule generation model that is warm-started (i.e., initialized) from pretrained biochemical language models and trained on interacting protein-compound pairs, casting targeted molecule generation as a translation task between protein and molecular languages. It was introduced in the paper "Exploiting pretrained biochemical language models for targeted drug design", published in *Bioinformatics* (Oxford University Press), and first released in [this repository](https://github.com/boun-tabi/biochemical-lms-for-drug-design).
|
|
|
WarmMolGenOne is a Transformer-based encoder-decoder model initialized with [Protein RoBERTa](https://github.com/PaccMann/paccmann_proteomics) and [ChemBERTa](https://huggingface.co/seyonec/PubChem10M_SMILES_BPE_450k) checkpoints and trained on interacting protein-compound pairs filtered from [BindingDB](https://www.bindingdb.org/rwd/bind/index.jsp). The model takes a protein sequence as an input and outputs a SMILES sequence. |
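
The released checkpoint already contains the trained encoder-decoder weights, so warm-starting is only relevant if you want to reproduce the initialization described above. Below is a minimal sketch of that step using the `transformers` encoder-decoder wrapper; the protein encoder path is a placeholder (the Protein RoBERTa checkpoint is distributed through the PaccMann repository linked above), and the full training setup is in the paper's repository.

```python
from transformers import EncoderDecoderModel

# Warm-start an encoder-decoder model from two pretrained checkpoints.
# "path/to/protein-roberta" is a placeholder for a local copy of the
# Protein RoBERTa encoder; the decoder is the public ChemBERTa checkpoint.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "path/to/protein-roberta",             # protein encoder (placeholder path)
    "seyonec/PubChem10M_SMILES_BPE_450k",  # SMILES decoder
)

# from_encoder_decoder_pretrained adds cross-attention to the decoder and
# marks it as autoregressive; fine-tuning on protein-compound pairs
# (e.g. filtered from BindingDB) would follow from here.
```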
|
|
|
## How to use |
|
|
|
```python |
|
from transformers import EncoderDecoderModel, RobertaTokenizer, pipeline |
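
# Encoder-side tokenizer handles protein sequences; decoder-side tokenizer handles SMILES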
|
protein_tokenizer = RobertaTokenizer.from_pretrained("gokceuludogan/WarmMolGenOne") |
|
mol_tokenizer = RobertaTokenizer.from_pretrained("seyonec/PubChem10M_SMILES_BPE_450k") |
|
model = EncoderDecoderModel.from_pretrained("gokceuludogan/WarmMolGenOne") |
|
inputs = protein_tokenizer("MENTENSVDSKSIKNLEPKIIHGSESMDSGISLDNSYKMDYPEMGLCIIINNKNFHKSTG", return_tensors="pt")
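
# Generate five candidate SMILES with nucleus sampling (top_p=0.95)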
|
outputs = model.generate(**inputs, decoder_start_token_id=mol_tokenizer.bos_token_id, |
|
eos_token_id=mol_tokenizer.eos_token_id, pad_token_id=mol_tokenizer.eos_token_id, |
|
max_length=128, num_return_sequences=5, do_sample=True, top_p=0.95) |
|
mol_tokenizer.batch_decode(outputs, skip_special_tokens=True) |
|
# Sample output |
|
# ['Cn1cc(nn1)-c1ccccc1NS(=O)(=O)c1ccc2[nH]ccc2c1', |
|
# 'CC(C)(C)c1[se]nc2sc(cc12)C(O)=O', |
|
# '[O-][N+](=O)c1ccc(CN2CCC(CC2)NC(=O)c2cccc3ccccc23)cc1', |
|
# 'OC(=O)CNC(=O)CCC\\C=C\\CN1[C@@H](Cc2cn(nn2)-c2ccccc2)C(=O)N[C@@H](CCCN2C(S)=NC(C)(C2=O)c2ccc(F)cc2)C1=O', |
|
# 'OCC1(CCC1)C(=O)NCC1CCN(CC1)c1nc(c(s1)-c1ccc2OCOc2c1)C(O)=O'] |
|
``` |
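
Generated SMILES strings are not guaranteed to be chemically valid, so a common post-processing step is to filter them with a cheminformatics toolkit. A minimal sketch, assuming RDKit is installed (this step is not part of the model itself):

```python
from rdkit import Chem

generated = mol_tokenizer.batch_decode(outputs, skip_special_tokens=True)

# Keep only strings that RDKit can parse into a molecule object
valid = [smi for smi in generated if Chem.MolFromSmiles(smi) is not None]
print(f"{len(valid)}/{len(generated)} generated SMILES are valid")
```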
|
|
|
## Citation |
|
|
|
```bibtex |
|
@article{10.1093/bioinformatics/btac482, |
|
author = {Uludoğan, Gökçe and Ozkirimli, Elif and Ulgen, Kutlu O. and Karalı, Nilgün Lütfiye and Özgür, Arzucan}, |
|
title = "{Exploiting Pretrained Biochemical Language Models for Targeted Drug Design}", |
|
journal = {Bioinformatics}, |
|
year = {2022}, |
|
doi = {10.1093/bioinformatics/btac482}, |
|
url = {https://doi.org/10.1093/bioinformatics/btac482} |
|
} |
|
``` |
|
|
|
|