	---
	tags:
	- chemistry
	widget:
	- text: <LIGAND>
	  example_title: Generate molecule
	---
	# BindGPT: A Scalable Framework for 3D Molecular Design via Language Modeling and Reinforcement Learning

	![alt text](./images/scheme.svg "Main image")

	> BindGPT is a new framework for building drug discovery models that leverages compute-efficient pretraining, supervised finetuning, prompting, reinforcement learning, and tool use of LMs. This allows BindGPT to build a single pre-trained model that exhibits state-of-the-art performance in 3D Molecule Generation, 3D Conformer Generation, and Pocket-Conditioned 3D Molecule Generation, posing them as downstream tasks for a pretrained model, while previous methods build task-specialized models without task transfer abilities. At the same time, thanks to fast transformer inference technology, BindGPT is 2 orders of magnitude (100 times) faster than previous methods at generation.
	- Website: https://bindgpt.github.io
	- Repository: https://github.com/insilicomedicine/bindgpt
	- Paper: https://arxiv.org/abs/2406.03686


	This page provides the pretrained version of BindGPT.
	The pretrained model is capable of zero-shot molecule generation and conformer generation within
	the distribution of the [Uni-Mol](https://github.com/deepmodeling/Uni-Mol) dataset.
	We also expose finetuned models:

	- For the model finetuned on GEOM-DRUGS, visit [huggingface.co/insilicomedicine/bindgpt_finetuned](https://huggingface.co/insilicomedicine/bindgpt_finetuned)
	- The model finetuned with Reinforcement Learning on CrossDocked is coming soon


	## Unconditional generation

	The code below provides a minimal standalone example of
	sampling molecules from the model. It depends only on
	`transformers`, `tokenizers`, `rdkit`, and `pytorch`,
	and it is not meant to reproduce the sampling speed reported
	in the paper (e.g. it does not use flash-attention, mixed precision,
	or large-batch sampling).
	To reproduce the sampling speed, please use the code from our repository:
	https://github.com/insilicomedicine/bindgpt

	```python
	# Download the model from Hugging Face:
	from transformers import AutoTokenizer, AutoModelForCausalLM

	tokenizer = AutoTokenizer.from_pretrained("insilicomedicine/bindgpt_pretrained")
	model = AutoModelForCausalLM.from_pretrained("insilicomedicine/bindgpt_pretrained").cuda()

	# Generate 10 tokenized molecules without condition
	NUM_SAMPLES = 10

	start_tokens = tokenizer("<LIGAND>", return_tensors="pt")
	outputs = model.generate(
	    # remove EOS token to continue generation
	    input_ids=start_tokens['input_ids'][:, :-1].cuda(),
	    attention_mask=start_tokens['attention_mask'][:, :-1].cuda(),
	    do_sample=True, max_length=400, num_return_sequences=NUM_SAMPLES
	)


	# parse results
	import re
	from rdkit import Chem

	def parse_molecule(s):
	    """Turn a decoded '<LIGAND>SMILES<XYZ>coords' string into an RDKit Mol, or None."""
	    try:
	        assert '<LIGAND>' in s and '<XYZ>' in s
	        # split on either marker: [prefix, SMILES, coordinates]
	        _, smiles, xyz = re.split(r'<LIGAND>|<XYZ>', s)
	        smiles = re.sub(r'\s', '', smiles)
	        conf = Chem.Conformer()
	        mol = Chem.MolFromSmiles(smiles)
	        assert mol is not None
	        coords = list(map(float, xyz.split(' ')[2:]))
	        assert len(coords) == (3 * mol.GetNumAtoms())
	        for j in range(mol.GetNumAtoms()):
	            conf.SetAtomPosition(j, [coords[3*j], coords[3*j+1], coords[3*j+2]])
	        mol.AddConformer(conf)
	        return mol
	    except (AssertionError, ValueError):
	        # malformed sample: missing markers, invalid SMILES, or bad coordinates
	        return None

	string_molecules = tokenizer.batch_decode(outputs, skip_special_tokens=True)
	molecules = [parse_molecule(mol) for mol in string_molecules]
	```
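To inspect raw samples without RDKit, the text-splitting steps of `parse_molecule` can be mirrored in plain Python. The block below is a minimal sketch, not part of the BindGPT codebase: the helper name `split_ligand_string` and the example string are hypothetical, and the string is hand-written rather than real model output. Unlike `parse_molecule`, it cannot check the coordinate count against the number of atoms, so it only verifies that the coordinates form complete triples.

```python
import re

def split_ligand_string(s):
    """Split '<LIGAND>SMILES<XYZ> ...' into (smiles, [(x, y, z), ...]), or None."""
    parts = re.split(r'<LIGAND>|<XYZ>', s)
    if len(parts) != 3:
        return None  # markers missing or repeated
    _, smiles, xyz = parts
    smiles = re.sub(r'\s', '', smiles)
    try:
        # Like the README parser, skip the first two space-separated fields.
        coords = [float(v) for v in xyz.split(' ')[2:]]
    except ValueError:
        return None  # a coordinate field was not a number
    if not coords or len(coords) % 3 != 0:
        return None  # coordinates do not form complete (x, y, z) triples
    triples = [tuple(coords[3 * j:3 * j + 3]) for j in range(len(coords) // 3)]
    return smiles, triples

# Hypothetical hand-written string, NOT real model output.
example = "<LIGAND>C O<XYZ> 2 0.0 0.0 0.0 1.4 0.0 0.0"
result = split_ligand_string(example)
print(result)
```

Returning `None` for malformed strings follows the same convention as `parse_molecule`, so both can be filtered the same way.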

	## Conformer generation

	The code below provides a minimal standalone example of
	sampling conformers for given molecules from the model. It reuses the
	`tokenizer`, `model`, and `parse_molecule` defined in the previous section,
	depends only on `transformers`, `tokenizers`, `rdkit`, and `pytorch`,
	and is not meant to reproduce the sampling speed reported
	in the paper (e.g. it does not use flash-attention, mixed precision,
	or large-batch sampling).
	To reproduce the sampling speed, please use the code from our repository:
	https://github.com/insilicomedicine/bindgpt

	```python
	smiles = [
	    'O=c1n(CCO)c2ccccc2n1CCO',
	    'Cc1ccc(C#N)cc1S(=O)(=O)NCc1ccnc(OC(C)(C)C)c1',
	    'COC(=O)Cc1csc(NC(=O)Cc2coc3cc(C)ccc23)n1',
	]

	# pad on the left so that all prompts are right-aligned
	tokenizer.padding_side = 'left'
	# Do not forget to add the <XYZ> token
	# after the smiles, otherwise the model might
	# want to continue generating the molecule :)
	prompts = tokenizer(
	    ["<LIGAND>" + s + '<XYZ>' for s in smiles], return_tensors="pt",
	    truncation=True, padding=True,
	)

	# Generate 1 conformer per molecule
	outputs = model.generate(
	    # remove EOS token to continue generation
	    input_ids=prompts['input_ids'][:, :-1].cuda(),
	    attention_mask=prompts['attention_mask'][:, :-1].cuda(),
	    do_sample=True, max_length=400,
	    # you can combine this type of conditional generation
	    # with multi-sample generation.
	    # to sample many conformers per molecule, uncomment this
	    # num_return_sequences=10
	)

	# parse results
	string_molecules = tokenizer.batch_decode(outputs, skip_special_tokens=True)
	molecules = [parse_molecule(mol) for mol in string_molecules]
	```
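When `num_return_sequences` is greater than 1, `molecules` becomes a flat list covering all prompts. The sketch below shows one way to regroup it per input SMILES. It assumes that `generate` returns the samples for each prompt contiguously (all samples for the first prompt, then all for the second, and so on), which is worth verifying against your `transformers` version. The helper `group_by_prompt` and the placeholder values are hypothetical and stand in for parsed RDKit molecules.

```python
def group_by_prompt(items, num_prompts):
    """Split a flat, prompt-major list into one sub-list per prompt."""
    if num_prompts <= 0 or len(items) % num_prompts != 0:
        raise ValueError("items cannot be split evenly across prompts")
    per_prompt = len(items) // num_prompts
    return [items[i * per_prompt:(i + 1) * per_prompt] for i in range(num_prompts)]

# Placeholder data: two prompts with two samples each.
# Strings stand in for RDKit molecules; None marks a sample that failed to parse.
smiles = ['CCO', 'c1ccccc1']
flat_molecules = ['mol_a1', None, 'mol_b1', 'mol_b2']

grouped = dict(zip(smiles, group_by_prompt(flat_molecules, len(smiles))))
# Count successfully parsed conformers per input molecule.
valid_counts = {s: sum(m is not None for m in mols) for s, mols in grouped.items()}
print(valid_counts)
```

Counting the non-`None` entries per prompt gives a quick view of how many sampled conformers parsed successfully for each molecule.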

	## Usage and License

	Please note that all model weights are exclusively licensed for research purposes. The accompanying dataset is licensed under CC BY 4.0, which permits solely non-commercial usage.
	We emphatically urge all users to adhere to the highest ethical standards when using our models, including maintaining fairness, transparency, and responsibility in their research. Any usage that may lead to harm or pose a detriment to society is strictly forbidden.


	## References
	If you use our repository, please cite the following related paper:

	```
	@article{zholus2021bindgpt,
	  author = {Artem Zholus and Maksim Kuznetsov and Roman Schutski and Rim Shayakhmetov and Daniil Polykovskiy and Sarath Chandar and Alex Zhavoronkov},
	  title = {BindGPT: A Scalable Framework for 3D Molecular Design via Language Modeling and Reinforcement Learning},
	  journal = {arXiv},
	  year = {2024},
	}
	```