Model documentation & parameters
Algorithm Version: Which model version (either protein-target-driven or gene-expression-profile-driven) to use and which checkpoint to rely on.
Inference type: Whether the model should be conditioned on the target (default) or used in an Unbiased manner.
Protein target: The amino acid sequence (AAS) of the protein target used for conditioning. Only use if Inference type is Conditional and the Algorithm version is a Protein model.
Gene expression target: A list of 2128 floats representing the embedding of the gene expression profile used for conditioning. Only use if Inference type is Conditional and the Algorithm version is an Omic model.
Decoding temperature: The temperature parameter in the SMILES/SELFIES decoder. Higher values lead to more explorative choices; lower values concentrate sampling on the most likely tokens and can culminate in mode collapse.
Maximal sequence length: The maximal number of SMILES tokens in the generated molecule.
Number of samples: How many samples should be generated (between 1 and 50).
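The parameter rules above can be sketched in plain Python. This is a minimal illustration and not the GT4SD implementation: the function names, the "protein"/"omic" labels, and the choice to reject (rather than silently ignore) a target passed in unbiased mode are all assumptions made for this example. The second function shows the usual temperature-scaled softmax, which is one common way a decoding temperature is applied; whether the PaccMannRL decoder does exactly this is not stated here.

```python
import math
from typing import List, Optional, Sequence

OMIC_EMBEDDING_SIZE = 2128  # length of the gene expression embedding stated above
MAX_SAMPLES = 50            # upper bound on the number of samples stated above


def check_request(
    model_kind: str,  # hypothetical labels: "protein" or "omic"
    conditional: bool,
    protein_target: Optional[str] = None,
    omic_target: Optional[Sequence[float]] = None,
    number_of_samples: int = 10,
) -> None:
    """Raise ValueError if the parameter combination contradicts the rules above."""
    if model_kind not in ("protein", "omic"):
        raise ValueError("model_kind must be 'protein' or 'omic'")
    if not 1 <= number_of_samples <= MAX_SAMPLES:
        raise ValueError("number_of_samples must be between 1 and 50")
    if not conditional:
        # Assumption of this sketch: targets are rejected in unbiased mode.
        if protein_target is not None or omic_target is not None:
            raise ValueError("targets are only used for conditional inference")
        return
    if model_kind == "protein":
        if not protein_target or omic_target is not None:
            raise ValueError("a protein model needs an amino acid sequence only")
    else:
        if protein_target is not None or omic_target is None:
            raise ValueError("an omic model needs a gene expression embedding only")
        if len(omic_target) != OMIC_EMBEDDING_SIZE:
            raise ValueError("the gene expression embedding must have 2128 floats")


def softmax_with_temperature(logits: Sequence[float], temperature: float) -> List[float]:
    """Turn token logits into probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    check_request("protein", True, protein_target="MVLSPADKTN")  # made-up sequence
    for t in (0.5, 1.0, 2.0):
        print(t, softmax_with_temperature([2.0, 1.0, 0.0], t))
```

With the same logits, a lower temperature puts more probability on the top token and a higher temperature flattens the distribution, which is the exploration versus mode-collapse trade-off described for the decoding temperature.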
Model card -- PaccMannRL
Model Details: PaccMannRL is a language model for conditional molecular design. It consists of a domain-specific encoder (for protein targets or gene expression profiles) and a generic molecular decoder. Both components are finetuned together using RL to convert the context representation into a molecule with high affinity toward the context (i.e., binding affinity to the protein or high inhibitory effect for the cell profile).
Developers: Jannis Born, Matteo Manica and colleagues from IBM Research.
Distributors: Original authors' code wrapped and distributed by GT4SD Team (2023) from IBM Research.
Model date: Published in 2021.
Model version: Models trained and distributed by the original authors.
- Protein_v0: Molecular decoder pretrained on 1.5M molecules from ChEMBL. Protein encoder pretrained on 404k proteins from UniProt. Encoder and decoder finetuned on 41 SARS-CoV-2-related protein targets with a binding affinity predictor trained on BindingDB.
- Omic_v0: Molecular decoder pretrained on 1.5M molecules from ChEMBL. Gene expression encoder pretrained on 12k gene expression profiles from TCGA. Encoder and decoder finetuned on a few hundred cancer cell profiles from GDSC with an IC50 predictor trained on GDSC.
Model type: A language-based molecular generative model that can be optimized with RL to generate molecules with high affinity toward a context.
Information about training algorithms, parameters, fairness constraints or other applied approaches, and features:
- Protein: Parameters as provided on (GitHub repo).
- Omics: Parameters as provided on (GitHub repo).
Paper or other resource for more information:
- Protein: Data-driven molecular design for discovery and synthesis of novel ligands: a case study on SARS-CoV-2 (2021; Machine Learning: Science and Technology).
- Omics: PaccMannRL: De novo generation of hit-like anticancer molecules from transcriptomic data via reinforcement learning (2021; iScience).
License: MIT
Where to send questions or comments about the model: Open an issue on the GT4SD repository.
Intended Use. Use cases that were envisioned during development: Chemical research, in particular drug discovery.
Primary intended uses/users: Researchers and computational chemists using the model for model comparison or research exploration purposes.
Out-of-scope use cases: Production-level inference, producing molecules with harmful properties.
Factors: Not applicable.
Metrics: High reward on generating molecules with high affinity toward context.
Datasets: ChEMBL, UniProt, GDSC and BindingDB (see above).
Ethical Considerations: Unclear; please consult the original authors in case of questions.
Caveats and Recommendations: Unclear; please consult the original authors in case of questions.
Model card prototype inspired by Mitchell et al. (2019)
Citation
Omics:
@article{born2021paccmannrl,
title = {PaccMann\textsuperscript{RL}: De novo generation of hit-like anticancer molecules from transcriptomic data via reinforcement learning},
journal = {iScience},
volume = {24},
number = {4},
pages = {102269},
year = {2021},
issn = {2589-0042},
doi = {10.1016/j.isci.2021.102269},
url = {https://www.cell.com/iscience/fulltext/S2589-0042(21)00237-6},
author = {Born, Jannis and Manica, Matteo and Oskooei, Ali and Cadow, Joris and Markert, Greta and {Rodr{\'{i}}guez Mart{\'{i}}nez}, Mar{\'{i}}a}
}
Proteins:
@article{born2021datadriven,
author = {Born, Jannis and Manica, Matteo and Cadow, Joris and Markert, Greta and Mill, Nil Adell and Filipavicius, Modestas and Janakarajan, Nikita and Cardinale, Antonio and Laino, Teodoro and {Rodr{\'{i}}guez Mart{\'{i}}nez}, Mar{\'{i}}a},
doi = {10.1088/2632-2153/abe808},
issn = {2632-2153},
journal = {Machine Learning: Science and Technology},
number = {2},
pages = {025024},
title = {{Data-driven molecular design for discovery and synthesis of novel ligands: a case study on SARS-CoV-2}},
url = {https://iopscience.iop.org/article/10.1088/2632-2153/abe808},
volume = {2},
year = {2021}
}