# Model documentation & parameters

**Algorithm Version**: Which model version (either protein-target-driven or gene-expression-profile-driven) to use and which checkpoint to rely on.

**Inference type**: Whether the model should be conditioned on the target (`Conditional`, the default) or used in an `Unbiased` manner.

**Protein target**: An amino acid sequence (AAS) of a protein target used for conditioning. Only use if `Inference type` is `Conditional` and the `Algorithm version` is a Protein model.

**Gene expression target**: A list of 2128 floats representing the embedding of the gene expression profile to be used for conditioning. Only use if `Inference type` is `Conditional` and the `Algorithm version` is an Omic model.

**Decoding temperature**: The temperature parameter in the SMILES/SELFIES decoder. Higher values lead to more explorative choices; lower values make sampling greedier and can culminate in mode collapse.

**Maximal sequence length**: The maximal number of SMILES tokens in the generated molecule.

**Number of samples**: How many samples should be generated (between 1 and 50).

# Model card -- PaccMannRL

**Model Details**: PaccMannRL is a language model for conditional molecular design. It consists of a domain-specific encoder (for protein targets or gene expression profiles) and a generic molecular decoder. Both components are fine-tuned together using reinforcement learning (RL) to convert the context representation into a molecule with high affinity toward the context (i.e., binding affinity to the protein or a high inhibitory effect on the cell profile).

**Developers**: Jannis Born, Matteo Manica and colleagues from IBM Research.

**Distributors**: Original authors' code wrapped and distributed by GT4SD Team (2023) from IBM Research.

**Model date**: Published in 2021.

**Model version**: Models trained and distributed by the original authors.
- **Protein_v0**: Molecular decoder pretrained on 1.5M molecules from ChEMBL. Protein encoder pretrained on 404k proteins from UniProt.
  Encoder and decoder fine-tuned on 41 SARS-CoV-2-related protein targets with a binding affinity predictor trained on BindingDB.
- **Omic_v0**: Molecular decoder pretrained on 1.5M molecules from ChEMBL. Gene expression encoder pretrained on 12k gene expression profiles from TCGA. Encoder and decoder fine-tuned on a few hundred cancer cell profiles from GDSC with an IC50 predictor trained on GDSC.

**Model type**: A language-based molecular generative model that can be optimized with RL to generate molecules with high affinity toward a context.

**Information about training algorithms, parameters, fairness constraints or other applied approaches, and features**:
- **Protein**: Parameters as provided on [(GitHub repo)](https://github.com/PaccMann/paccmann_sarscov2).
- **Omics**: Parameters as provided on [(GitHub repo)](https://github.com/PaccMann/paccmann_rl).

**Paper or other resource for more information**:
- **Protein**: [Data-driven molecular design for discovery and synthesis of novel ligands: a case study on SARS-CoV-2 (2021; *Machine Learning: Science and Technology*)](https://iopscience.iop.org/article/10.1088/2632-2153/abe808/meta).
- **Omics**: [PaccMannRL: De novo generation of hit-like anticancer molecules from transcriptomic data via reinforcement learning (2021; *iScience*)](https://www.cell.com/iscience/fulltext/S2589-0042(21)00237-6).

**License**: MIT

**Where to send questions or comments about the model**: Open an issue on [GT4SD repository](https://github.com/GT4SD/gt4sd-core).

**Intended Use. Use cases that were envisioned during development**: Chemical research, in particular drug discovery.

**Primary intended uses/users**: Researchers and computational chemists using the model for model comparison or research exploration purposes.

**Out-of-scope use cases**: Production-level inference, producing molecules with harmful properties.

**Factors**: Not applicable.

**Metrics**: High reward on generating molecules with high affinity toward context.


**Datasets**: ChEMBL, UniProt, GDSC and BindingDB (see above).

**Ethical Considerations**: Unclear, please consult with original authors in case of questions.

**Caveats and Recommendations**: Unclear, please consult with original authors in case of questions.

Model card prototype inspired by [Mitchell et al. (2019)](https://dl.acm.org/doi/abs/10.1145/3287560.3287596?casa_token=XD4eHiE2cRUAAAAA:NL11gMa1hGPOUKTAbtXnbVQBDBbjxwcjGECF_i-WC_3g1aBgU1Hbz_f2b4kI_m1in-w__1ztGeHnwHs)

## Citation

**Omics**:
```bib
@article{born2021paccmannrl,
  title = {PaccMann\textsuperscript{RL}: De novo generation of hit-like anticancer molecules from transcriptomic data via reinforcement learning},
  journal = {iScience},
  volume = {24},
  number = {4},
  pages = {102269},
  year = {2021},
  issn = {2589-0042},
  doi = {https://doi.org/10.1016/j.isci.2021.102269},
  url = {https://www.cell.com/iscience/fulltext/S2589-0042(21)00237-6},
  author = {Born, Jannis and Manica, Matteo and Oskooei, Ali and Cadow, Joris and Markert, Greta and {Rodr{\'{i}}guez Mart{\'{i}}nez}, Mar{\'{i}}a}
}
```

**Proteins**:
```bib
@article{born2021datadriven,
  author = {Born, Jannis and Manica, Matteo and Cadow, Joris and Markert, Greta and Mill, Nil Adell and Filipavicius, Modestas and Janakarajan, Nikita and Cardinale, Antonio and Laino, Teodoro and {Rodr{\'{i}}guez Mart{\'{i}}nez}, Mar{\'{i}}a},
  doi = {10.1088/2632-2153/abe808},
  issn = {2632-2153},
  journal = {Machine Learning: Science and Technology},
  number = {2},
  pages = {025024},
  title = {{Data-driven molecular design for discovery and synthesis of novel ligands: a case study on SARS-CoV-2}},
  url = {https://iopscience.iop.org/article/10.1088/2632-2153/abe808},
  volume = {2},
  year = {2021}
}
```
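
## Parameter sketch

The constraints listed under "Model documentation & parameters" can be illustrated with a small, standalone Python sketch. It is not part of GT4SD or PaccMann: `GenerationRequest`, its field names, and its default values are hypothetical and exist only to make the documented rules concrete (sample count between 1 and 50, a 2128-float gene expression embedding, targets only used for `Conditional` inference). The `temperature_softmax` helper shows the generic temperature-scaling technique for token sampling; the actual decoder may apply temperature differently.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence

GENE_EXPRESSION_DIM = 2128  # embedding length stated in the parameter list
MAX_SAMPLES = 50            # upper bound stated in the parameter list


@dataclass
class GenerationRequest:
    """Hypothetical container for the documented parameters (not a GT4SD class)."""

    algorithm_version: str                # e.g. "Protein_v0" or "Omic_v0"
    inference_type: str = "Conditional"   # "Conditional" (default) or "Unbiased"
    protein_target: Optional[str] = None
    gene_expression_target: Optional[Sequence[float]] = None
    temperature: float = 1.0              # assumed default, for illustration only
    max_length: int = 100                 # assumed default, for illustration only
    num_samples: int = 10                 # assumed default, for illustration only

    def validate(self) -> None:
        """Raise ValueError if the request violates the documented constraints."""
        is_protein = self.algorithm_version.startswith("Protein")
        is_omic = self.algorithm_version.startswith("Omic")
        if not (is_protein or is_omic):
            raise ValueError("algorithm_version must be a Protein or Omic model")
        if self.inference_type not in ("Conditional", "Unbiased"):
            raise ValueError("inference_type must be 'Conditional' or 'Unbiased'")
        if not 1 <= self.num_samples <= MAX_SAMPLES:
            raise ValueError(f"num_samples must be between 1 and {MAX_SAMPLES}")
        if self.temperature <= 0:
            raise ValueError("temperature must be positive")
        if self.max_length < 1:
            raise ValueError("max_length must be at least 1")
        if self.inference_type == "Unbiased":
            return  # targets are only used for conditional inference
        if is_protein and not self.protein_target:
            raise ValueError("a protein sequence is required for Protein models")
        if is_omic:
            target = self.gene_expression_target
            if target is None or len(target) != GENE_EXPRESSION_DIM:
                raise ValueError(
                    f"gene_expression_target must hold {GENE_EXPRESSION_DIM} floats"
                )


def temperature_softmax(logits: Sequence[float], temperature: float) -> List[float]:
    """Turn token logits into probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    request = GenerationRequest("Omic_v0", gene_expression_target=[0.0] * 2128)
    request.validate()  # passes: conditional Omic model with a 2128-float target

    toy_logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
    for t in (0.1, 1.0, 5.0):
        print(t, [round(p, 3) for p in temperature_softmax(toy_logits, t)])
```

With the toy logits, a low temperature concentrates almost all probability on the top-scoring token, which is the regime where repeated sampling tends to return the same molecule, while a high temperature flattens the distribution and yields more varied tokens.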