Model documentation & parameters

Algorithm Version: Which model version (either protein-target-driven or gene-expression-profile-driven) to use and which checkpoint to rely on.

Inference type: Whether the model should be conditioned on the target (default) or used in an unbiased manner, i.e., without conditioning.

Protein target: The amino acid sequence (AAS) of a protein target used for conditioning. Only use if Inference type is Conditional and the Algorithm version is a Protein model.

Gene expression target: A list of 2128 floats representing the embedding of a gene expression profile to be used for conditioning. Only use if Inference type is Conditional and the Algorithm version is an Omic model.

Decoding temperature: The temperature parameter of the SMILES/SELFIES decoder. Higher values lead to more explorative sampling; lower values make sampling increasingly greedy and can culminate in mode collapse.
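
As an illustration of how a decoding temperature typically acts, the following minimal sketch rescales token scores before sampling. It is not taken from the PaccMannRL code base: the function names and the toy scores are assumptions made for this example, and the real decoder may apply the temperature differently.

```python
import math
import random


def temperature_softmax(logits, temperature):
    """Turn raw token scores into probabilities, scaled by a temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng=random):
    """Draw one token index from the temperature-scaled distribution."""
    probs = temperature_softmax(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Made-up scores for three candidate tokens (hypothetical values).
toy_logits = [2.0, 1.0, 0.1]
cold = temperature_softmax(toy_logits, 0.5)  # sharper: mass concentrates on the top token
hot = temperature_softmax(toy_logits, 2.0)   # flatter: lower-ranked tokens are sampled more often
```

With a low temperature almost all probability mass sits on the highest-scoring token, which is why repeated sampling tends to return the same molecule.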

Maximal sequence length: The maximal number of SMILES tokens in the generated molecule.

Number of samples: How many samples should be generated (between 1 and 50).
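
The constraints on the parameters above can be checked before a request is sent to the model. The sketch below is a hypothetical helper, not part of GT4SD or PaccMannRL: the class name, field names, and the default temperature and length are assumptions, and restricting protein targets to the 20 standard amino acids is a simplification. Only the 2128-float profile size and the 1 to 50 sample range come from the descriptions above.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

OMIC_EMBEDDING_SIZE = 2128  # length of the gene expression target stated above
MAX_SAMPLES = 50            # upper bound on the number of samples stated above
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # assumption: standard residues only


@dataclass
class GenerationRequest:
    """Hypothetical container for the parameters described above."""

    model_kind: str                              # "protein" or "omic"
    conditional: bool = True                     # False means unbiased sampling
    protein_target: Optional[str] = None
    omic_target: Optional[Sequence[float]] = None
    temperature: float = 1.0                     # assumed default
    max_length: int = 100                        # assumed default
    n_samples: int = 10

    def validate(self) -> None:
        """Raise ValueError if the parameters contradict the descriptions above."""
        if self.model_kind not in ("protein", "omic"):
            raise ValueError("model_kind must be 'protein' or 'omic'")
        if not 1 <= self.n_samples <= MAX_SAMPLES:
            raise ValueError(f"n_samples must be between 1 and {MAX_SAMPLES}")
        if self.temperature <= 0:
            raise ValueError("temperature must be positive")
        if self.max_length < 1:
            raise ValueError("max_length must be at least 1")
        if not self.conditional:
            return  # unbiased sampling does not use a target
        if self.model_kind == "protein":
            sequence = (self.protein_target or "").strip().upper()
            if not sequence or set(sequence) - AMINO_ACIDS:
                raise ValueError("protein_target must be a non-empty amino acid sequence")
        else:
            profile = [] if self.omic_target is None else self.omic_target
            if len(profile) != OMIC_EMBEDDING_SIZE:
                raise ValueError(f"omic_target must contain {OMIC_EMBEDDING_SIZE} floats")
```

For example, `GenerationRequest(model_kind="omic", omic_target=[0.0] * 2128).validate()` passes, while a profile of any other length raises a `ValueError`.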

Model card -- PaccMannRL

Model Details: PaccMannRL is a language model for conditional molecular design. It consists of a domain-specific encoder (for protein targets or gene expression profiles) and a generic molecular decoder. Both components are finetuned together using RL to convert the context representation into a molecule with high affinity toward the context (i.e., binding affinity to the protein or high inhibitory effect for the cell profile).
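
To make the idea of reward-driven finetuning concrete, here is a toy sketch of a policy-gradient (REINFORCE-style) loss for a single generated sequence. This card does not state which RL algorithm or reward shaping is used, so the loss form, the function name, and all numbers below are assumptions for illustration only; consult the cited papers for the actual training procedure.

```python
import math


def reinforce_loss(token_log_probs, reward, baseline=0.0):
    """Negative reward-weighted log-likelihood of one sampled sequence.

    Minimizing this loss raises the probability of sequences whose reward
    exceeds the baseline and lowers it for sequences below the baseline.
    """
    advantage = reward - baseline
    return -advantage * sum(token_log_probs)


# Hypothetical per-token log-probabilities of one generated molecule.
log_probs = [math.log(0.5), math.log(0.25), math.log(0.8)]

# In the setting described above, the reward would come from an affinity or
# IC50 predictor; here the values are made up.
high = reinforce_loss(log_probs, reward=0.9)
low = reinforce_loss(log_probs, reward=0.1)
```

The same sequence contributes a larger loss, and therefore a larger gradient, when its reward is high, which pushes the generator toward molecules the predictor scores well.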

Developers: Jannis Born, Matteo Manica and colleagues from IBM Research.

Distributors: Original authors' code wrapped and distributed by GT4SD Team (2023) from IBM Research.

Model date: Published in 2021.

Model version: Models trained and distributed by the original authors.

  • Protein_v0: Molecular decoder pretrained on 1.5M molecules from ChEMBL. Protein encoder pretrained on 404k proteins from UniProt. Encoder and decoder finetuned on 41 SARS-CoV-2-related protein targets with a binding affinity predictor trained on BindingDB.
  • Omic_v0: Molecular decoder pretrained on 1.5M molecules from ChEMBL. Gene expression encoder pretrained on 12k gene expression profiles from TCGA. Encoder and decoder finetuned on a few hundred cancer cell profiles from GDSC with an IC50 predictor trained on GDSC.

Model type: A language-based molecular generative model that can be optimized with RL to generate molecules with high affinity toward a context.

Information about training algorithms, parameters, fairness constraints or other applied approaches, and features:

Paper or other resource for more information: See the two publications listed under Citation below.

License: MIT

Where to send questions or comments about the model: Open an issue on the GT4SD repository.

Intended Use. Use cases that were envisioned during development: Chemical research, in particular drug discovery.

Primary intended uses/users: Researchers and computational chemists using the model for model comparison or research exploration purposes.

Out-of-scope use cases: Production-level inference, producing molecules with harmful properties.

Factors: Not applicable.

Metrics: High reward on generating molecules with high affinity toward context.

Datasets: ChEMBL, UniProt, GDSC and BindingDB (see above).

Ethical Considerations: Unclear; please consult the original authors in case of questions.

Caveats and Recommendations: Unclear; please consult the original authors in case of questions.

Model card prototype inspired by Mitchell et al. (2019)

Citation

Omics:

@article{born2021paccmannrl,
  title = {PaccMann\textsuperscript{RL}: De novo generation of hit-like anticancer molecules from transcriptomic data via reinforcement learning},
  journal = {iScience},
  volume = {24},
  number = {4},
  pages = {102269},
  year = {2021},
  issn = {2589-0042},
  doi = {10.1016/j.isci.2021.102269},
  url = {https://www.cell.com/iscience/fulltext/S2589-0042(21)00237-6},
  author = {Born, Jannis and Manica, Matteo and Oskooei, Ali and Cadow, Joris and Markert, Greta and {Rodr{\'{i}}guez Mart{\'{i}}nez}, Mar{\'{i}}a}
}

Proteins:

@article{born2021datadriven,
  author = {Born, Jannis and Manica, Matteo and Cadow, Joris and Markert, Greta and Mill, Nil Adell and Filipavicius, Modestas and Janakarajan, Nikita and Cardinale, Antonio and Laino, Teodoro and {Rodr{\'{i}}guez Mart{\'{i}}nez}, Mar{\'{i}}a},
  doi = {10.1088/2632-2153/abe808},
  issn = {2632-2153},
  journal = {Machine Learning: Science and Technology},
  number = {2},
  pages = {025024},
  title = {{Data-driven molecular design for discovery and synthesis of novel ligands: a case study on SARS-CoV-2}},
  url = {https://iopscience.iop.org/article/10.1088/2632-2153/abe808},
  volume = {2},
  year = {2021}
}