
Model documentation & parameters

Parameters

Property

The supported properties are:

  • Metal NonMetal Classifier: classifies whether a material is a metal or a non-metal; predicted by a random forest (RF) model (WHICH?)
  • Metal Semiconductor Classifier: classifies whether a material could be a semiconductor; predicted with CGCNN (ToDo: Add Ref!)
  • Poisson Ratio: ToDo: Description + Reference
  • Shear Moduli ...
  • Bulk Moduli
  • Fermi Energy
  • Band Gap
  • Absolute Energy
  • Formation Energy

Input file for crystal model

The file containing information about the material. The required format depends on the property you want to predict:

  • Metal NonMetal Classifier: requires a single .csv file with the chemical formula in the first column and the crystal system in the second.
  • All others: predicted with CGCNN. The input is either a single .cif file (to predict a single material) or a .zip archive containing multiple .cif files (for batch prediction).
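As a sketch of preparing both input types (the CSV header names and file names below are assumptions for illustration, not a documented schema):

```python
import csv
import zipfile
from pathlib import Path

# Metal NonMetal Classifier input: one .csv with the chemical formula
# in the first column and the crystal system in the second
# (header names here are assumed, not taken from the documentation).
rows = [("Fe2O3", "trigonal"), ("NaCl", "cubic")]
with open("metals.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["formula", "crystal_system"])
    writer.writerows(rows)

# CGCNN-based predictors: bundle several .cif files into one .zip
# archive for batch prediction. Placeholder files stand in for real
# CIF structures here.
for name in ["Fe2O3.cif", "NaCl.cif"]:
    Path(name).write_text("# real CIF structure data goes here\n")
with zipfile.ZipFile("structures.zip", "w") as zf:
    for name in ["Fe2O3.cif", "NaCl.cif"]:
        zf.write(name)
```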

Model card - CGCNN

Model Details: The Regression Transformer is a multitask Transformer that reformulates regression as a conditional sequence modeling task. This yields a dichotomous language model that seamlessly integrates property prediction with property-driven conditional generation.
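The reformulation can be illustrated with a minimal sketch. The token scheme below is invented for illustration; the actual Regression Transformer defines its own numerical tokenization:

```python
# Regression as conditional sequence modeling: the property value is
# converted into tokens and concatenated with the molecular tokens,
# so a single language model covers both tasks. Masking the value
# tokens yields property prediction; masking the molecular tokens
# (keeping the property) yields property-driven generation.
def encode(property_name, value, smiles):
    # digit-level tokenization of the numeric value (illustrative)
    value_tokens = [f"_{c}_" for c in f"{value:.2f}"]
    return [f"<{property_name}>"] + value_tokens + ["|"] + list(smiles)

seq = encode("qed", 0.85, "CCO")
```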

Developers: Jannis Born and Matteo Manica from IBM Research.

Distributors: Original authors' code wrapped and distributed by GT4SD Team (2023) from IBM Research.

Model date: Preprint released in 2022, currently under review at Nature Machine Intelligence.

Algorithm version: Models trained and distributed by the original authors.

  • Molecules: QED: Model trained on 1.6M molecules (SELFIES) from ChEMBL and their QED scores.
  • Molecules: Solubility: QED model finetuned on the ESOL dataset from Delaney (2004), J. Chem. Inf. Comput. Sci., to predict water solubility. Model trained on augmented SELFIES.
  • Molecules: USPTO: Model trained on 2.8M chemical reactions from the US patent office. The model used SELFIES and a synthetic property (total molecular weight of all precursors).
  • Molecules: Polymer: Model finetuned on 600 ROPs (ring-opening polymerizations) with monomer-catalyst pairs. Model used three properties: conversion (<conv>), PDI (<pdi>) and Molecular Weight (<molwt>). Model trained with augmented SELFIES, optimized only to generate catalysts, given a monomer and the property constraints. See the example for details.
  • Molecules: Cosmo_acdl: Model finetuned on 56k molecules with two properties (pKa_ACDL and pKa_COSMO). Model used augmented SELFIES.
  • Molecules: Pfas: Model finetuned on ~1k PFAS (Perfluoroalkyl and Polyfluoroalkyl Substances) molecules with 9 properties including some experimentally measured ones (biodegradability, LD50 etc) and some synthetic ones (SCScore, molecular weight). Model trained on augmented SELFIES.
  • Molecules: Logp_and_synthesizability: Model trained on 2.9M molecules (SELFIES) from PubChem with two synthetic properties, the logP (partition coefficient) and the SCScore by Coley et al. (2018); J. Chem. Inf. Model.
  • Molecules: Crippen_logp: Model trained on 2.9M molecules (SMILES) from PubChem, but only on logP (partition coefficient).
  • Proteins: Stability: Model pretrained on 2.6M peptides from UniProt with the Boman index as property. Finetuned on the Stability dataset from the TAPE benchmark which has ~65k samples.

Model type: A Transformer-based language model trained on alphanumeric sequences to perform both sequence regression and conditional sequence generation.

Information about training algorithms, parameters, fairness constraints or other applied approaches, and features: All models are trained with an alternating training scheme that switches between optimizing the cross-entropy loss on the property tokens ("regression") and the self-consistency objective on the molecular tokens. See the Regression Transformer paper for details.
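A minimal sketch of such an alternating scheme (the loss functions and batch structure are placeholders, not the authors' implementation):

```python
import random

def property_loss(batch):
    # placeholder for cross-entropy on masked property tokens
    return random.random()

def self_consistency_loss(batch):
    # placeholder for the objective on masked molecular tokens
    return random.random()

def train(batches, epochs=2):
    """Alternate between the two objectives on successive steps."""
    history = []
    for _epoch in range(epochs):
        for step, batch in enumerate(batches):
            if step % 2 == 0:
                loss, kind = property_loss(batch), "regression"
            else:
                loss, kind = self_consistency_loss(batch), "generation"
            history.append((kind, loss))
    return history

hist = train([["CCO"], ["NaCl"]])
```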

Paper or other resource for more information: The Regression Transformer paper. See the source code for details.

License: MIT

Where to send questions or comments about the model: Open an issue on the GT4SD repository.

Intended Use. Use cases that were envisioned during development: Chemical research, in particular drug discovery.

Primary intended uses/users: Researchers and computational chemists using the model for model comparison or research exploration purposes.

Out-of-scope use cases: Production-level inference, producing molecules with harmful properties.

Factors: Not applicable.

Metrics: High predictive power for the properties of that specific algorithm version.

Datasets: Different ones, as described under Algorithm version.

Ethical Considerations: No specific considerations as no private/personal data is involved. Please consult with the authors in case of questions.

Caveats and Recommendations: Please consult with original authors in case of questions.

Model card prototype inspired by Mitchell et al. (2019)

Model card - RandomForestMetalClassifier

ToDo...

Citation

@article{manica2022gt4sd,
  title={GT4SD: Generative Toolkit for Scientific Discovery},
  author={Manica, Matteo and Cadow, Joris and Christofidellis, Dimitrios and Dave, Ashish and Born, Jannis and Clarke, Dean and Teukam, Yves Gaetan Nana and Hoffman, Samuel C and Buchan, Matthew and Chenthamarakshan, Vijil and others},
  journal={arXiv preprint arXiv:2207.03928},
  year={2022}
}