# Model documentation & parameters
## Parameters
### Property
The supported properties are:
- `Metal NonMetal Classifier`: Classifies whether a material is a metal or a non-metal. Predicted by a random forest (RF) model (ToDo: add a reference for the specific model).
- `Metal Semiconductor Classifier`: Classifies whether a metal could be a semiconductor. Predicted with CGCNN ([Xie & Grossman, 2018](https://arxiv.org/abs/1710.10324)).
- `Poisson Ratio`: The negative ratio of transverse to axial strain under uniaxial stress.
- `Shear Moduli`: The resistance of the material to shear deformation.
- `Bulk Moduli`: The resistance of the material to uniform compression.
- `Fermi Energy`: The energy of the highest occupied electronic state.
- `Band Gap`: The energy difference between the valence and conduction bands.
- `Absolute Energy`: The total energy of the crystal.
- `Formation Energy`: The energy required to form the crystal from its constituent elements.
### Input file for crystal model
The input file describes the material(s) to predict. Its format depends on the property you want to predict:
- `Metal NonMetal Classifier`: Requires a single `.csv` file with the chemical formula of the material in the first column and its crystal system in the second.
- **All others**: Predicted with CGCNN. The input is either a single `.cif` file (to predict one material) or a `.zip` archive containing multiple `.cif` files (for batch prediction); see the sketch after this list.
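To make the two formats concrete, here is a minimal Python sketch of how such inputs could be prepared. The file names (`metals.csv`, `structures.zip`, `my_cifs/`) are placeholders, and the exact header expectations of the app may differ:

```python
import csv
import zipfile
from pathlib import Path

# Input for the Metal NonMetal Classifier: one row per material,
# chemical formula in the first column, crystal system in the second.
rows = [
    ("Fe2O3", "trigonal"),
    ("NaCl", "cubic"),
]
with open("metals.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

# Input for the CGCNN-backed properties: a single .cif file can be
# uploaded as-is; for batch prediction, bundle several .cif files
# into one .zip archive.
with zipfile.ZipFile("structures.zip", "w") as archive:
    for cif in Path("my_cifs").glob("*.cif"):
        archive.write(cif, arcname=cif.name)
```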
# Model card - CGCNN
**Model Details**: [CGCNN](https://arxiv.org/abs/1710.10324) (Crystal Graph Convolutional Neural Network) learns material properties directly from the crystal structure. Each crystal is encoded as a graph whose nodes are atoms and whose edges connect neighboring atoms; stacked graph convolutions build atom representations that are pooled into a crystal representation, from which the target property is predicted.
**Developers**: Tian Xie and Jeffrey C. Grossman from MIT.
**Distributors**: Original authors' code wrapped and distributed by GT4SD Team (2023) from IBM Research.
**Model date**: Preprint released in 2017; published in *Physical Review Letters* in 2018.
**Algorithm version**: Models trained and distributed by the original authors. All models were trained on crystal structures and DFT-computed properties from the [Materials Project](https://materialsproject.org), with a separate pretrained model for each property listed under **Parameters** (formation energy, absolute energy, band gap, Fermi energy, bulk and shear moduli, Poisson ratio, and metal/semiconductor classification).
**Model type**: A graph convolutional neural network that operates on a crystal-graph representation (atoms as nodes, bonds between neighboring atoms as edges) and performs regression or classification of material properties.
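To illustrate what such a crystal-graph convolution looks like, here is a hedged PyTorch sketch of the gated update described in the CGCNN paper. The class name, tensor shapes, and the fixed-neighbor layout are simplifying assumptions for illustration, not the official implementation:

```python
import torch
import torch.nn as nn

class CrystalGraphConv(nn.Module):
    """Sketch of one CGCNN-style convolution: every atom updates its
    feature vector from its neighbors and the connecting bond features."""

    def __init__(self, atom_dim: int, bond_dim: int):
        super().__init__()
        z_dim = 2 * atom_dim + bond_dim          # [v_i, v_j, u_ij] concatenated
        self.gate = nn.Linear(z_dim, atom_dim)   # sigmoid gate (filter)
        self.core = nn.Linear(z_dim, atom_dim)   # candidate update
        self.act = nn.Softplus()

    def forward(self, atom_feats, bond_feats, neighbor_idx):
        # atom_feats:   (N, atom_dim)     per-atom features
        # bond_feats:   (N, M, bond_dim)  features of the M neighbors of each atom
        # neighbor_idx: (N, M)            indices of each atom's M neighbors
        M = neighbor_idx.size(1)
        neighbors = atom_feats[neighbor_idx]                 # (N, M, atom_dim)
        center = atom_feats.unsqueeze(1).expand(-1, M, -1)   # (N, M, atom_dim)
        z = torch.cat([center, neighbors, bond_feats], dim=-1)
        messages = torch.sigmoid(self.gate(z)) * self.act(self.core(z))
        return self.act(atom_feats + messages.sum(dim=1))    # residual update


# Tiny smoke test with random features: 5 atoms, 3 neighbors each.
conv = CrystalGraphConv(atom_dim=16, bond_dim=8)
out = conv(torch.randn(5, 16), torch.randn(5, 3, 8), torch.randint(0, 5, (5, 3)))
```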
**Information about training algorithms, parameters, fairness constraints or other applied approaches, and features**:
All models are trained with a standard supervised objective on DFT-computed labels: regression models minimize a mean-squared-error loss on the property values, and classification models a cross-entropy-style loss. See the [CGCNN](https://arxiv.org/abs/1710.10324) paper for details.
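As a minimal sketch of that objective, with placeholder tensors standing in for model outputs and DFT labels:

```python
import torch
import torch.nn as nn

# Placeholder predictions and DFT-computed labels for eight crystals.
predictions = torch.randn(8, 1, requires_grad=True)
dft_labels = torch.randn(8, 1)

loss = nn.MSELoss()(predictions, dft_labels)  # regression objective
loss.backward()                               # drives the optimizer step
```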
**Paper or other resource for more information**:
The [CGCNN](https://arxiv.org/abs/1710.10324) paper (Xie & Grossman, *Phys. Rev. Lett.* 120, 145301, 2018). See the [source code](https://github.com/txie-93/cgcnn) for details.
**License**: MIT
**Where to send questions or comments about the model**: Open an issue on [GT4SD repository](https://github.com/GT4SD/gt4sd-core).
**Intended Use. Use cases that were envisioned during development**: Materials science research, in particular the prediction of properties of crystalline materials.
**Primary intended uses/users**: Researchers and computational materials scientists using the model for model comparison or research exploration purposes.
**Out-of-scope use cases**: Production-level inference, producing materials with harmful properties.
**Factors**: Not applicable.
**Metrics**: High predictive power for the properties listed above; see the CGCNN paper for per-property performance.
**Datasets**: Crystal structures and DFT-computed properties from the Materials Project, as described under **Algorithm version**.
**Ethical Considerations**: No specific considerations as no private/personal data is involved. Please consult with the authors in case of questions.
**Caveats and Recommendations**: Please consult with original authors in case of questions.
Model card prototype inspired by [Mitchell et al. (2019)](https://dl.acm.org/doi/10.1145/3287560.3287596).
# Model card - RandomForestMetalClassifier
ToDo...
# Citation
```bibtex
@article{manica2022gt4sd,
  title={GT4SD: Generative Toolkit for Scientific Discovery},
  author={Manica, Matteo and Cadow, Joris and Christofidellis, Dimitrios and Dave, Ashish and Born, Jannis and Clarke, Dean and Teukam, Yves Gaetan Nana and Hoffman, Samuel C and Buchan, Matthew and Chenthamarakshan, Vijil and others},
  journal={arXiv preprint arXiv:2207.03928},
  year={2022}
}
```