README.md · yerevann/chemlactica-125m at main

metadata

license: cc-by-nc-4.0
language:
  - en
library_name: transformers
tags:
  - chemistry
  - biology

Chemlactica-125m is a galactica-125m model continually pretrained on organic molecules. The pretraining corpus contains 40B tokens covering 110M+ molecules from PubChem, together with their chemical properties (molecular weight, synthetic accessibility score, drug-likeness, etc.) and pairwise similarities (Tanimoto similarity between ECFP fingerprints).

Example prompts:

</s>[START_SMILES]CC(=O)OC1=CC=CC=C1C(=O)O[END_SMILES][SAS] will attempt to predict the synthetic accessibility score of the given molecule.

</s>[SAS]2.25[/SAS][SIMILAR]0.62 CC(=O)OC1=CC=CC=C1C(=O)O[/SIMILAR][START_SMILES] will attempt to generate a molecule with an SAS score of 2.25 and a similarity of 0.62 to the given molecule.
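The two prompt shapes above can be assembled and sent to the model roughly as follows. This is a minimal sketch, not an official usage snippet: the helper names (`property_prompt`, `generation_prompt`, `complete`) are hypothetical, the repository id is taken from this page's title, and both greedy decoding and `add_special_tokens=False` are assumptions rather than documented settings.

```python
def property_prompt(smiles: str, tag: str = "SAS") -> str:
    """Prompt that asks the model to continue with the value of `tag`."""
    return f"</s>[START_SMILES]{smiles}[END_SMILES][{tag}]"


def generation_prompt(sas: float, similarity: float, reference_smiles: str) -> str:
    """Prompt that asks for a molecule with a target SAS and similarity."""
    return (
        f"</s>[SAS]{sas:.2f}[/SAS]"
        f"[SIMILAR]{similarity:.2f} {reference_smiles}[/SIMILAR]"
        "[START_SMILES]"
    )


def complete(
    prompt: str,
    model_id: str = "yerevann/chemlactica-125m",
    max_new_tokens: int = 50,
) -> str:
    """Run one greedy completion and return the decoded text."""
    # Imported lazily so the prompt helpers work without these packages.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    # Assumption: the prompt already carries its own </s> prefix, so the
    # tokenizer is told not to add any special tokens of its own.
    input_ids = tokenizer(
        prompt, return_tensors="pt", add_special_tokens=False
    ).input_ids
    output = model.generate(input_ids, max_new_tokens=max_new_tokens, do_sample=False)
    # Special tokens are kept so that closing tags stay visible in the text.
    return tokenizer.decode(output[0])
```

Calling `complete(property_prompt("CC(=O)OC1=CC=CC=C1C(=O)O"))` would download the weights and return the prompt followed by the model's continuation. Sampling instead of greedy decoding is likely a better fit when generating molecules.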

The model can be wrapped into an optimization loop to traverse the chemical space with evolving prompts. See the code on GitHub.
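To illustrate what "evolving prompts" could mean in practice, here is a minimal sketch of a pool-based loop. It is not the algorithm from the preprint or the GitHub code. The names `optimize`, `generate`, and `oracle` are hypothetical: `generate` stands for any function that turns a prompt into a candidate SMILES string, and `oracle` for any scoring function where higher is better.

```python
import random
from typing import Callable, List, Optional, Tuple


def optimize(
    generate: Callable[[str], Optional[str]],
    oracle: Callable[[str], float],
    seed_smiles: List[str],
    rounds: int = 10,
    pool_size: int = 5,
    similarity: float = 0.62,
    rng: Optional[random.Random] = None,
) -> List[Tuple[float, str]]:
    """Keep a pool of the best molecules and prompt with one of them each round."""
    rng = rng or random.Random(0)
    # Pool entries are (score, smiles), best first.
    pool = sorted(((oracle(s), s) for s in seed_smiles), reverse=True)[:pool_size]
    seen = {smiles for _, smiles in pool}
    for _ in range(rounds):
        _, parent = rng.choice(pool)
        # The prompt evolves because the reference molecule comes from the pool.
        prompt = f"</s>[SIMILAR]{similarity:.2f} {parent}[/SIMILAR][START_SMILES]"
        candidate = generate(prompt)
        if not candidate or candidate in seen:
            continue  # skip failed generations and duplicates
        seen.add(candidate)
        pool.append((oracle(candidate), candidate))
        pool = sorted(pool, reverse=True)[:pool_size]
    return pool
```

In a real setup `generate` would call the model and extract the SMILES between `[START_SMILES]` and `[END_SMILES]`, and `oracle` would be the property being optimized.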

A preprint describing the model, along with an optimization algorithm built on top of it that sets a new state of the art on Practical Molecular Optimization and other benchmarks, is available on arXiv.

A few notes:

All queries should start with the </s> symbol.
All numbers are rounded to two decimal places.
All SMILES are canonicalized using RDKit.
Available tags: [CLOGP], [WEIGHT], [QED], [SAS], [TPSA], [RINGCOUNT], [SIMILAR]...
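The notes above can be checked before a query is sent to the model. The sketch below is only an illustration: `canonicalize` and `check_query` are hypothetical helper names, default RDKit settings are assumed for canonicalization, and the number check only looks at decimal values that directly follow a closing bracket.

```python
import re
from typing import List


def canonicalize(smiles: str) -> str:
    """Return RDKit's canonical SMILES for the given molecule."""
    # Imported lazily so check_query works without RDKit installed.
    from rdkit import Chem

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"RDKit could not parse SMILES: {smiles!r}")
    return Chem.MolToSmiles(mol)


def check_query(query: str) -> List[str]:
    """List the ways in which a query departs from the notes above."""
    problems = []
    if not query.startswith("</s>"):
        problems.append("query does not start with </s>")
    # Decimal numbers right after a tag, e.g. "[SAS]2.25" or "[SIMILAR]0.62".
    for number in re.findall(r"\](\d+\.\d+)", query):
        if len(number.split(".")[1]) != 2:
            problems.append(f"{number} is not rounded to two decimal places")
    return problems
```

For example, `check_query("[SAS]2.5[/SAS][START_SMILES]")` reports both a missing `</s>` prefix and a badly rounded number.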

The model is part of a family of three models: Chemlactica-125M, Chemlactica-1.3B and Chemma-2B.

We look forward to seeing the community use the model in new applications and contexts.