Chemlactica-125m is a galactica-125m model that has been continually pretrained on organic molecules. The pretraining corpus has 40B tokens covering 110M+ molecules from PubChem, together with their chemical properties (molecular weight, synthetic accessibility score, drug-likeness, etc.) and pairwise similarities (Tanimoto similarity between ECFP fingerprints).
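
As a rough illustration of the Tanimoto comparison mentioned above, the sketch below computes the Tanimoto (Jaccard) coefficient between two fingerprints represented as sets of on-bit indices. The function name and the bit indices are made up for illustration; the fingerprint radius and size used for pretraining are not stated here, so no real ECFP computation is attempted.

```python
def tanimoto_similarity(bits_a, bits_b):
    """Tanimoto (Jaccard) coefficient between two sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        # Two empty fingerprints: return 0.0 (a convention chosen for this sketch).
        return 0.0
    return len(a & b) / len(union)


# Hypothetical on-bit indices, not real ECFP output.
fp_query = {1, 5, 9, 12, 40}
fp_candidate = {1, 5, 9, 33}

# 3 shared bits out of 6 distinct bits.
similarity = tanimoto_similarity(fp_query, fp_candidate)
print(f"{similarity:.2f}")
```

The corresponding Tanimoto distance is one minus this value.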

Example prompts:

</s>[START_SMILES]CC(=O)OC1=CC=CC=C1C(=O)O[END_SMILES][SAS] will attempt to predict the synthetic accessibility score of the given molecule.

</s>[SAS]2.25[/SAS][SIMILAR]0.62 CC(=O)OC1=CC=CC=C1C(=O)O[/SIMILAR][START_SMILES] will attempt to generate a molecule that has an SAS score of 2.25 and a similarity of 0.62 to the given molecule.
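
The two prompt shapes above can be assembled programmatically. The following is a minimal sketch, not the authors' code: the helper names (`fmt`, `prediction_prompt`, `generation_prompt`, `complete`) are hypothetical, the tag order simply mirrors the example above, and it is an assumption that every numeric value, including integer-valued ones, is written with two decimals. The `complete` function shows one plausible way to run a prompt with the `transformers` library, assuming the weights are published under the repository id `yerevann/chemlactica-125m`; it is defined but not called here.

```python
def fmt(value):
    """Format a number with two decimals."""
    return f"{value:.2f}"


def prediction_prompt(smiles, tag):
    """Prompt asking the model to fill in one property of a given molecule."""
    return f"</s>[START_SMILES]{smiles}[END_SMILES][{tag}]"


def generation_prompt(properties, similar=None):
    """Prompt asking the model to generate a molecule.

    properties: dict such as {"SAS": 2.25}
    similar: optional list of (similarity, smiles) pairs
    """
    parts = ["</s>"]
    for tag, value in properties.items():
        parts.append(f"[{tag}]{fmt(value)}[/{tag}]")
    for score, smiles in similar or []:
        parts.append(f"[SIMILAR]{fmt(score)} {smiles}[/SIMILAR]")
    parts.append("[START_SMILES]")
    return "".join(parts)


def complete(prompt, max_new_tokens=50):
    """Sketch of running a prompt through transformers (assumed repo id)."""
    # Imported lazily so the prompt helpers work without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    repo = "yerevann/chemlactica-125m"  # assumption: Hugging Face repository id
    tokenizer = AutoTokenizer.from_pretrained(repo)
    model = AutoModelForCausalLM.from_pretrained(repo)
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0])


aspirin = "CC(=O)OC1=CC=CC=C1C(=O)O"
print(prediction_prompt(aspirin, "SAS"))
print(generation_prompt({"SAS": 2.25}, [(0.62, aspirin)]))
```

The two printed strings reproduce the example prompts above character for character, which is a quick way to check that a prompt builder follows the expected format.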

The model can be wrapped into an optimization loop to traverse the chemical space with evolving prompts. See the code on GitHub.

A preprint is available on arXiv. It describes the model in detail, along with an optimization algorithm built on top of it that sets a new state of the art on Practical Molecular Optimization and other benchmarks.

A few notes:

  • All queries should start with the </s> symbol.
  • All numbers are rounded to two decimal places.
  • All SMILES are canonicalized using RDKit.
  • Available tags: [CLOGP], [WEIGHT], [QED], [SAS], [TPSA], [RINGCOUNT], [SIMILAR]...
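
The notes above suggest normalizing inputs before prompting and reading tagged values back out of the generated text. The sketch below is one way to do that under stated assumptions: the helper names are hypothetical, `canonical_smiles` relies on RDKit's default canonical SMILES output (which may or may not match the exact settings used for pretraining), and the parsers assume the model closes a value with the matching `[/TAG]` and a molecule with `[END_SMILES]`, as in the example prompts. The sample text is hand-written, not real model output.

```python
import re


def canonical_smiles(smiles):
    """Canonicalize a SMILES string with RDKit's default settings."""
    from rdkit import Chem  # imported lazily; requires RDKit to be installed

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"RDKit could not parse SMILES: {smiles!r}")
    return Chem.MolToSmiles(mol)


def read_property(text, tag):
    """Return the number between [TAG] and [/TAG] in text, or None if absent."""
    name = re.escape(tag)
    match = re.search(rf"\[{name}\]\s*(-?\d+(?:\.\d+)?)\s*\[/{name}\]", text)
    return float(match.group(1)) if match else None


def read_smiles(text):
    """Return the first SMILES between [START_SMILES] and [END_SMILES], or None."""
    match = re.search(r"\[START_SMILES\](.*?)\[END_SMILES\]", text)
    return match.group(1) if match else None


# Hand-written example of what a decoded completion might look like.
sample = "</s>[START_SMILES]CC(=O)OC1=CC=CC=C1C(=O)O[END_SMILES][SAS]2.25[/SAS]"
print(read_smiles(sample))
print(read_property(sample, "SAS"))
```

Returning `None` for a missing tag, rather than raising, makes it easier to filter out malformed generations when many completions are sampled.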

The model is part of a three-model family: Chemlactica-125M, Chemlactica-1.3B and Chemma-2B.

We look forward to seeing the community use the model in new applications and contexts.
