---
license: apache-2.0
datasets:
- jxie/guacamol
- AdrianM0/MUV
library_name: transformers
---
## Model Details

We introduce a suite of neural language model tools for pre-training and fine-tuning SMILES-based molecular language models. We also provide recipes for fine-tuning these language models in low-data settings using semi-supervised learning.

### Enumeration-aware Molecular Transformers
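Enumeration-aware Molecular Transformers use contrastive learning, alongside multi-task regression and masked language modelling, as pre-training objectives to inject enumeration knowledge (the fact that different SMILES strings can encode the same molecule) into pre-trained language models.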
#### a. Molecular Domain Adaptation (Contrastive Encoder-based)
##### i. Architecture
![smole bert drawio](https://user-images.githubusercontent.com/6007894/233776921-41667331-1ab7-413c-92f7-4e6fad512f5c.svg)
##### ii. Contrastive Learning
<img width="1418" alt="Screenshot 2023-04-22 at 11 54 23 AM" src="https://user-images.githubusercontent.com/6007894/233777069-439c18cc-77a2-4ae2-a81e-d7e94c30a6be.png">
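
As a rough illustration of the contrastive objective, the sketch below computes an in-batch InfoNCE loss, where each anchor embedding is paired with the embedding of a different SMILES enumeration of the same molecule. This card does not spell out the exact loss, so the SimCSE-style formulation, the temperature of 0.05, and the toy 2-dimensional embeddings are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(anchors, positives, temperature=0.05):
    """Mean InfoNCE loss over a batch.

    positives[i] is the positive for anchors[i] (for example, another
    enumeration of the same molecule); every other row of `positives`
    acts as an in-batch negative. The temperature is an assumed value.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates.
        max_logit = max(logits)
        log_denom = max_logit + math.log(
            sum(math.exp(logit - max_logit) for logit in logits)
        )
        total += log_denom - logits[i]
    return total / len(anchors)


# Toy embeddings (hypothetical): two molecules, two views each.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # views of the same molecule are close
swapped = [[0.1, 0.9], [0.9, 0.1]]   # views are matched to the wrong molecule

loss_aligned = info_nce_loss(anchors, aligned)
loss_swapped = info_nce_loss(anchors, swapped)
```

The loss is small when each anchor is closest to its own enumeration and large when the pairing is wrong, which is the signal that pushes enumerations of one molecule together in embedding space.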

#### b. Canonicalization Encoder-decoder (Denoising Encoder-decoder)
<img width="702" alt="Screenshot 2023-04-22 at 11 43 06 AM" src="https://user-images.githubusercontent.com/6007894/233776512-ab6cdeef-02f1-4076-9b76-b228cbf26456.png">

### Pretraining steps for this model:

- Pretrain a BERT model with masked language modeling, with the masked proportion set to 15%, on the Guacamol dataset. For more details, please see our [GitHub repository](https://github.com/uds-lsv/enumeration-aware-molecule-transformers).
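
The masking step can be sketched as follows. This is a simplified illustration, not the training code: it assumes character-level tokenization (which would split multi-character atoms such as `Cl`), masks a fixed 15% of positions rather than sampling each token independently, and always substitutes `[MASK]` instead of using BERT's 80/10/10 replacement scheme. The helper name `mask_tokens` is hypothetical.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Replace roughly `mask_prob` of the tokens with MASK_TOKEN.

    Returns (masked_tokens, labels), where labels[i] holds the original
    token at masked positions and None everywhere else.
    """
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    masked_positions = rng.sample(range(len(tokens)), n_to_mask)

    masked = list(tokens)
    labels = [None] * len(tokens)
    for pos in masked_positions:
        labels[pos] = tokens[pos]   # target the model must predict
        masked[pos] = MASK_TOKEN    # input the model actually sees
    return masked, labels


smiles = "CC(=O)Oc1ccccc1C(=O)O"   # aspirin
tokens = list(smiles)               # assumption: one token per character
masked, labels = mask_tokens(tokens, seed=0)
```

During pre-training the model only receives `masked` and is scored on how well it recovers the non-`None` entries of `labels`.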

### Virtual Screening Benchmark ([GitHub repository](https://github.com/MoleculeTransformers/rdkit-benchmarking-platform-transformers))

Original version presented in:
S. Riniker, G. Landrum, J. Cheminf., 5, 26 (2013),
DOI: 10.1186/1758-2946-5-26,
URL: http://www.jcheminf.com/content/5/1/26

Extended version presented in:
S. Riniker, N. Fechner, G. Landrum, J. Chem. Inf. Model., 53, 2829 (2013),
DOI: 10.1021/ci400466r,
URL: http://pubs.acs.org/doi/abs/10.1021/ci400466r
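
For readers unfamiliar with the metrics reported in this card, the sketch below shows how AUROC can be computed for a single virtual-screening query: it is the probability that a randomly chosen active is ranked above a randomly chosen inactive. The scores and labels are made-up toy values, and the pairwise implementation is chosen for readability rather than speed; it is not the benchmarking platform's code.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores: similarity of each library molecule to the query (higher = more similar)
    labels: 1 for actives, 0 for inactives (decoys)
    Ties between an active and an inactive count as half a win.
    """
    actives = [s for s, y in zip(scores, labels) if y == 1]
    inactives = [s for s, y in zip(scores, labels) if y == 0]
    if not actives or not inactives:
        raise ValueError("need at least one active and one inactive")

    wins = 0.0
    for a in actives:
        for d in inactives:
            if a > d:
                wins += 1.0
            elif a == d:
                wins += 0.5
    return wins / (len(actives) * len(inactives))


# Toy ranking (hypothetical values): three actives, three inactives.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 0, 1, 0, 0, 1]
value = auroc(scores, labels)  # 5 of the 9 active/inactive pairs are ordered correctly
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect separation of actives from inactives. BEDROC additionally weights early retrieval more heavily and is not covered by this sketch.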

## Model List

Our released models are listed below. You can import these models by using the `smiles-featurizers` package or using [HuggingFace's Transformers](https://github.com/huggingface/transformers).

| Model | Type | AUROC | BEDROC |
|:-------------------------------|:--------:|:--------:|:--------:|
| [UdS-LSV/smole-bert](https://huggingface.co/UdS-LSV/smole-bert) | `Bert`|0.615 | 0.225 |
| [UdS-LSV/smole-bert-mtr](https://huggingface.co/UdS-LSV/smole-bert-mtr) | `Bert`|0.621 | 0.262 |
| [UdS-LSV/smole-bart](https://huggingface.co/UdS-LSV/smole-bart) | `Bart`|0.660 | 0.263 |
| [UdS-LSV/muv2x-simcse-smole-bert](https://huggingface.co/UdS-LSV/muv2x-simcse-smole-bert) | `Simcse`|0.697 | 0.270 |
| [UdS-LSV/siamese-smole-bert-muv-1x](https://huggingface.co/UdS-LSV/siamese-smole-bert-muv-1x) | `SentenceTransformer`|0.673 | 0.274 |
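
A minimal sketch of loading one of the models above with Transformers and turning SMILES strings into fixed-size embeddings is shown below. It assumes the checkpoint can be loaded through the generic `AutoTokenizer` and `AutoModel` classes and that `torch` is installed; the helper name `embed_smiles` and the choice of mean pooling over non-padding tokens are illustrative assumptions, not something this card prescribes.

```python
def embed_smiles(smiles_list, model_name="UdS-LSV/smole-bert"):
    """Return one mean-pooled embedding per SMILES string.

    Assumption: the checkpoint works with AutoTokenizer/AutoModel.
    The first call downloads the weights from the Hugging Face Hub.
    """
    # Imported lazily so the function can be defined without these packages.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    batch = tokenizer(smiles_list, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state   # (batch, seq_len, hidden)

    # Mean pooling that ignores padding positions (an assumed pooling choice).
    mask = batch["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)


# Example call (requires network access, so it is left commented out):
# embeddings = embed_smiles(["CCO", "c1ccccc1"])
# print(embeddings.shape)
```

The resulting embeddings can then be compared, for example with cosine similarity, to rank library molecules against a query in a virtual-screening setting.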