---

license: apache-2.0
library_name: transformers
pipeline_tag: feature-extraction
tags:
- chemistry
---


# selfies-ted

selfies-ted is a project for encoding SMILES (Simplified Molecular Input Line Entry System) into SELFIES (SELF-referencing Embedded Strings) and generating embeddings for molecular representations.

![selfies-ted](selfies-ted.png)
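As background, the SMILES-to-SELFIES conversion itself can be done with the open-source `selfies` package; the snippet below is an illustrative sketch and is not part of this repository's API:

```python
# Illustrative sketch using the `selfies` package
# (https://github.com/aspuru-guzik-group/selfies); not this repository's API.
import selfies as sf

smiles = "CCO"                     # ethanol
selfies_string = sf.encoder(smiles)
print(selfies_string)              # [C][C][O]

# SELFIES is robust by construction: every string decodes to a valid molecule.
print(sf.decoder(selfies_string))  # CCO
```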
## Model Architecture

Configuration details:

- Encoder and decoder FFN dimension: 256
- Number of attention heads: 4
- Number of encoder and decoder layers: 2
- Total number of hidden layers: 6
- Maximum position embeddings: 128
- Model dimension (d_model): 256
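If the underlying network is a BART-style encoder-decoder from `transformers` (an assumption; the card does not name the base class), the hyperparameters above would correspond roughly to a configuration like this sketch:

```python
# Hypothetical sketch only: a BART-style config matching the numbers above.
# The actual base class and remaining defaults used by selfies-ted may differ.
from transformers import BartConfig

config = BartConfig(
    d_model=256,                  # model dimension
    encoder_layers=2,             # encoder layers
    decoder_layers=2,             # decoder layers
    encoder_attention_heads=4,    # attention heads per encoder layer
    decoder_attention_heads=4,    # attention heads per decoder layer
    encoder_ffn_dim=256,          # encoder FFN dimension
    decoder_ffn_dim=256,          # decoder FFN dimension
    max_position_embeddings=128,  # maximum sequence length
)
```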



## Pretrained Models and Training Logs

We provide checkpoints of the selfies-ted model pre-trained on a dataset of molecules curated from PubChem. The pre-trained model shows competitive performance on molecular representation tasks. For model weights, see the "HuggingFace link".

To install and use the pre-trained model:

1. Download the `selfies_ted_model.pkl` file from the "HuggingFace link".
2. Place `selfies_ted_model.pkl` in the `models/` directory, so that the directory structure looks like the following:



```
models/
└── selfies_ted_model.pkl
```



## Installation



To use this project, you'll need to install the required dependencies. We recommend using a virtual environment:



```bash
python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`
```



Install the required dependencies:

```bash
pip install -r requirements.txt
```





## Usage



### Import



```python
import load
```

### Training the Model



To train the model, use the `train.py` script:

```bash
python train.py -f <path_to_your_data_file>
```
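The expected format of the training data file is not documented here; a common convention for such scripts is one SMILES string per line (an assumption — check `train.py` for the actual format):

```
CCO
COC
c1ccccc1
```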





Note: The actual usage may depend on the specific implementation in `load.py`. Please refer to the source code for detailed functionality.



### Load the model and tokenizer

```python
load.load("path/to/checkpoint.pkl")
```

### Encode SMILES strings

```python
smiles_list = ["COC", "CCO"]
embeddings = load.encode(smiles_list)
```
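Putting the steps together (this sketch assumes `load.encode` returns one embedding per input SMILES, e.g. a list or NumPy array; verify against `load.py`):

```python
# Minimal end-to-end sketch; the return type of load.encode is an assumption.
import load

load.load("models/selfies_ted_model.pkl")  # load model and tokenizer
embeddings = load.encode(["COC", "CCO"])   # one embedding per molecule
print(len(embeddings))                     # 2
```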





## Example Notebook



An example notebook for this project is provided in `selfies-ted-example.ipynb`.