---
license: apache-2.0
metrics:
- accuracy
- bleu
pipeline_tag: text2text-generation
tags:
- chemistry
- biology
- medical
- smiles
- iupac
- text-generation-inference
widget:
- text: CCO
  example_title: ethanol
---
# SMILES2IUPAC-canonical-small

SMILES2IUPAC-canonical-small is designed to translate SMILES chemical notation into IUPAC names.

## Model Details

### Model Description

SMILES2IUPAC-canonical-small is based on the mT5 model and is optimized by using separate tokenizers for the encoder and the decoder.
- **Developed by:** Knowledgator Engineering
- **Model type:** Encoder-Decoder with attention mechanism
- **Language(s) (NLP):** SMILES, IUPAC (English)
- **License:** Apache License 2.0

### Model Sources
- **Paper:** coming soon
- **Demo:** [ChemicalConverters](https://huggingface.co/spaces/knowledgator/ChemicalConverters)

## Quickstart
First, install the library:
```commandline
pip install chemical-converters
```
### SMILES to IUPAC
#### Preferred IUPAC style
To choose the preferred IUPAC style, prepend a style token to
your SMILES sequence.

| Style Token | Description                                                                                                 |
|-------------|-------------------------------------------------------------------------------------------------------------|
| `<BASE>`    | The most commonly used name of the substance; sometimes a mixture of the traditional and systematic styles |
| `<SYST>`    | A fully systematic style without trivial names                                                              |
| `<TRAD>`    | A style based on the trivial names of the parts of the substance                                            |

#### To perform simple translation, follow the example:
```python
from chemicalconverters import NamesConverter

converter = NamesConverter(model_name="knowledgator/SMILES2IUPAC-canonical-small")
print(converter.smiles_to_iupac('CCO'))
print(converter.smiles_to_iupac(['<SYST>CCO', '<TRAD>CCO', '<BASE>CCO']))
```
```text
['ethanol']
['ethanol', 'ethanol', 'ethanol']
```
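Because a style token is simply prepended to the SMILES string, inputs for several styles can be built with a small helper. The sketch below is illustrative only: `with_style` and `STYLE_TOKENS` are hypothetical names that are not part of `chemical-converters`; only the three tokens themselves come from this model card.

```python
# Style tokens listed in this model card.
STYLE_TOKENS = ("<BASE>", "<SYST>", "<TRAD>")


def with_style(smiles: str, style: str = "<BASE>") -> str:
    """Prepend a style token to a SMILES string (hypothetical helper)."""
    if style not in STYLE_TOKENS:
        raise ValueError(f"unknown style token: {style!r}")
    return f"{style}{smiles}"


# Build one input per style for the same molecule.
inputs = [with_style("CCO", style) for style in STYLE_TOKENS]
print(inputs)  # ['<BASE>CCO', '<SYST>CCO', '<TRAD>CCO']
```

The resulting list can be passed to `converter.smiles_to_iupac` in the same way as the hand-written list in the example above.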
#### Processing in batches:
```python
from chemicalconverters import NamesConverter

converter = NamesConverter(model_name="knowledgator/SMILES2IUPAC-canonical-small")
print(converter.smiles_to_iupac(["<BASE>C=CC=C" for _ in range(10)], num_beams=1, 
                                process_in_batch=True, batch_size=1000))
```
```text
['buta-1,3-diene', 'buta-1,3-diene'...]
```
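Conceptually, batch processing splits the input list into chunks of at most `batch_size` items and translates one chunk at a time. The sketch below shows only that chunking step under this assumption; `iter_batches` is a hypothetical helper and not the library's internal implementation.

```python
from typing import Iterator, List


def iter_batches(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


smiles = ["<BASE>C=CC=C"] * 10
batch_sizes = [len(batch) for batch in iter_batches(smiles, batch_size=4)]
print(batch_sizes)  # [4, 4, 2]
```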
#### Validating SMILES to IUPAC translations
You can validate a translation by translating the predicted IUPAC name back into SMILES
and calculating the Tanimoto similarity between the fingerprints of the two molecules.
```python
from chemicalconverters import NamesConverter

converter = NamesConverter(model_name="knowledgator/SMILES2IUPAC-canonical-small")
print(converter.smiles_to_iupac('CCO', validate=True))
```
```text
['ethanol'] 1.0
```
The higher the Tanimoto similarity, the higher the probability that the prediction is correct.
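
For reference, the Tanimoto similarity of two fingerprints is the number of bits set in both divided by the number of bits set in either. The sketch below computes it on plain Python sets of bit indices; the fingerprint values are made up for illustration, and the fingerprint type that the library uses internally is not specified in this card.

```python
from typing import Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    union = a | b
    if not union:
        # Assumption: two empty fingerprints are treated as identical.
        return 1.0
    return len(a & b) / len(union)


fp_input = {1, 4, 7, 9}       # made-up fingerprint of the input molecule
fp_roundtrip = {1, 4, 7, 12}  # made-up fingerprint of the round-trip molecule

print(tanimoto(fp_input, fp_input))      # 1.0
print(tanimoto(fp_input, fp_roundtrip))  # 0.6
```

A score of 1.0 means the two fingerprints are identical; lower scores indicate that the round-trip molecule differs from the input.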

You can also run the validation manually:
```python
from chemicalconverters import NamesConverter

validation_model = NamesConverter(model_name="knowledgator/IUPAC2SMILES-canonical-base")
print(NamesConverter.validate_iupac(input_sequence='CCO', predicted_sequence='CCO', validation_model=validation_model))
```
```text
1.0
```

## Bias, Risks, and Limitations

This model has limited accuracy on large molecules and currently does not support isomeric or isotopic SMILES.

### Training Procedure

The model was trained on 100M SMILES-IUPAC pairs for 2 epochs with a learning rate of 0.0003 and a batch size of 1024.

## Evaluation

| Model                               | Accuracy | BLEU-4 score | Size (MB) |
|-------------------------------------|----------|--------------|-----------|
| SMILES2IUPAC-canonical-small        | 75%      | 0.93         | 23        |
| SMILES2IUPAC-canonical-base         | 86.9%    | 0.964        | 180       |
| STOUT V2.0\*                        | 66.65%   | 0.92         | 128       |
| STOUT V2.0 (according to our tests) |          | 0.89         | 128       |

\*According to the original paper: https://jcheminf.biomedcentral.com/articles/10.1186/s13321-021-00512-4
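
This card does not state how accuracy was computed. One common choice for translation tasks of this kind is exact-match accuracy, the fraction of predictions identical to the reference name. The sketch below assumes that definition and uses made-up example strings; `exact_match_accuracy` is a hypothetical helper, not part of `chemical-converters`.

```python
from typing import Sequence


def exact_match_accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions that equal their reference after trimming whitespace."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one example is required")
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)


predicted = ["ethanol", "buta-1,3-diene", "propan-1-ol", "benzene"]
reference = ["ethanol", "buta-1,3-diene", "propan-2-ol", "benzene"]
print(exact_match_accuracy(predicted, reference))  # 0.75
```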

## Citation
Coming soon.

## Model Card Authors

[Mykhailo Shtopko](https://huggingface.co/BioMike)

## Model Card Contact

info@knowledgator.com