File size: 6,930 Bytes
76abb29
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
---
base_model: ibm/biomed.sm.mv-te-84m
library_name: SmallMoleculeMultiView
license: apache-2.0
tags:
- chemistry
- model_hub_mixin
- molecules
- multiview
- pytorch
- pytorch_model_hub_mixin
---

# ibm/biomed.sm.mv-te-84m-MoleculeNet-ligand_scaffold-MUV-101
**SmallMoleculeMultiView**, multi-view molecular foundation model.

- **Developers:** IBM Research
- **GitHub Repository:** [https://github.com/BiomedSciAI/biomed.multi-view](https://github.com/BiomedSciAI/biomed.multi-view)
- **Paper:** [Multi-view biomedical foundation models for molecule-target and property prediction](https://arxiv.org/abs/TBD)
- **Release Date**: Oct 29th, 2024
- **License:** [apache-2.0](https://www.apache.org/licenses/LICENSE-2.0).

## Model Description

This model contains the implementation of the Multi-view Molecular Embedding with Late Fusion (MMELON) architecture. MMELON combines molecular representations from three views — image, 2-dimensional chemically-bonded graph, and text (SMILES) —to learn a joint embedding that can be finetuned for downstream tasks in chemical and biological property prediction.

It was introduced in the paper [Multi-view biomedical foundation models for molecule-target and property prediction](https://arxiv.org/) by authors and first released in [this repository](https://github.com/BiomedSciAI/biomed.multi-view).

![SmallMoleculeMultiView Overview](https://github.com/BiomedSciAI/biomed.multi-view/docs/overview.png)

* Image Representation: Captures the 2D visual depiction of molecular structures, highlighting features like symmetry, bond angles, and functional groups. Molecular images are generated using RDKit and undergo data augmentation during training to enhance robustness.
* Graph Representation: Encodes molecules as undirected graphs where nodes represent atoms and edges represent bonds. Atom-specific properties (e.g., atomic number, chirality) and bond-specific properties (e.g., bond type, stereochemistry) are embedded using categorical embedding techniques.
* Text Representation: Utilizes SMILES strings to represent chemical structures, tokenized with a custom tokenizer. The sequences are embedded using a transformer-based architecture to capture the sequential nature of the chemical information.

The embeddings from these single-view pre-trained encoders are combined using an attention-based aggregator module. This module learns to weight each view appropriately, producing a unified multi-view embedding. This approach leverages the strengths of each representation to improve performance on downstream predictive tasks.


## Usage

Using `SmallMoleculeMultiView` requires [https://github.com/BiomedSciAI/biomed.multi-view](https://github.com/BiomedSciAI/biomed.multi-view)

## Installation
Follow these steps to set up the `biomed.multi-view` codebase on your system.

### Prerequisites
* Operating System: Linux or macOS
* Python Version: Python 3.11
* Conda: Anaconda or Miniconda installed
* Git: Version control to clone the repository


### Step 1: Set up the project directory
Choose a root directory where you want to install biomed.multi-view. For example:

```bash
export ROOT_DIR=~/biomed-multiview
mkdir -p $ROOT_DIR
```

### Step 2: Install anaconda3
If you have Anconda in your system you can skip this step.
``` bash
cd $ROOT_DIR
# Download the Anaconda installer
wget https://repo.anaconda.com/archive/Anaconda3-2023.03-Linux-x86_64.sh

# Run the installer
bash Anaconda3-2023.03-Linux-x86_64.sh
# After installation, initialize Conda:
source activate $ROOT_DIR/anaconda3/bin/activate
```

#### Step 3: Create and activate a Conda environment
```bash
conda create -y python=3.11 --prefix $ROOT_DIR/envs/biomed-multiview
```
Activate the environment:
```bash
conda activate $ROOT_DIR/envs/biomed-multiview
```

#### Step 4: Clone the repository
Navigate to the project directory and clone the repository:
```bash
mkdir -p $ROOT_DIR/code
cd $ROOT_DIR/code

# Clone the repository using HTTPS
git clone https://github.com/BiomedSciAI/biomed.multi-view.git

# Navigate into the cloned repository
cd biomed.multi-view
```
Note: If you prefer using SSH, ensure that your SSH keys are set up with GitHub and use the following command:
```bash
git clone git@github.com:BiomedSciAI/biomed.multi-view.git
```

#### Step 5: Install package dependencies
Install the package in editable mode along with development dependencies:
``` bash
pip install -e .['dev']
```
Install additional requirements:
``` bash
pip install -r requirements.txt
```

#### Step 6: macOS-Specific instructions (Apple Silicon)
If you are using a Mac with Apple Silicon (M1/M2/M3) and the zsh shell, you may need to disable globbing for the installation command:

``` bash
noglob pip install -e .[dev]
```
Install macOS-specific requirements optimized for Apple’s Metal Performance Shaders (MPS):
```bash
pip install -r requirements-mps.txt
```

#### Step 7:  Installation verification (optional)
Verify that the installation was successful by running unit tests

```bash
python -m unittest bmfm_sm.tests.all_tests
```


### Get embedding example

A simple example:
```python
# Necessary imports
from bmfm_sm.api.smmv_api import SmallMoleculeMultiViewModel
from bmfm_sm.core.data_modules.namespace import LateFusionStrategy

# Load Model
model = SmallMoleculeMultiViewModel.from_pretrained(
    LateFusionStrategy.ATTENTIONAL,
    model_path="ibm/biomed.sm.mv-te-84m",
    huggingface=True
)

# Load Model and get embeddings for a molecule
example_smiles = "CC(C)CC1=CC=C(C=C1)C(C)C(=O)O"
example_emb = SmallMoleculeMultiViewModel.get_embeddings(
    smiles=example_smiles,
    model_path="ibm/biomed.sm.mv-te-84m",
    huggingface=True,
)
print(example_emb.shape)
```

### Get prediction example

``` python
from bmfm_sm.api.smmv_api import SmallMoleculeMultiViewModel
from bmfm_sm.api.dataset_registry import DatasetRegistry

# Initialize the dataset registry
dataset_registry = DatasetRegistry()

# Example SMILES string
example_smiles = "CC(C)CC1=CC=C(C=C1)C(C)C(=O)O"

# Get dataset information for dataset
ds = dataset_registry.get_dataset_info("MUV")

# Load the finetuned model for the dataset
finetuned_model_ds = SmallMoleculeMultiViewModel.from_finetuned(
    ds,
    model_path="ibm/biomed.sm.mv-te-84m-MoleculeNet-ligand_scaffold-MUV-101",
    inference_mode=True,
    huggingface=True
)

# Get predictions
prediction = SmallMoleculeMultiViewModel.get_predictions(
    example_smiles, ds, finetuned_model=finetuned_model_ds
)

print("Prediction:", prediction)
```

##### Output:
```bash
Prediction: {'prediction': [0.85], 'label': None}
```

For more advanced usage, see our detailed examples at: https://github.com/BiomedSciAI/biomed.multi-view


## Citation

If you found our work useful, please consider to give a star to the repo and cite our paper:
```
@article{TBD,
  title={TBD},
  author={IBM Research Team},
  jounal={arXiv preprint arXiv:TBD},
  year={2024}
}
```