thrumbel committed on
Commit
d29c533
1 Parent(s): b4e48c0

Push model using huggingface_hub.

Files changed (3)
  1. README.md +202 -0
  2. config.json +18 -0
  3. model.safetensors +3 -0
README.md ADDED
@@ -0,0 +1,202 @@
+ ---
+ base_model: ibm/biomed.sm.mv-te-84m
+ library_name: SmallMoleculeMultiView
+ license: apache-2.0
+ tags:
+ - chemistry
+ - model_hub_mixin
+ - molecules
+ - multiview
+ - pytorch
+ - pytorch_model_hub_mixin
+ ---
+
+ # ibm/biomed.sm.mv-te-84m-MoleculeNet-ligand_scaffold-BACE-101
+ **SmallMoleculeMultiView**, a multi-view molecular foundation model.
+
+ - **Developers:** IBM Research
+ - **GitHub Repository:** [https://github.com/BiomedSciAI/biomed.multi-view](https://github.com/BiomedSciAI/biomed.multi-view)
+ - **Paper:** [Multi-view biomedical foundation models for molecule-target and property prediction](https://arxiv.org/abs/TBD)
+ - **Release Date:** Oct 29th, 2024
+ - **License:** [apache-2.0](https://www.apache.org/licenses/LICENSE-2.0)
+
+ ## Model Description
+
+ This model implements the Multi-view Molecular Embedding with Late Fusion (MMELON) architecture. MMELON combines molecular representations from three views (image, two-dimensional chemically bonded graph, and SMILES text) to learn a joint embedding that can be fine-tuned for downstream tasks in chemical and biological property prediction.
+
+ It was introduced in the paper [Multi-view biomedical foundation models for molecule-target and property prediction](https://arxiv.org/) and first released in [this repository](https://github.com/BiomedSciAI/biomed.multi-view).
+
+ ![SmallMoleculeMultiView Overview](https://github.com/BiomedSciAI/biomed.multi-view/docs/overview.png)
+
+ The three views are:
+
+ * Image Representation: Captures the 2D visual depiction of molecular structures, highlighting features like symmetry, bond angles, and functional groups. Molecular images are generated using RDKit and undergo data augmentation during training to enhance robustness.
+ * Graph Representation: Encodes molecules as undirected graphs where nodes represent atoms and edges represent bonds. Atom-specific properties (e.g., atomic number, chirality) and bond-specific properties (e.g., bond type, stereochemistry) are embedded using categorical embedding techniques.
+ * Text Representation: Uses SMILES strings to represent chemical structures, tokenized with a custom tokenizer. The sequences are embedded using a transformer-based architecture to capture the sequential nature of the chemical information.
+
+ The embeddings from these single-view pre-trained encoders are combined using an attention-based aggregator module. This module learns to weight each view appropriately, producing a unified multi-view embedding. This approach leverages the strengths of each representation to improve performance on downstream predictive tasks.
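The attention-based aggregation described above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the repository's aggregator: the `aggregate_views` function, the `gate_vector` parameter, and the dot-product scoring are hypothetical, and the toy 4-dimensional vectors stand in for real encoder outputs.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_views(view_embeddings, gate_vector):
    """Fuse per-view embeddings with attention-style weights.

    view_embeddings: dict of view name -> embedding (equal-length lists)
    gate_vector: hypothetical learned vector used to score each view
    """
    names = list(view_embeddings)
    # Score each view by a dot product with the gate vector (an assumption).
    scores = [
        sum(x * g for x, g in zip(view_embeddings[name], gate_vector))
        for name in names
    ]
    weights = softmax(scores)
    dim = len(gate_vector)
    # Weighted sum of the view embeddings, one dimension at a time.
    fused = [
        sum(w * view_embeddings[name][i] for w, name in zip(weights, names))
        for i in range(dim)
    ]
    return fused, dict(zip(names, weights))


# Toy embeddings; real encoders emit much larger vectors.
views = {
    "image": [0.2, 0.1, 0.0, 0.4],
    "graph": [0.5, 0.3, 0.1, 0.0],
    "text": [0.1, 0.9, 0.2, 0.3],
}
fused, weights = aggregate_views(views, gate_vector=[1.0, 0.5, 0.0, 0.25])
print(weights)
print(fused)
```

The weights always sum to one, so the fused vector stays on the same scale as the individual view embeddings regardless of how many views contribute.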
+
+ ## Usage
+
+ Using `SmallMoleculeMultiView` requires the `biomed.multi-view` codebase: [https://github.com/BiomedSciAI/biomed.multi-view](https://github.com/BiomedSciAI/biomed.multi-view)
+
+ ## Installation
+ Follow these steps to set up the `biomed.multi-view` codebase on your system.
+
+ ### Prerequisites
+ * Operating System: Linux or macOS
+ * Python Version: Python 3.11
+ * Conda: Anaconda or Miniconda installed
+ * Git: Version control to clone the repository
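One way to check the prerequisites listed above is a short standard-library script. The `check_prerequisites` helper is hypothetical and not part of `biomed.multi-view`; it only tests whether the tools are discoverable on `PATH` and whether the running interpreter is Python 3.11.

```python
import platform
import shutil
import sys


def check_prerequisites():
    """Return a dict mapping each prerequisite to True or False."""
    return {
        # platform.system() reports "Darwin" on macOS.
        "os": platform.system() in {"Linux", "Darwin"},
        "python_3_11": sys.version_info[:2] == (3, 11),
        "conda": shutil.which("conda") is not None,
        "git": shutil.which("git") is not None,
    }


for name, ok in check_prerequisites().items():
    print(f"{name}: {'ok' if ok else 'missing'}")
```

The Python check is expected to fail if the script runs outside the Conda environment that the installation steps create with `python=3.11`.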
+
+ ### Step 1: Set up the project directory
+ Choose a root directory where you want to install `biomed.multi-view`. For example:
+
+ ```bash
+ export ROOT_DIR=~/biomed-multiview
+ mkdir -p $ROOT_DIR
+ ```
+
+ ### Step 2: Install Anaconda
+ If Anaconda is already installed on your system, you can skip this step.
+ ```bash
+ cd $ROOT_DIR
+ # Download the Anaconda installer (Linux x86_64 build)
+ wget https://repo.anaconda.com/archive/Anaconda3-2023.03-Linux-x86_64.sh
+
+ # Run the installer; when prompted for a location, choose $ROOT_DIR/anaconda3
+ bash Anaconda3-2023.03-Linux-x86_64.sh
+ # After installation, initialize Conda:
+ source $ROOT_DIR/anaconda3/bin/activate
+ ```
+
+ ### Step 3: Create and activate a Conda environment
+ ```bash
+ conda create -y python=3.11 --prefix $ROOT_DIR/envs/biomed-multiview
+ ```
+ Activate the environment:
+ ```bash
+ conda activate $ROOT_DIR/envs/biomed-multiview
+ ```
+
+ ### Step 4: Clone the repository
+ Navigate to the project directory and clone the repository:
+ ```bash
+ mkdir -p $ROOT_DIR/code
+ cd $ROOT_DIR/code
+
+ # Clone the repository using HTTPS
+ git clone https://github.com/BiomedSciAI/biomed.multi-view.git
+
+ # Navigate into the cloned repository
+ cd biomed.multi-view
+ ```
+ Note: If you prefer using SSH, ensure that your SSH keys are set up with GitHub and use the following command:
+ ```bash
+ git clone git@github.com:BiomedSciAI/biomed.multi-view.git
+ ```
+
+ ### Step 5: Install package dependencies
+ Install the package in editable mode along with development dependencies:
+ ```bash
+ # Quote the argument so the shell does not treat the brackets as a glob pattern
+ pip install -e ".[dev]"
+ ```
+ Install additional requirements:
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ ### Step 6: macOS-specific instructions (Apple Silicon)
+ If you are using a Mac with Apple Silicon (M1/M2/M3) and the zsh shell, you may need to disable globbing for the installation command:
+
+ ```bash
+ noglob pip install -e .[dev]
+ ```
+ Install macOS-specific requirements optimized for Apple’s Metal Performance Shaders (MPS):
+ ```bash
+ pip install -r requirements-mps.txt
+ ```
+
+ ### Step 7: Installation verification (optional)
+ Verify that the installation was successful by running the unit tests:
+
+ ```bash
+ python -m unittest bmfm_sm.tests.all_tests
+ ```
+
+
+ ### Get embedding example
+
+ A simple example:
+ ```python
+ # Necessary imports
+ from bmfm_sm.api.smmv_api import SmallMoleculeMultiViewModel
+ from bmfm_sm.core.data_modules.namespace import LateFusionStrategy
+
+ # Load the pretrained model
+ model = SmallMoleculeMultiViewModel.from_pretrained(
+     LateFusionStrategy.ATTENTIONAL,
+     model_path="ibm/biomed.sm.mv-te-84m",
+     huggingface=True
+ )
+
+ # Load the model and get embeddings for a molecule
+ example_smiles = "CC(C)CC1=CC=C(C=C1)C(C)C(=O)O"
+ example_emb = SmallMoleculeMultiViewModel.get_embeddings(
+     smiles=example_smiles,
+     model_path="ibm/biomed.sm.mv-te-84m",
+     huggingface=True,
+ )
+ print(example_emb.shape)
+ ```
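A common next step with embeddings like the one above is comparing two molecules. The sketch below computes cosine similarity in plain Python. It assumes each embedding has already been converted to a flat list of floats; the `cosine_similarity` helper and the placeholder vectors are illustrative and are not produced by the model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Placeholder vectors standing in for two molecule embeddings.
emb_a = [0.12, -0.40, 0.33, 0.08]
emb_b = [0.10, -0.35, 0.30, 0.02]
print(f"{cosine_similarity(emb_a, emb_b):.3f}")
```

Values near 1 indicate embeddings pointing in nearly the same direction; whether that corresponds to chemical similarity depends on the model and is not claimed here.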
+
+ ### Get prediction example
+
+ ```python
+ from bmfm_sm.api.smmv_api import SmallMoleculeMultiViewModel
+ from bmfm_sm.api.dataset_registry import DatasetRegistry
+
+ # Initialize the dataset registry
+ dataset_registry = DatasetRegistry()
+
+ # Example SMILES string
+ example_smiles = "CC(C)CC1=CC=C(C=C1)C(C)C(=O)O"
+
+ # Get dataset information for BACE (MoleculeNet: inhibition of human beta secretase 1;
+ # single-task binary classification, preferred metric ROC-AUC)
+ ds = dataset_registry.get_dataset_info("BACE")
+
+ # Load the finetuned model for the dataset
+ finetuned_model_ds = SmallMoleculeMultiViewModel.from_finetuned(
+     ds,
+     model_path="ibm/biomed.sm.mv-te-84m-MoleculeNet-ligand_scaffold-BACE-101",
+     inference_mode=True,
+     huggingface=True
+ )
+
+ # Get predictions
+ prediction = SmallMoleculeMultiViewModel.get_predictions(
+     example_smiles, ds, finetuned_model=finetuned_model_ds
+ )
+
+ print("Prediction:", prediction)
+ ```
+
+ #### Output:
+ ```text
+ Prediction: {'prediction': [0.85], 'label': None}
+ ```
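The output above is a dictionary holding one score per task. A minimal sketch of turning that score into a class label follows. It assumes the score is the probability of the positive class and that 0.5 is a sensible cutoff; neither is stated in this README, and the `to_label` helper is hypothetical.

```python
def to_label(result, threshold=0.5):
    """Map each score in result['prediction'] to a 0/1 class label."""
    return [int(score >= threshold) for score in result["prediction"]]


# Same structure as the printed output above.
prediction = {"prediction": [0.85], "label": None}
print(to_label(prediction))
```

In practice the threshold would be chosen on validation data rather than fixed at 0.5.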
+
+ For more advanced usage, see our detailed examples at: https://github.com/BiomedSciAI/biomed.multi-view
+
+
+ ## Citation
+
+ If you find our work useful, please consider starring the repository and citing our paper:
+ ```
+ @article{TBD,
+   title={TBD},
+   author={IBM Research Team},
+   journal={arXiv preprint arXiv:TBD},
+   year={2024}
+ }
+ ```
config.json ADDED
@@ -0,0 +1,18 @@
+ {
+   "agg_arch": "coeff_mlp",
+   "agg_gate_input": "TBD",
+   "agg_weight_freeze": "unfrozen",
+   "dropout_prob": 0.2,
+   "head": "mlp",
+   "hidden_dims": [
+     512,
+     384
+   ],
+   "inference_mode": true,
+   "input_dim": 512,
+   "num_classes_per_task": 1,
+   "num_tasks": 1,
+   "softmax": false,
+   "task_type": "classification",
+   "use_norm": false
+ }
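To make the head settings in this config concrete, the sketch below derives the linear-layer shapes that an MLP head with these dimensions would have. The layout (`input_dim`, then each entry of `hidden_dims`, then one output per task) is an assumption about how the fields are used, not something the commit states, and `head_layer_shapes` is a hypothetical helper.

```python
import json

# Subset of the config.json fields shown above.
CONFIG_TEXT = """
{
  "head": "mlp",
  "hidden_dims": [512, 384],
  "input_dim": 512,
  "num_tasks": 1
}
"""


def head_layer_shapes(config):
    """Return (in_features, out_features) pairs for an assumed MLP head."""
    dims = [config["input_dim"], *config["hidden_dims"], config["num_tasks"]]
    return list(zip(dims[:-1], dims[1:]))


config = json.loads(CONFIG_TEXT)
shapes = head_layer_shapes(config)
# Weights plus biases for each assumed linear layer.
param_count = sum(n_in * n_out + n_out for n_in, n_out in shapes)
print(shapes)
print(param_count)
```

Under this assumed layout the head is small compared with the 338 MB checkpoint, so most of the stored weights would belong to the view encoders.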
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:23cce3e412516076b41365aef00ce5f7a9df7e0f85b2e736275e096d1dede410
+ size 338415884
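The three lines above are a Git LFS pointer, not the weights themselves: they record the spec version, the SHA-256 of the real file, and its size in bytes. A minimal sketch of parsing such a pointer and checking a downloaded file against it follows. The `parse_lfs_pointer` and `file_matches` helpers are hypothetical names written for this example.

```python
import hashlib


def parse_lfs_pointer(text):
    """Parse the key/value lines of a Git LFS pointer file."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algorithm, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algorithm": algorithm,
        "digest": digest,
        "size": int(fields["size"]),
    }


def file_matches(path, info, chunk_size=1 << 20):
    """Check a local file's size and SHA-256 against a parsed pointer."""
    hasher = hashlib.sha256()
    size = 0
    with open(path, "rb") as handle:
        # Read in chunks so large checkpoints are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
            size += len(chunk)
    return size == info["size"] and hasher.hexdigest() == info["digest"]


POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:23cce3e412516076b41365aef00ce5f7a9df7e0f85b2e736275e096d1dede410
size 338415884
"""
info = parse_lfs_pointer(POINTER)
print(info["algorithm"], info["size"])
```

Calling `file_matches("model.safetensors", info)` on a downloaded copy would confirm that the file is complete and uncorrupted; the path is a placeholder for wherever the file was saved.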