
Takashi Itoh

	---
	license: apache-2.0
	library_name: transformers
	pipeline_tag: feature-extraction
	tags:
	- chemistry
	---

	# selfies-ted

	selfies-ted is a project for encoding SMILES (Simplified Molecular-Input Line-Entry System) strings into SELFIES (SELF-referencIng Embedded Strings) and generating embeddings from them for use as molecular representations.
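As a rough illustration of the SMILES-to-SELFIES step, the sketch below uses the open-source `selfies` Python package. Whether selfies-ted calls this package in exactly this way is an assumption, and the helper name `smiles_to_selfies` is hypothetical.

```python
# Sketch only: assumes the third-party `selfies` package (pip install selfies).
# `smiles_to_selfies` is a hypothetical helper, not part of this project.
try:
    import selfies as sf
except ImportError:  # keep the example importable when selfies is absent
    sf = None


def smiles_to_selfies(smiles: str) -> str:
    """Convert one SMILES string into its SELFIES representation."""
    if sf is None:
        raise ImportError("install the `selfies` package to run this example")
    return sf.encoder(smiles)


if __name__ == "__main__" and sf is not None:
    for smiles in ["COC", "CCO"]:
        print(smiles, "->", smiles_to_selfies(smiles))
```

Each SELFIES token is a bracketed symbol, so a tokenizer can split the string on brackets instead of parsing SMILES ring and branch syntax.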

	![selfies-ted](selfies-ted.png)

	## Model Architecture

	Configuration details:

	- Encoder and decoder FFN dimensions: 256
	- Number of attention heads: 4
	- Number of encoder and decoder layers: 2
	- Total number of hidden layers: 6
	- Maximum position embeddings: 128
	- Model dimension (d_model): 256
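To make the relationship between these values concrete, here is a minimal sketch that stores a few of them in a dataclass and derives the per-head width. It assumes standard multi-head attention, where `d_model` is split evenly across heads. The class and field names are hypothetical and are not taken from the project's configuration file; only the fields needed for the check are included.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SelfiesTedConfig:
    """Hypothetical container for values from the list above."""

    d_model: int = 256                  # model dimension
    ffn_dim: int = 256                  # encoder and decoder FFN dimension
    attention_heads: int = 4            # number of attention heads
    max_position_embeddings: int = 128  # longest supported token sequence

    @property
    def head_dim(self) -> int:
        """Width of one attention head, assuming an even split of d_model."""
        if self.d_model % self.attention_heads != 0:
            raise ValueError("d_model must be divisible by attention_heads")
        return self.d_model // self.attention_heads


config = SelfiesTedConfig()
print(config.head_dim)  # 256 // 4 = 64
```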

	## Pretrained Models and Training Logs
	We provide checkpoints of the selfies-ted model pre-trained on a dataset of molecules curated from PubChem. The pre-trained model shows competitive performance on molecular representation tasks. The model weights are available at: "HuggingFace link".

	To install and use the pre-trained model:

	1. Download the `selfies_ted_model.pkl` file from the "HuggingFace link".
	2. Place `selfies_ted_model.pkl` in the `models/` directory. The directory structure should look like the following:

	```
	models/
	└── selfies_ted_model.pkl
	```
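A quick way to confirm the file landed in the right place is to check the path before loading anything. The helper below is a hypothetical sketch (the function name and the `project_root` parameter are assumptions); only the `models/selfies_ted_model.pkl` layout comes from the steps above.

```python
from pathlib import Path


def find_checkpoint(project_root: str = ".") -> Path:
    """Return the checkpoint path, or fail early with a clear message."""
    checkpoint = Path(project_root) / "models" / "selfies_ted_model.pkl"
    if not checkpoint.is_file():
        raise FileNotFoundError(f"expected checkpoint at {checkpoint}")
    return checkpoint


if __name__ == "__main__":
    try:
        print("found:", find_checkpoint())
    except FileNotFoundError as err:
        print("missing:", err)
```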

	## Installation

	To use this project, you'll need to install the required dependencies. We recommend using a virtual environment:

	```bash
	python -m venv venv
	source venv/bin/activate # On Windows use `venv\Scripts\activate`
	```

	Install the required dependencies:

	```bash
	pip install -r requirements.txt
	```
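After installation, a short check can report which packages are still not importable in the active environment. This is a sketch only: the package list is a guess (only `transformers` is named in this README's metadata), so replace it with the import names from `requirements.txt`.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


# Hypothetical list; the real requirements.txt may differ.
required = ["transformers", "torch", "selfies"]
missing = missing_packages(required)
print("missing packages:", ", ".join(missing) if missing else "none")
```

`find_spec` only locates a package without importing it, so the check stays fast even for large libraries.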


	## Usage

	### Import

	```python
	# load.py is this project's helper module
	import load
	```

	### Training the Model

	To train the model, use the `train.py` script:

	```bash
	python train.py -f <path_to_your_data_file>
	```


	Note: The exact usage depends on the implementation in `load.py`. Refer to the source code for details.

	### Load the model and tokenizer

	```python
	load.load("path/to/checkpoint.pkl")
	```

	### Encode SMILES strings

	```python
	smiles_list = ["COC", "CCO"]
	embeddings = load.encode(smiles_list)
	```
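One common use of such embeddings is comparing molecules by cosine similarity. The sketch below assumes each embedding is a flat sequence of numbers, one per molecule; the actual return type of `load.encode` is not documented here, so the vectors are made-up placeholders and `cosine_similarity` is a hypothetical helper.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Placeholder vectors standing in for two rows of `embeddings`.
emb_first = [0.1, 0.3, -0.2]
emb_second = [0.2, 0.1, -0.4]
print(round(cosine_similarity(emb_first, emb_second), 3))
```

If `load.encode` returns an array or data frame instead, convert each row to a list (or use the library's own vector operations) before comparing.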


	## Example Notebook

	An example notebook for this project is provided in `selfies-ted-example.ipynb`.