Model Card for MOF-Master-Llama-3-8b-Random-MOF-hackathon

This model randomly generates MOFids from a wide range of nodes, linkers, topologies and catenations, which can then be used for different applications. As an example, here we focus on predicting CH4/N2 gas separation performance, following the MOF-GRU paper. Give a number as the input to get that many randomly generated, unique MOFids.

Model Details

Only LoRA Adapters are provided. Merge with base Llama-3-8b model for inference.

Model Description

This model is a 4-bit quantized, fine-tuned version of Llama3-8b, specialized for generating Metal-Organic Framework (MOF) IDs. It produces a user-specified number of random MOFids, each following the general MOFid structure described in this paper.

Developed by: The MOF Masters
Shared by: Aritra Roy
Model type: Text Generation
Language(s) (NLP): English
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Finetuned from model: Llama3-8b

Model Sources

Repository: https://huggingface.co/MOFMasters/MOF-Master-Llama-3-8b-Random-MOF-hackathon

Uses

Direct Use

This model is designed to generate random Metal-Organic Framework (MOF) IDs. Users can specify the number of MOFids they want to generate, and the model will produce that many unique identifiers.

Out-of-Scope Use

This model is not designed for tasks other than MOFid generation. It should not be used for general text generation, question answering, or any task unrelated to MOF identification.

Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.

How to Get Started with the Model

Setup

First, make sure you have the required libraries installed:

pip install xformers trl peft accelerate bitsandbytes tqdm python-dotenv wandb scikit-learn rdkit selfies "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

Usage

Colab Scripts

Finetuning: https://cutt.ly/finetune-llama3-8b-for-MOFid-generation
Inference & Prediction: https://cutt.ly/generate-MOFid-and-predict-gas-separation

from rdkit import Chem
import re
import pandas as pd
import selfies as sf
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Select the device for the tokenized inputs. The 4-bit model is placed on the GPU
# when it is loaded below, so it needs no explicit .to(device) call.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the base model
from unsloth import FastLanguageModel
max_seq_length = 1024 # Choose any! Unsloth auto-supports RoPE Scaling internally! However, for our work 1024 is more than enough.
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = hf_token, # needed for gated models like meta-llama/Llama-3-8b
)

# Load the LoRA adapter using local path
model.load_adapter("LoRA-Llama-3-8b-MOFMaster")

Prompt template and helper function

alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:
{output}"""

possible_topologies = ['pcu', 'pts', 'fsc', 'lvt', 'fof', 'bcu', 'nbo', 'dia', 'thj', 'sqc', 'hxg', 'moa', 'cds', 'lon', 'uni', 'fel', 'ths', 'sxb', 'sql', 'ssb', 'mmt', 'flu', 'pto', 'asf', 'bsn', 'umx', 'wut', 'dmp', 'vmi', 'una', 'tfz', 'fcu', 'cdl', 'upa', 'xai', 'unc', 'tfs', 'uoc', 'jea', 'moc', 'rnb', 'ptr', 'mco', 'kag', 'bbl', 'jeb', 'und', 'ung', 'ukk', 'bbj', 'mote', 'jsd', 'uml', 'qtz', 'xbe', 'wbl', 'crb', 'icf', 'ato', 'ttp', 'ftw', 'stc', 'unj', 'mou', 'baa', 'tfl', 'tbo', 'tfo', 'ins', 'ske', 'ptt', 'uog', 'unh', 'ume', 'bnn', 'sse', 'rtl', 'qzd', 'sod', 'frl', 'mog', 'rob', 'stj', 'sml', 'bbm', 'xmz', 'itv', 'uot', 'uov', 'bpi', 'uoj', 'nab', 'stu', 'sbr', 'neb', 'pte', 'bba', 'bbd', 'mgg', 'smd', 'bbi', 'coe', 'nog', 'cdm', 'cdle', 'wjh', 'mcn', 'cda', 'noq', 'qdl', 'tfzd', 'sxc', 'sne', 'flue', 'cdz', 'cdn', 'uny', 'nia', 'rna', 'bbk', 'act', 'vby', 'wmi', 'smt', 'umv', 'hmse', 'vbm', 'ithd', 'kto', 'bbe', 'stb', 'snx', 'ctn', 'why', 'unb', 'dmd', 'nom', 'bbh', 'pcl', 'usf', 'atn', 'fry', 'sma', 'eea', 'tfa', 'lcy', 'unp', 'vmd', 'nou', 'scu', 'mot', 'hms', 'the', 'ukm', 'smn', 'tfi', 'rhr', 'umg', 'uom', 'flt', 'nor', 'npo', 'mdf', 'bel', 'crs', 'hyw', 'snk', 'pth', 'fsm', 'wky', 'ssd', 'ssa', 'yug', 'isq', 'fnh', 'tfe', 'bbf', 'sab', 'unm', 'fet', 'tsg', 'wei', 'pcuh', 'uod', 'msp', 'snl', 'zyl', 'umj', 'tfb', 'brl', 'uob', 'fsg', 'los', 'muo', 'vcc', 'snz', 'une', 'fsh', 'smk', 'nox', 'uki', 'mmo', 'sqp', 'sit', 'bbg', 'dft', 'zxc', 'sra', 'ssf', 'pds', 'bik', 'uos', 'gra', 'uoh', 'reo', 'smg', 'ecu', 'isp', 'nts', 'lil', 'spn', 'eta', 'srs', 'bbr', 'uku', 'bco', 'umw', 'mab', 'xux', 'isx', 'acs', 'umc', 'dmc', 'urh', 'unn', 'nfc', 'lcv', 'not', 'skd', 'nat', 'sol', 'vmj', 'llj', 'apo', 'fmj', 'sni', 'smc', 'vbo', 'cag', 'gwg', 'smm', 'hex', 'upb', 'qtzx', 'jbw', 'ket', 'vmk', 'sur', 'tsb', 'uoq', 'sta', 'mer', 'wfa', 'tfg', 'smb', 'qtze', 'ukg', 'cut', 'mmm', 'stw', 'sda', 'lfm', 'fjh', 'gis', 'cus', 'apd', 'tcb', 'wjf', 'btu', 'fsl', 'vmg', 
'hcb', 'ksx', 'mok', 'bbx', 'ucn', 'sty', 'bne', 'ukv', 'bbs', 'ttx', 'anh', 'stx', 'gee', 'ofp', 'sow', 'tfc', 'wiv', 'umq', 'lim', 'ant', 'ukc', 'xbq', 'sms', 'zyg', 'csq', 'cml', 'sca', 'cdq', 'ums', 'etb', 'nod', 'mod', 'ile', 'snq', 'wgy', 'lcs', 'kea', 'wia', 'snp', 'phi', 'ntt', 'can', 'cbt', 'smu', 'tfj', 'lig', 'fog', 'oso', 'lqm', 'zec', 'lwg', 'bcq', 'baz', 'umo', 'epz', 'gsi', 'som', 'lni', 'wma', 'znp', 'bpq', 'asv', 'btoe', 'uoe', 'cbn', 'uox', 'tsy', 'bbv', 'vmh', 'uow', 'etbe', 'fvl', 'uoa', 'fvn', 'uol', 'osa', 'cfc', 'ylf', 'wji', 'ukj', 'mjb', 'iss', 'ltj', 'fse', 'pcb', 'tsa', 'ttu', 'qnb', 'bcn', 'uop', 'phw', 'wmf', 'upd', 'unx', 'stp', 'cha', 'deh', 'umm', 'uof', 'spl', 'sno', 'vme', 'fsy', 'ukn', 'bcg', 'cdj', 'urj', 'smj', 'pyr', 'tty', 'umr', 'wmg', 'lone', 'wmc', 'xat', 'utp', 'brk', 'tzs', 'ict', 'cqh', 'phx', 'umu', 'ptsf']
possible_catenations = ['cat0', 'cat1', 'cat3', 'cat5', 'cat2']
possible_linker_elements = ['[=Branch2]', '[Li]', '[Ring1]', '[=S]', '[CH2]', '[=NH0]', '[=Branch3]', '[Cu]', '[=CH1]', '[CH0]', '[Co]', '[#Branch2]', '[=Ring1]', '[Branch3]', '[Branch2]', '[O-1]', '[#Branch1]', '[F]', '[CH3]', '[N+1]', '[P]', '[C]', '[I]', '[=Ring2]', '[S]', '[SH0]', '[N]', '[Si]', '[#C]', '[=C]', '[NH0]', '[Cl]', '[=CH0]', '[#CH0]', '[=O]', '[Ring2]', '[=N+1]', '[Branch1]', '[=N]', '[NH1]', '[OH0]', '[Mn]', '[CH1]', '[Br]', '[=Branch1]', '[IH0]', '[O]', '[none]', '[#N]']
possible_nodes = ['[Cu]1[Cu][Cu][Cu]1', '[Ti]12[O]3[Ti]4[O]2[Ti]2[O]4[Ti]4[O]5[Ti]3[O]1[Ti]5[O]24', '[Tb]12[OH]3[Tb]4[OH]2[Tb]2[OH]1[Tb]3[OH]42', '[Cu][OH]([Cu])[Cu]', '[O][Ni][O]([Ni][O])[Ni][O]', '[Eu]12[OH]3[Eu]4[OH]2[Eu]2[OH]1[Eu]3[OH]42', 'Cl[Cd]Cl', '[Ni][OH2]([Ni])[Ni]', '[O]12[Ti]34[OH]5[Ti]62[OH]2[Ti]71[OH]4[Ti]14[O]3[Ti]35[O]6[Ti]2([O]71)[OH]43', '[Pr]', '[O][Cr][O]([Cr][O])[Cr][O]', '[Fe][O]([Fe])[Fe]', '[Ni][Ni]', '[Fe]', '[Co][OH]([Co])[Co]', '[In][O]1[Mn][O]([Mn]1)[In]', '[Y]', 'O[Cu]', '[Ni]O[Ni]', '[Ni][OH]1[Ni][OH]([Ni])[Ni]2[OH]([Ni]1[OH]2[Ni])[Ni]', '[Ni][O]([Zn])[Zn]', '[Mg][OH2][Mg]', '[Ni][OH2][Ni]', '[O]12[Hf]34[O]5[Hf]62[O]2[Hf]71[O]4[Hf]14[O]3[Hf]35[O]6[Hf]2([O]71)[O]43', '[Co][OH]1[Co][OH]([Co])[Co]2[OH]([Co]1[OH]2[Co])[Co]', '[Nd][Nd]', '[Mg][OH]1[Mg][OH]([Mg]1)[Mg]', '[Zn][O]([Zn])([Zn])[Zn]', '[Al]', '[U]', '[Gd]', '[Cu][OH]1[Cu][OH]([Cu]1)[Cu]', 'Cl[Al]Cl', '[Co][O]([Zn])[Zn]', '[Sr]', '[Fe][Fe]', '[Zn][OH][Zn]', '[Gd]12[OH]3[Gd]4[OH]2[Gd]2[OH]1[Gd]3[OH]42', '[O][Fe][O]([Fe][O])[Fe][O]', '[Lu]', '[O]12[Zr]34[OH]5[Zr]62[OH]2[Zr]71[OH]4[Zr]14[O]3[Zr]35[O]6[Zr]2([O]71)[OH]43', '[Cu]Br', 'F[Al]', '[Zr]', '[Yb]', '[Ce]', '[Pr]12[OH]3[Pr]4[OH]2[Pr]2[OH]1[Pr]3[OH]42', '[Ni][OH][Ni]', '[O]12[Zr]34[OH]5[Ce]62[OH]2[Zr]71[OH]4[Ce]14[O]3[Zr]35[O]6[Zr]2([O]71)[OH]43', '[In]', '[O]12[Zr]34[O]5[Zr]62[O]2[Zr]71[O]4[Zr]14[O]3[Zr]35[O]6[Zr]2([O]71)[O]43', '[Zn][O]([Zn])[Zn]', '[Cr][Cr]', '[Ni]', '[Rb]1[O]2[O]1[Rb]2', '[OH2][La]', '[Mn]', '[Zn][O]([Cd])([Cd])[Cd]', '[Zn][OH]1[Zn][OH]([Zn]1)[Zn]', '[Tm]', '[Cu][Cu][Cu][Cu]', '[Sm]', '[Zn]Br', '[Cu]I', '[Pr][Pr]', '[Mg]', '[Co][OH]([Co][OH]([Co])[Co])[Co]', 'I[Cu]1[Cu][Cu]1(I)(I)I', '[V]1[OH][V][OH][V][OH][V][OH]1', 'Cl[Cd]', '[Ti]', '[Co][OH]1[Co][OH]([Co]1)[Co]', '[Sn][O]1[Sn][O]([Sn]1)[Sn]', 'Cl[La](Cl)Cl', '[Dy][Dy]', '[Mg][OH2][Mg][OH2][Mg]', '[Cu][Cu]', '[Dy]', '[O]12[Hf]34[OH]5[Hf]62[OH]2[Hf]71[OH]4[Hf]14[O]3[Hf]35[O]6[Hf]2([O]71)[OH]43', '[OH2][Lu]', '[Ni][O]([Ni])([Ni])[Ni]', '[Co][OH2][Co]', 
'[S][Cu][Cu][S]', '[Mn][O]([Mn])[Mn]', '[Cd]', '[Zn]', '[Ni]O[Ni]1O[Ni](O1)O[Ni]', '[Cu][OH]1[Cu][OH]([Cu]1)[Cu]12([OH2][OH2]2)[OH2][OH2]1', 'O1O[Co]1[Co]1OO1', '[Eu]', 'Cl[Zn]', '[Ca]', '[Fe][O]1[Fe][O]([Fe]1)[Fe]', '[Ni][OH2][Ni]1[OH2][Ni][OH2]1', '[Cu][OH][Cu][OH][Cu]', '[La]', 'Cl[Cu]', '[Ni]O[Ni]O[Ni]', '[Co]O[Co]1O[Co](O1)O[Co]', '[Co][O]([Zn])[Co]', 'Cl[Co]Cl', '[Zn][Zn]', '[Tb]', '[Nd]', '[Co][Co]', '[OH2][Gd]', '[Ag]', 'Cl[Mn][Mn]Cl', '[Li]', '[Er][Er]', '[Cu]O[Cu]', '[Sc]', '[Ho]', '[Er]', '[Cd][Cd]', '[Ni][OH]1[Ni][OH]([Ni]1)[Ni]', '[Cu][O]1[Cu][O]([Cu]1)[Cu]', '[Np]1O[Np]O[Np]O1', '[OH2][Ni][OH2][Ni]', '[Mn][Mn]', '[Cu]', '[Y][Y]', '[Co]', '[Zn][OH]([Zn][OH]([Zn])[Zn])[Zn]']

def validate_mof(mof_id, possible_topologies, possible_catenations):
    parts = mof_id.split()
    molecule_part = " ".join(parts[:-1])
    identifier_part = parts[-1]
    signature = "MOFid-v1"
    valid_molecules = []

    try:
        split_index = molecule_part.rfind('.')
        smiles_part = molecule_part[:split_index]
        node = molecule_part[split_index + 1:]  # inorganic building block after the last '.'

        try:
            smiles_list = smiles_part.split('.')
            if len(smiles_list) <=2:
                for smile in smiles_list:
                    # Parse without sanitizing, then sanitize explicitly in a single call
                    molecule = Chem.MolFromSmiles(smile, sanitize=False)
                    if molecule is not None:
                        # catchErrors=True returns the failing flag instead of raising
                        if Chem.SanitizeMol(molecule, catchErrors=True) == Chem.SanitizeFlags.SANITIZE_NONE:
                            valid_molecules.append(smile)
                    if len(valid_molecules) == 0:
                        continue
                    if len(valid_molecules) <= 2:
                        if ';' in identifier_part:
                            identifier_part = identifier_part.split(';')[0]
                        try:
                            linker_elements = []
                            is_valid_linkers = False
                            for smile in valid_molecules:
                                selfie = sf.encoder(smile)
                                elements = list(sf.split_selfies(selfie))
                                linker_elements.extend(elements)
                            if set(linker_elements).issubset(set(possible_linker_elements)):
                                is_valid_linkers = True
                            format_signature, topology, catenation = identifier_part.split('.')
                            if format_signature == signature and topology in possible_topologies and catenation in possible_catenations and is_valid_linkers and node in possible_nodes:
                                match = re.search(r'\sMOFid-v1', mof_id)
                                if match:
                                    return mof_id
                                else:
                                    print("No space before MOFid-v1")
                            else:
                                print(f"Invalid signature, topology, catenation, linker elements, or node: {mof_id}")
                        except ValueError:
                            print(f"Couldn't split the identifier part into format_signature, topology, catenation: {identifier_part}")

        except ValueError:
            print(f"Couldn't split the smiles part into a list: {smiles_part}")
    except ValueError:
        print(f"Couldn't split the molecule part into SMILES and Building Block: {molecule_part}")
    return None
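
As a quick check of the prompt template above, the following minimal sketch fills the template with an empty response slot and then recovers the completion from decoded text, mirroring the split on "### Response:" used in the generation loop. The helper names `build_prompt` and `extract_response` and the stand-in completion string are illustrative assumptions, not part of the released code.

```python
# Same Alpaca-style template as in the model card.
ALPACA_PROMPT = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:
{output}"""


def build_prompt(instruction: str, user_input: str) -> str:
    """Fill the template, leaving the response slot empty for the model."""
    return ALPACA_PROMPT.format(instruction=instruction, input=user_input, output="")


def extract_response(decoded: str) -> str:
    """Return only the text after the final '### Response:' marker."""
    return decoded.split("### Response:\n")[-1].strip()


prompt = build_prompt(
    "You are a random MOF predictor.",
    "Generate a random MOFid with a maximum of two organic linkers",
)
# Hypothetical stand-in for decoded model output: the prompt followed by a completion.
fake_decoded = prompt + "LINKER_SMILES.[Fe] MOFid-v1.pts.cat0\n"
print(extract_response(fake_decoded))
```

Because the decoded output echoes the prompt, taking the last segment after the marker keeps only the generated MOFid.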

Generate and validate MOFids

# Read the training dataset for validating generated MOFids
df = pd.read_csv('train_dataset.csv')

# Prepare the input for the model
instruction = "You are a random MOF predictor. The general structure of the MOFid is-[SMILES code of 1st organic linker].[SMILES code of 2nd organic linker or [none]].[inorganic building block] MOFid-v1.[topology code].[catenation type]."
user_input = "Generate a random MOFid with a maximum of two organic linkers"

inputs = tokenizer(
    [alpaca_prompt.format(instruction=instruction, input=user_input, output="")],
    return_tensors="pt"
).to(device)

# Generate MOF IDs
num_mofs = 25  # number of unique, valid MOFids to collect
valid_mofs = []
while True:
    # Sample instead of decoding greedily; greedy decoding would return the same MOFid on every pass
    output = model.generate(**inputs, max_new_tokens=256, do_sample=True)
    response = tokenizer.decode(output[0], skip_special_tokens=True)
    mof_id = response.split("### Response:\n")[-1].strip()

    # Check if MOFid already exists in the database or valid_mofs list
    if mof_id in df['MOF_ID'].values or mof_id in valid_mofs:
        print("MOF already exists in the DataFrame. Generating another one...")
        continue

    # Check chemical validity of the MOF
    if validate_mof(mof_id, possible_topologies, possible_catenations):
        valid_mofs.append(mof_id)
        print(f"Valid MOF generated: {mof_id}")
        if len(valid_mofs) == num_mofs:
            break
    else:
        print("Invalid MOF generated. Trying again...")

# Print the generated valid 25 unique MOFs
for index, mof in enumerate(valid_mofs):
    print(f"{index+1}. {mof}")
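
The loop above runs until enough valid MOFids are found, which can take arbitrarily long if the model keeps repeating itself. A minimal sketch of the same collect-until-n pattern with an attempt cap is shown below. The function `collect_unique`, its parameters, and the toy generator are hypothetical; in practice `generate_one` would wrap `model.generate` and `is_valid` would wrap `validate_mof`.

```python
from itertools import cycle
from typing import Callable, Iterable, List


def collect_unique(
    generate_one: Callable[[], str],
    is_valid: Callable[[str], bool],
    n: int,
    known: Iterable[str] = (),
    max_attempts: int = 1000,
) -> List[str]:
    """Collect up to n valid candidates that are neither known nor repeated.

    Returns fewer than n items if max_attempts is exhausted first.
    """
    seen = set(known)
    collected: List[str] = []
    for _ in range(max_attempts):
        if len(collected) == n:
            break
        candidate = generate_one()
        if candidate in seen:
            continue  # already in the training set or already tried
        seen.add(candidate)  # remember invalid candidates too, so they are not re-validated
        if is_valid(candidate):
            collected.append(candidate)
    return collected


# Toy stand-ins for the model and the validator.
toy_outputs = cycle(["a", "b", "a", "bad", "c"])
result = collect_unique(
    generate_one=lambda: next(toy_outputs),
    is_valid=lambda s: s != "bad",
    n=2,
    known={"a"},
)
print(result)
```

Capping the attempts trades completeness for a guaranteed stop; the caller can compare `len(result)` with `n` to detect a shortfall.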

Training Details

Training Data

The training data comprises more than 110k MOFid entries prepared from the dataset provided in the MOF-GRU paper.

Training Procedure

The model was fine-tuned using the unsloth library on an A100 GPU provided by King's College London, UK. The fine-tuning process took 8 hours and achieved a final loss of 0.67.

Preprocessing

All linker SMILES are converted into SELFIES using the selfies Python library. All possible linker elements, nodes, topologies and catenations are embedded through a vector embedding process (from the MOF-GRU paper).

Training Hyperparameters

Training regime: Mixed precision (fp16 or bf16, depending on hardware support)
Optimizer: AdamW (8-bit)
Learning rate: 2e-4
Batch size: 128 per device
Number of epochs: 1
Weight decay: 0.01
Warmup ratio: 0.1
Learning rate schedule: Linear
Max sequence length: 1024
Gradient checkpointing: "unsloth" (optimized for very long context and 30% less VRAM usage)
Random seed: 3407
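
For reference, the hyperparameters above can be gathered into a plain configuration dictionary. The sketch below is illustrative only: the key names follow common Hugging Face trainer argument names as an assumption about how the values were passed, and the step arithmetic assumes a single GPU, no gradient accumulation, and a round figure of 110,000 training examples.

```python
import math

# Values from the list above; key names are assumed, not taken from the training script.
training_config = {
    "optim": "adamw_8bit",
    "learning_rate": 2e-4,
    "per_device_train_batch_size": 128,
    "num_train_epochs": 1,
    "weight_decay": 0.01,
    "warmup_ratio": 0.1,
    "lr_scheduler_type": "linear",
    "max_seq_length": 1024,
    "seed": 3407,
}


def schedule_steps(num_examples, config, num_devices=1, grad_accum=1):
    """Estimate total optimizer steps and warmup steps for the run."""
    effective_batch = config["per_device_train_batch_size"] * num_devices * grad_accum
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * config["num_train_epochs"]
    warmup_steps = math.ceil(total_steps * config["warmup_ratio"])
    return total_steps, warmup_steps


# 110_000 is an assumed round figure for "more than 110k" examples.
total_steps, warmup_steps = schedule_steps(110_000, training_config)
print(total_steps, warmup_steps)
```

Under these assumptions, one epoch is a few hundred optimizer steps, with roughly the first tenth spent on linear warmup.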

Speeds, Sizes, Times

Total training time: 3 hours
Hardware: A100 GPU
Checkpointing:
- Save strategy: Steps
- Save steps: 50
- Save total limit: 5
Logging frequency: Every 5 steps
Final loss: 0.074
Model size: 4-bit quantized version of Llama3-8b

Figure: train/loss vs. step plot from wandb during fine-tuning.

Evaluation

Because the model generates MOFids at random, responses were evaluated for validity rather than against reference outputs. Linkers were checked with the selfies and rdkit Python libraries to confirm that each is a valid organic molecule; the remaining components were checked for membership in our database.
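
The membership check for linker symbols can be sketched without any chemistry library. The example below splits a SELFIES string into bracketed symbols with a regular expression and reports symbols that fall outside an allowed vocabulary. The regex is a simplified stand-in for `sf.split_selfies`, and the small `allowed` set and the function names are assumptions made for illustration.

```python
import re
from typing import Iterable, List, Set

# A SELFIES string is a sequence of bracketed symbols such as "[C][=O]".
SELFIES_SYMBOL = re.compile(r"\[[^\[\]]*\]")


def split_symbols(selfies_string: str) -> List[str]:
    """Split a SELFIES string into its bracketed symbols."""
    return SELFIES_SYMBOL.findall(selfies_string)


def unknown_symbols(selfies_string: str, allowed: Iterable[str]) -> Set[str]:
    """Return the symbols that are not part of the allowed vocabulary."""
    return set(split_symbols(selfies_string)) - set(allowed)


# Illustrative subset of a linker vocabulary.
allowed = {"[C]", "[=C]", "[=O]", "[O-1]", "[Ring1]", "[=Branch1]"}

print(unknown_symbols("[C][=C][C][=C][C][=C][Ring1][=Branch1]", allowed))
print(unknown_symbols("[C][=O][Se]", allowed))
```

A linker passes this check only when the returned set is empty, which is the same subset test applied to `possible_linker_elements` in `validate_mof`.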

Results

The following are two randomly generated MOFids:

[O-1]C(=O)C1=CC(=CC(=C1)C(=O)[O-1])C2=CC=C(C=C2)C3=CC=C(C=C3)C4=CC(=CC(=C4)C(=O)[O-1])C(=O)[O-1].[O-1]C(=O)C1=CC2=C(C=C1C(=O)[O-1])C=C3C(=C2C4=CC=C[NH1]4)C=C(C(=C3C(C)C)C(=O)[O-1])C(=O)[O-1].[Fe] MOFid-v1.pts.cat0

CC(=O)C1=CC(=CC(=C1C(=O)[O-1])C(=O)C)C2=CC=C(C=C2)C3=CC(C(=O)C)=C(C(=C3)C(=O)C)C(=O)[O-1].[O-1]C(=O)C#CC#CC(=O)[O-1].[Cu][Cu] MOFid-v1.nbo.cat0
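
To make the structure of these strings explicit, the sketch below splits a MOFid into linkers, node, topology and catenation, using the second example above as input. The `parse_mofid` function and the `ParsedMOFid` type are hypothetical helpers written for this illustration; they only split the string and do not check chemical validity.

```python
from typing import List, NamedTuple


class ParsedMOFid(NamedTuple):
    linkers: List[str]
    node: str
    topology: str
    catenation: str


def parse_mofid(mof_id: str) -> ParsedMOFid:
    """Split '<linkers>.<node> MOFid-v1.<topology>.<catenation>' into parts."""
    molecule_part, identifier_part = mof_id.rsplit(" ", 1)
    # Drop any trailing ';comment' before splitting the identifier.
    signature, topology, catenation = identifier_part.split(";")[0].split(".")
    if signature != "MOFid-v1":
        raise ValueError(f"Unexpected signature: {signature}")
    # The last dot-separated fragment is the node; the rest are linkers.
    *linkers, node = molecule_part.split(".")
    linkers = [linker for linker in linkers if linker != "[none]"]
    return ParsedMOFid(linkers, node, topology, catenation)


example = (
    "CC(=O)C1=CC(=CC(=C1C(=O)[O-1])C(=O)C)C2=CC=C(C=C2)"
    "C3=CC(C(=O)C)=C(C(=C3)C(=O)C)C(=O)[O-1]"
    ".[O-1]C(=O)C#CC#CC(=O)[O-1].[Cu][Cu] MOFid-v1.nbo.cat0"
)
parsed = parse_mofid(example)
print(parsed.node, parsed.topology, parsed.catenation, len(parsed.linkers))
```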

Citation

We plan to develop this hackathon project into a scientific paper. Until then, please cite this repository URL as the reference.

Model Card Authors

Aritra Roy, Piyush R. Maharana, Tarak Nath Das

Model Card Contact

Aritra Roy
