---
datasets:
- lamm-mit/protein_secondary_structure_from_PDB
library_name: transformers
base_model: lamm-mit/BioinspiredLlama-3-1-8B-128k
---

# Predict dominant secondary structure from protein sequence

This model is instruction-tuned on top of `lamm-mit/BioinspiredLlama-3-1-8B-128k` to predict the dominant secondary structure of a protein from its amino acid sequence.

Training script in Colab: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/#fileId=https://huggingface.co/lamm-mit/BioinspiredLlama-3-1-8B-128k-dominant-protein-SS-structure/blob/main/Fine_tune_BioinspiredLlama_3_1_Colab.ipynb)

Sample instruction:

```raw
Dominant secondary structure of < N E R R I L E Q K K H Y F W L L L Q R T Y T K T G K P K P S T W D L A S K E L G E S L E Y K A L G D E D N I R R Q I F E D F K P E >
```

Response (`AH` indicates a predominantly alpha-helical structure):

```raw
AH
```

Raw format of the training data (in the Llama 3.1 chat template format):

```raw
<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nDominant secondary structure of < V V F D V V F D V V F D V V F D V V F D V V F D V V F D V V F D ><|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nUNSTRUCTURED<|eot_id|>
```
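Such a string can be produced with the tokenizer's chat template. The snippet below is an illustrative sketch rather than the released training code: `format_training_example` is a hypothetical helper, and depending on the tokenizer's template a default system block may also be inserted into the output.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('lamm-mit/BioinspiredLlama-3-1-8B-128k')

# Hypothetical helper: turn one (spaced sequence, dominant-SS label) pair into a
# Llama 3.1 chat-template string like the one shown above.
def format_training_example(sequence_spaced, dominant_ss):
    messages = [
        {"role": "user", "content": f"Dominant secondary structure of < {sequence_spaced} >"},
        {"role": "assistant", "content": dominant_ss},
    ]
    # tokenize=False returns the templated string instead of token IDs
    return tokenizer.apply_chat_template(messages, tokenize=False)

print(format_training_example('V V F D V V F D', 'UNSTRUCTURED'))
```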
Here is a visual representation of what the model predicts:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/623ce1c6b66fedf374859fe7/lLCUOd3E8tC0LtwHRSOfP.png)

## How to load the model

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    'lamm-mit/BioinspiredLlama-3-1-8B-128k-dominant-protein-SS-structure',
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(
    'lamm-mit/BioinspiredLlama-3-1-8B-128k-dominant-protein-SS-structure',
)
```

Load in 4-bit quantization (base model plus fine-tuned adapter):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

base_model_name = "lamm-mit/BioinspiredLlama-3-1-8B-128k"
new_model = "lamm-mit/BioinspiredLlama-3-1-8B-128k-dominant-protein-SS-structure"  # fine-tuned adapter

bnb_config4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    trust_remote_code=True,
    device_map="auto",
    quantization_config=bnb_config4bit,
    torch_dtype=torch.bfloat16,
)
model = PeftModel.from_pretrained(model, new_model)
```

## Example

Inference function for convenience:

```python
def generate_response(text_input="What is spider silk?",
                      system_prompt='You are a biological materials scientist.',
                      num_return_sequences=1,
                      temperature=1.,  # the higher the temperature, the more creative the model becomes
                      max_new_tokens=127,
                      device='cuda',
                      num_beams=1,
                      eos_token_id=[128001, 128008, 128009],
                      top_k=50,
                      top_p=0.9,
                      repetition_penalty=1.1,
                      messages=[],
                      ):
    if messages == []:
        messages = [{"role": "system", "content": system_prompt},
                    {"role": "user", "content": text_input}]
    else:
        messages.append({"role": "user", "content": text_input})

    text_input = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer([text_input], add_special_tokens=True, return_tensors='pt').to(device)

    with torch.no_grad():
        outputs = model.generate(**inputs,
                                 max_new_tokens=max_new_tokens,
                                 temperature=temperature,
                                 num_beams=num_beams,
                                 top_k=top_k,
                                 eos_token_id=eos_token_id,
                                 top_p=top_p,
                                 num_return_sequences=num_return_sequences,
                                 do_sample=True,
                                 repetition_penalty=repetition_penalty,
                                 )

    # Strip the prompt tokens and return only the generated completion(s)
    outputs = outputs[:, inputs["input_ids"].shape[1]:]
    return tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True), messages
```

Usage:

```python
AA_sequence = 'N E R R I L E Q K K H Y F W L L L Q R T Y T K T G K P K P S T W D L A S K E L G E S L E Y K A L G D E D N I R R Q I F E D F K P E'

answer, _ = generate_response(text_input='Dominant secondary structure of < ' + AA_sequence + ' >',
                              max_new_tokens=16,
                              temperature=0.1)

print(f"Prediction: {answer[0]}")
```

The output is:

```raw
Prediction: AH
```

A visualization of the protein, to check:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/623ce1c6b66fedf374859fe7/aO-0BbS8Sp_dV796w-Hm0.png)

As predicted, this protein ([PDB ID 6N7P](https://www.rcsb.org/structure/6N7P)) is primarily alpha-helical.

## Notes

This model was trained with QLoRA on sequences shorter than 128 amino acids.

Training set and split:

```python
from datasets import load_dataset

max_seq_length = 128

# Dataset
dataset = load_dataset('lamm-mit/protein_secondary_structure_from_PDB')['train']
dataset = dataset.filter(lambda example: example['Sequence_length'] < max_seq_length)

# Rename columns
dataset = dataset.rename_column('Sequence_spaced', 'question')
dataset = dataset.rename_column('Primary_SS_Type', 'answer')

dataset = dataset.train_test_split(test_size=0.1, seed=42)
...
```

Test accuracy: 77.23% (see the evaluation sketch at the end of this card)

## Reference

```bibtex
@article{Buehler_2024,
    title   = {Fine-tuned LLMs for protein and other molecular feature predictions},
    author  = {Markus J. Buehler},
    journal = {},
    year    = {2024}
}
```
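As referenced in the Notes section, the reported test accuracy refers to the held-out 10% split. The loop below is a rough sketch of how a comparable exact-match evaluation could be run with the `generate_response` helper from the Example section; it is illustrative rather than the original evaluation script, and since generation is sampled the exact number may not be reproduced.

```python
from datasets import load_dataset

# Recreate the held-out split described in the Notes section
max_seq_length = 128
dataset = load_dataset('lamm-mit/protein_secondary_structure_from_PDB')['train']
dataset = dataset.filter(lambda example: example['Sequence_length'] < max_seq_length)
dataset = dataset.rename_column('Sequence_spaced', 'question')
dataset = dataset.rename_column('Primary_SS_Type', 'answer')
test_set = dataset.train_test_split(test_size=0.1, seed=42)['test']

# Score exact-match accuracy of the predicted dominant-SS label
# (slow: one generation per test sequence; subsample test_set for a quick check)
correct = 0
for example in test_set:
    prediction, _ = generate_response(
        text_input='Dominant secondary structure of < ' + example['question'] + ' >',
        max_new_tokens=16,
        temperature=0.1,
    )
    correct += int(prediction[0].strip() == example['answer'].strip())

print(f"Exact-match accuracy: {correct / len(test_set):.4f}")
```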