Continuous Pre-training of the LLaMA Model on DNA Sequence Data
We continue pre-training the LLaMA model on DNA sequence data, using a comprehensive and diverse dataset to further enhance the model's understanding and representation of genomic information. Specifically:
- DNA sequence data:
  - Following the pre-training data approach of DNABERT, we extract fragments of 300 to 1,000 base pairs (bp) from multiple model organisms (see the sketch after this list for an illustration of the sampling step). The total DNA sequence data volume is approximately 16 GB.
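To make the data preparation concrete, here is a minimal sketch of the fragment-sampling step, assuming the genome is available as a local FASTA file. The file names, fragment count, and output format are hypothetical placeholders, not the exact pipeline used for this model.

```python
# Hypothetical sketch of sampling 300-1000 bp fragments from a genome.
# File names and the fragment count are placeholders.
import random

def load_fasta(path):
    """Read a FASTA file and return the concatenated sequence (headers skipped)."""
    with open(path) as handle:
        return "".join(line.strip() for line in handle if not line.startswith(">"))

def sample_fragments(sequence, n_fragments, min_len=300, max_len=1000):
    """Draw random fragments of 300 to 1000 bp from the given sequence."""
    fragments = []
    for _ in range(n_fragments):
        length = random.randint(min_len, max_len)
        start = random.randint(0, len(sequence) - length)
        fragments.append(sequence[start:start + length])
    return fragments

if __name__ == "__main__":
    genome = load_fasta("GRCh38_chr1.fa")        # placeholder input file
    with open("dna_fragments.txt", "w") as out:  # one fragment per line
        for frag in sample_fragments(genome, 100_000):
            out.write(frag + "\n")
```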
By continually pre-training the LLaMA model on this DNA sequence data, we keep the model aligned with current genomic sequence data while preserving its ability to generalize well across different genomics tasks. This continued training improves the model's accuracy and robustness when handling complex biological sequences. The following snippet loads the dnagpt/llama-dna tokenizer and model and tokenizes a mixed DNA/English input.
from transformers import LlamaForCausalLM, LlamaTokenizer

# Load the DNA-adapted LLaMA tokenizer and model from the Hugging Face Hub
tokenizer = LlamaTokenizer.from_pretrained("dnagpt/llama-dna")
tokenizer.pad_token = tokenizer.eos_token  # LLaMA defines no pad token, so reuse EOS

model = LlamaForCausalLM.from_pretrained("dnagpt/llama-dna")  # starting point for continued pre-training

# Mixed DNA/English input to illustrate how the tokenizer handles both
text = '''GCTGACTCTGCCAGGATGGAATGAAATTAGGTTGTTTTAATTATAATGTAAAGTCAGTTCTAGTCAGACATAGTCACATAGGCAAGTAAGGGAACCTAAAATTGCTTGGAAT,
The primary use of LLaMA is research on large language models, including'''

print(f"Tokenized by DNA-LLaMA tokenizer: {tokenizer.tokenize(text)}")
import torch
from transformers import pipeline

model_id = "dnagpt/llama-dna"

# Text-generation pipeline using the DNA-adapted model
pipe = pipeline(
    "text-generation",
    model=model_id,
    # torch_dtype=torch.bfloat16,  # optionally run in bfloat16
    device_map="auto",
)

# English prompt and a raw DNA prompt
print(pipe("The key to life is"))
print(pipe("GGAATGAAATTAGGTTGTTTTAATTATAATGTAAAGTCAGTTCT"))