Continuous Pre-training of the LLaMA Model on DNA Sequence Data
We continue pre-training the LLaMA model on DNA sequence data, using a comprehensive and diverse dataset to further enhance the model's understanding and representation of genomic information. Specifically:
- DNA sequence data:
  - Following the pre-training data approach of DNABERT, we extract fragments of 300 to 1,000 base pairs (bp) from multiple model organisms (see the sketch after this list for an illustration of the sampling step). The total DNA sequence data volume is approximately 16 GB.
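To make the data preparation concrete, here is a minimal sketch of the fragment-sampling step, assuming the genome is available as a local FASTA file. The file names, fragment count, and output format are hypothetical placeholders, not the exact pipeline used for this model.

```python
# Hypothetical sketch of sampling 300-1000 bp fragments from a genome.
# File names and the fragment count are placeholders.
import random

def load_fasta(path):
    """Read a FASTA file and return the concatenated sequence (headers skipped)."""
    with open(path) as handle:
        return "".join(line.strip() for line in handle if not line.startswith(">"))

def sample_fragments(sequence, n_fragments, min_len=300, max_len=1000):
    """Draw random fragments of 300 to 1000 bp from the given sequence."""
    fragments = []
    for _ in range(n_fragments):
        length = random.randint(min_len, max_len)
        start = random.randint(0, len(sequence) - length)
        fragments.append(sequence[start:start + length])
    return fragments

if __name__ == "__main__":
    genome = load_fasta("GRCh38_chr1.fa")        # placeholder input file
    with open("dna_fragments.txt", "w") as out:  # one fragment per line
        for frag in sample_fragments(genome, 100_000):
            out.write(frag + "\n")
```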
By continually pre-training the LLaMA model on this DNA sequence data, we keep the model aligned with current genomic sequence data while preserving its ability to generalize well across different genomics tasks. This continued training improves the model's accuracy and robustness when handling complex biological sequences. The following snippet loads the dnagpt/llama-dna tokenizer and model and tokenizes a mixed DNA/English input.
from transformers import LlamaForCausalLM, LlamaTokenizer

# Load the DNA-adapted LLaMA tokenizer and model from the Hugging Face Hub
tokenizer = LlamaTokenizer.from_pretrained("dnagpt/llama-dna")
tokenizer.pad_token = tokenizer.eos_token  # LLaMA defines no pad token, so reuse EOS

model = LlamaForCausalLM.from_pretrained("dnagpt/llama-dna")  # starting point for continued pre-training

# Mixed DNA/English input to illustrate how the tokenizer handles both
text = '''GCTGACTCTGCCAGGATGGAATGAAATTAGGTTGTTTTAATTATAATGTAAAGTCAGTTCTAGTCAGACATAGTCACATAGGCAAGTAAGGGAACCTAAAATTGCTTGGAAT,
The primary use of LLaMA is research on large language models, including'''

print(f"Tokenized by DNA-LLaMA tokenizer: {tokenizer.tokenize(text)}")
import torch
from transformers import pipeline

model_id = "dnagpt/llama-dna"

# Text-generation pipeline using the DNA-adapted model
pipe = pipeline(
    "text-generation",
    model=model_id,
    # torch_dtype=torch.bfloat16,  # optionally run in bfloat16
    device_map="auto",
)

# English prompt and a raw DNA prompt
print(pipe("The key to life is"))
print(pipe("GGAATGAAATTAGGTTGTTTTAATTATAATGTAAAGTCAGTTCT"))