To minimize bias and learn high-resolution single-nucleotide dependencies, we opted to align closely with the real data and use character-level tokenization with a 5-letter vocabulary: `A, T, C, G, N`, where `N` is commonly used in gene sequencing to denote uncertain bases. Sequences were also prefixed with a `[CLS]` token and suffixed with an `[EOS]` token as hooks for downstream tasks. We chose a context length of 4,000 nucleotides as the longest context that would fit within DNA FM 7B during pretraining, and chunked our dataset of 796 genomes into non-overlapping segments.
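The snippet below is a minimal sketch of this tokenization and chunking scheme. The vocabulary indices, special-token handling, and helper names (`tokenize`, `chunk_genome`) are illustrative assumptions, not the released tokenizer.

```python
# Illustrative sketch only: the actual DNA FM 7B vocabulary indices and
# special-token conventions may differ.

CONTEXT_LENGTH = 4000  # nucleotides per pretraining segment

# Hypothetical vocabulary: special tokens plus the 5-letter nucleotide alphabet.
VOCAB = {"[CLS]": 0, "[EOS]": 1, "A": 2, "T": 3, "C": 4, "G": 5, "N": 6}


def tokenize(sequence: str) -> list[int]:
    """Map a nucleotide string to token IDs, with [CLS]/[EOS] hooks."""
    ids = [VOCAB["[CLS]"]]
    ids += [VOCAB[base] for base in sequence.upper()]
    ids.append(VOCAB["[EOS]"])
    return ids


def chunk_genome(genome: str, length: int = CONTEXT_LENGTH) -> list[list[int]]:
    """Split a genome into non-overlapping segments and tokenize each one."""
    return [
        tokenize(genome[start:start + length])
        for start in range(0, len(genome), length)
    ]


# Example with a short synthetic sequence.
print(tokenize("ACGTN"))  # [0, 2, 4, 5, 3, 6, 1]
```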
## Evaluation of DNA FM 7B
We evaluate the benefits of pretraining DNA FM 7B by conducting a comprehensive series of experiments spanning functional genomics, genome mining, metabolic engineering, synthetic biology, and therapeutics design, covering supervised, unsupervised, and generative objectives. Unless otherwise stated, hyperparameters were determined by optimizing model performance on a 10% validation split of the training data, and models were tested using the checkpoint with the lowest validation loss. For more detailed information, please refer to [our paper](https://doi.org/10.1101/2024.12.01.625444).
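As a rough sketch of this protocol, assuming a simple random 90/10 split and checkpoint records keyed by validation loss (the function names and record format here are hypothetical, not the actual evaluation pipeline):

```python
import random

# Sketch of the evaluation protocol described above; placeholder data structures,
# not the actual DNA FM 7B pipeline.

def split_train_validation(examples: list, validation_fraction: float = 0.1, seed: int = 0):
    """Hold out a fraction of the training data as a validation split."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * validation_fraction)
    return shuffled[n_val:], shuffled[:n_val]  # (train, validation)


def select_best_checkpoint(checkpoints: list[dict]) -> dict:
    """Pick the checkpoint with the lowest validation loss for testing."""
    return min(checkpoints, key=lambda ckpt: ckpt["validation_loss"])


# Example with dummy checkpoint records.
history = [
    {"step": 1000, "validation_loss": 0.92},
    {"step": 2000, "validation_loss": 0.81},
    {"step": 3000, "validation_loss": 0.85},
]
print(select_best_checkpoint(history)["step"])  # 2000
```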
## Results
<center><img src="circle_benchmarks.png" alt="Downstream results of DNA FM 7B" style="width:70%; height:auto;" /></center>