BarcodeBERT model trained on all complete DNA sequences from the latest BOLD database release. We used the 'nucraw' column of DNA sequences and followed the preprocessing steps outlined in the BarcodeBERT paper.
The model was trained for 17 epochs.
Example Usage
from transformers import PreTrainedTokenizerFast, BertForMaskedLM
model = BertForMaskedLM.from_pretrained("LofiAmazon/BarcodeBERT-Entire-BOLD")
model.eval()
tokenizer = PreTrainedTokenizerFast.from_pretrained("LofiAmazon/BarcodeBERT-Entire-BOLD")
# The DNA sequence to embed.
# Separate the sequence into groups of 4 characters with single spaces.
# The sequence may contain unknown characters, i.e. anything other than A, C, G, or T.
# Its length, not counting spaces, should be at most 660 characters.
dna_sequence = "AACA ATGT ATTT A-T- TTCG CCCT TGTG AATT TATT ..."
inputs = tokenizer(dna_sequence, return_tensors="pt")
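# Obtain a DNA embedding, a vector of length 768, by averaging the last hidden layer over all tokens.
# The embedding is a representation of this DNA sequence in the model's latent space.
# Hidden states are only returned when output_hidden_states=True is passed.
embedding = model(**inputs, output_hidden_states=True).hidden_states[-1].mean(1).squeeze()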
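The input format described in the comments above (a space after every 4 characters, at most 660 characters excluding spaces) can be produced from a raw sequence string with a small helper. The following is a minimal sketch, not part of the model's own tooling: the function name `format_for_barcodebert` and its parameters are hypothetical, and truncating sequences longer than 660 characters is an assumption about how to handle over-length input.

```python
def format_for_barcodebert(sequence: str, k: int = 4, max_length: int = 660) -> str:
    """Split a raw DNA string into space-separated groups of k characters.

    Hypothetical helper: the name, defaults, and truncation policy are
    assumptions for illustration, not part of the published model.
    """
    # Drop any whitespace already present in the raw sequence.
    cleaned = "".join(sequence.split())
    # Assumption: over-length sequences are truncated to max_length characters.
    cleaned = cleaned[:max_length]
    # Join consecutive groups of k characters with single spaces.
    # If the length is not a multiple of k, the last group is shorter.
    return " ".join(cleaned[i:i + k] for i in range(0, len(cleaned), k))


dna_sequence = format_for_barcodebert("AACAATGTATTTA-T-TTCGCCCTTGTGAATTTATT")
print(dna_sequence)
```

The resulting string can be passed to the tokenizer in place of the hand-written `dna_sequence`. How the tokenizer treats a final group shorter than 4 characters is not stated in the model card, so it may be safer to check that case before relying on it.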
Results