---
license: mit
---

A [BarcodeBERT](https://arxiv.org/pdf/2311.02401) model trained on all complete DNA sequences from the latest [BOLD database release](http://www.boldsystems.org/index.php/datapackages/Latest). We used the `nucraw` column of DNA sequences and followed the preprocessing steps outlined in the BarcodeBERT paper.
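
The tokenizer in the usage example below expects the sequence split into 4-character chunks separated by single spaces, with the raw sequence capped at 660 characters. A minimal, illustrative sketch of that formatting step (the `format_sequence` helper is ours and assumes plain truncation; it is not part of the released preprocessing code):

```py
# Illustrative helper (not from the original repo): formats a raw DNA string
# into the space-separated 4-character layout the tokenizer expects.
def format_sequence(seq: str, k: int = 4, max_len: int = 660) -> str:
    seq = seq.upper()[:max_len]  # the model supports sequences up to 660 characters
    return " ".join(seq[i:i + k] for i in range(0, len(seq), k))

print(format_sequence("AACAATGTATTTA"))  # -> "AACA ATGT ATTT A"
```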

The model was trained for a total of 17 epochs.

## Example Usage

```py
import torch
from transformers import PreTrainedTokenizerFast, BertForMaskedLM

model = BertForMaskedLM.from_pretrained("LofiAmazon/BarcodeBERT-Entire-BOLD")
model.eval()

tokenizer = PreTrainedTokenizerFast.from_pretrained("LofiAmazon/BarcodeBERT-Entire-BOLD")

# The DNA sequence you want to embed.
# There should be a space after every 4 characters.
# The sequence may also contain characters other than A, C, T, G (e.g. '-' or 'N').
# The maximum DNA sequence length (not counting spaces) is 660 characters.
dna_sequence = "AACA ATGT ATTT A-T- TTCG CCCT TGTG AATT TATT ..."

inputs = tokenizer(dna_sequence, return_tensors="pt")

# Obtain a DNA embedding: a length-768 vector representing this sequence in
# the model's latent space. Passing output_hidden_states=True is required;
# without it, BertForMaskedLM returns hidden_states=None.
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)
embedding = outputs.hidden_states[-1].mean(1).squeeze()
```
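
These embeddings can be compared directly, for example with cosine similarity for nearest-neighbour search over barcodes. A minimal sketch reusing the model and tokenizer loaded above (the `embed` helper is our own wrapper, not part of the model card):

```py
import torch
import torch.nn.functional as F

def embed(seq: str) -> torch.Tensor:
    # Wraps the embedding steps above: tokenize, run the model, mean-pool.
    inputs = tokenizer(seq, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    return outputs.hidden_states[-1].mean(1).squeeze()

# Cosine similarity close to 1 means the two sequences sit near each
# other in the model's latent space.
sim = F.cosine_similarity(embed("AACA ATGT ATTT"), embed("AACA ATGT ATTA"), dim=0)
print(f"cosine similarity: {sim.item():.4f}")
```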

## Results

![image/png](https://cdn-uploads.huggingface.co/production/uploads/65ec809e794d34d1a4379f1f/LpXuOJn7CXR_UnA8sFmK1.png)