---
license: apache-2.0
tags:
- pretrained
- mistral
- DNA
- bacteriophage
- biology
- genomics
---

# Model Card for Mistral-DNA-v1-138M-bacteriophage (Mistral for DNA)

The Mistral-DNA-v1-138M-bacteriophage Large Language Model (LLM) is a pretrained generative DNA text model with 138.5M parameters (8 experts × 17.31M parameters each).
It is derived from the Mistral-7B-v0.1 model, simplified for DNA: the number of layers and the hidden size were reduced.
The model was pretrained on 30,405 bacteriophage genomes longer than 10 kb.

We used the "RefSeq Phage FASTA File" database from https://phagescope.deepomics.org/download.

For full details of this model, please read our [GitHub repo](https://github.com/raphaelmourad/Mistral-DNA).
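
Because this is a generative (causal) DNA model, it can in principle be sampled to produce new sequences. Below is a minimal sketch, assuming the checkpoint can be loaded with a causal language-modeling head via `AutoModelForCausalLM` (this card itself only shows `AutoModel`, so treat the class choice and the sampling settings as assumptions):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: the checkpoint exposes a causal LM head via AutoModelForCausalLM.
tokenizer = AutoTokenizer.from_pretrained("RaphaelMourad/Mistral-DNA-v1-138M-bacteriophage", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("RaphaelMourad/Mistral-DNA-v1-138M-bacteriophage", trust_remote_code=True)

# Prime the model with a short DNA prefix and sample a continuation.
inputs = tokenizer("TGATGATTGGCGCGGCTAGGATCGGCT", return_tensors="pt")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_k=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```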

## Model Architecture

Like Mistral-7B-v0.1, it is a transformer model with the following architecture choices (see the config-inspection sketch after this list):
- Grouped-Query Attention
- Sliding-Window Attention
- Byte-fallback BPE tokenizer
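
To verify how the architecture was scaled down relative to Mistral-7B-v0.1, you can inspect the checkpoint's configuration. A minimal sketch, assuming the config exposes the standard Hugging Face attribute names (`num_hidden_layers`, `hidden_size`); print the full config if these differ for this custom checkpoint:

```python
from transformers import AutoConfig

# Assumption: standard Mistral-style attribute names on the config object.
config = AutoConfig.from_pretrained("RaphaelMourad/Mistral-DNA-v1-138M-bacteriophage", trust_remote_code=True)
print("layers:", config.num_hidden_layers)
print("hidden size:", config.hidden_size)
```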
29
+ ## Load the model from huggingface:
30
+
31
+ ```
32
+ import torch
33
+ from transformers import AutoTokenizer, AutoModel
34
+
35
+ tokenizer = AutoTokenizer.from_pretrained("RaphaelMourad/Mistral-DNA-v1-138M-bacteriophage", trust_remote_code=True) # Same as DNABERT2
36
+ model = AutoModel.from_pretrained("RaphaelMourad/Mistral-DNA-v1-138M-bacteriophage", trust_remote_code=True)
37
+ ```
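
To see how the byte-fallback BPE tokenizer segments a nucleotide string into multi-base tokens, you can tokenize a short sequence directly (a minimal sketch; the exact token boundaries depend on the learned vocabulary):

```python
# Inspect how the BPE tokenizer splits a DNA string into tokens.
dna = "TGATGATTGGCGCGGCTAGGATCGGCT"
print(tokenizer.tokenize(dna))      # learned subsequence tokens
print(tokenizer(dna)["input_ids"])  # corresponding vocabulary ids
```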

## Calculate the embedding of a DNA sequence

```python
dna = "TGATGATTGGCGCGGCTAGGATCGGCT"
inputs = tokenizer(dna, return_tensors="pt")["input_ids"]
hidden_states = model(inputs)[0]  # shape: [1, sequence_length, 256]

# Embedding with max pooling over the sequence dimension
embedding_max = torch.max(hidden_states[0], dim=0)[0]
print(embedding_max.shape)  # expected: torch.Size([256])
```
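
Mean pooling is a common alternative when every position should contribute to the embedding; a minimal sketch reusing `hidden_states` from above:

```python
# Embedding with mean pooling over the sequence dimension
embedding_mean = torch.mean(hidden_states[0], dim=0)
print(embedding_mean.shape)  # expected: torch.Size([256])
```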

## Troubleshooting

Ensure you are using a stable version of Transformers, 4.34.0 or newer.
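
A quick way to check the installed version (a minimal sketch; `packaging` ships as a dependency of Transformers):

```python
from packaging import version
import transformers

# Fail early if the installed Transformers is older than 4.34.0.
assert version.parse(transformers.__version__) >= version.parse("4.34.0"), transformers.__version__
```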

## Notice

Mistral-DNA-v1-138M-bacteriophage is a pretrained base model for bacteriophage genomes.

## Contact

Raphaël Mourad. raphael.mourad@univ-tlse3.fr