---
license: mit
pipeline_tag: mask-generation
tags:
  - biology
  - metagenomics
  - Roberta
---
### Leveraging Large Language Models for Metagenomic Analysis

**Model Overview:**
This model is based on the RoBERTa architecture and follows the approach described in our paper "Leveraging Large Language Models for Metagenomic Analysis." It was trained for one epoch on V100 GPUs.

**Model Architecture:**
- **Base Model:** RoBERTa transformer architecture
- **Tokenizer:** Custom k-mer tokenizer with a k-mer length of 6 and overlapping tokens (see the sketch below)
- **Training:** Trained on a diverse dataset of 220 million 400 bp fragments from 18,000 genomes (Bacteria and Archaea)
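
To make the tokenizer setting concrete, the snippet below is a minimal sketch of overlapping 6-mer splitting with a stride of 1. It is for illustration only; the actual `KmerTokenizer` additionally maps each k-mer to a vocabulary ID and pads or truncates to `maxlen`.

```python
def overlapping_kmers(sequence, k=6):
    # Slide a window of length k across the sequence with stride 1
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

print(overlapping_kmers("ATCGATGCAT"))
# ['ATCGAT', 'TCGATG', 'CGATGC', 'GATGCA', 'ATGCAT']
```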


**Steps to Use the Model:**

1. **Install KmerTokenizer:**

   ```sh
   pip install git+https://github.com/MsAlEhR/KmerTokenizer.git
   ```

2. **Example Code:**
   ```python
   from KmerTokenizer import KmerTokenizer
   from transformers import AutoModel
   import torch

   # Example gene sequence
   seq = "ATTTTTTTTTTTCCCCCCCCCCCGGGGGGGGATCGATGC"

   # Initialize the k-mer tokenizer (k = 6, overlapping tokens, max length 400)
   tokenizer = KmerTokenizer(kmerlen=6, overlapping=True, maxlen=400)
   tokenized_output = tokenizer.kmer_tokenize(seq)
   pad_token_id = 2  # Set pad token ID

   # Create attention mask (1 for tokens, 0 for padding)
   attention_mask = torch.tensor([1 if token != pad_token_id else 0 for token in tokenized_output], dtype=torch.long).unsqueeze(0)

   # Convert tokenized output to LongTensor and add batch dimension
   inputs = torch.tensor([tokenized_output], dtype=torch.long)

   # Load the pre-trained RoBERTa-based MetaBerta model
   model = AutoModel.from_pretrained("MsAlEhR/MetaBerta-400-fragments-18k-genome", output_hidden_states=True)

   # Generate hidden states
   outputs = model(input_ids=inputs, attention_mask=attention_mask)

   # Get embeddings from the last hidden state
   embeddings = outputs.hidden_states[-1]

   # Expand attention mask to match the embedding dimensions
   expanded_attention_mask = attention_mask.unsqueeze(-1)

   # Compute mean sequence embeddings over non-padding positions
   mean_sequence_embeddings = torch.sum(expanded_attention_mask * embeddings, dim=1) / torch.sum(expanded_attention_mask, dim=1)
   ```
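
The mean-pooled embedding can be used directly for downstream comparisons, for example measuring how close two fragments are in embedding space with cosine similarity. The sketch below reuses the `tokenizer`, `model`, and `pad_token_id` defined above; the `embed` helper and the second sequence are illustrative, not part of the package.

```python
import torch
import torch.nn.functional as F

def embed(sequence):
    # Tokenize and build the attention mask exactly as in the example above
    token_ids = tokenizer.kmer_tokenize(sequence)
    mask = torch.tensor([1 if t != pad_token_id else 0 for t in token_ids], dtype=torch.long).unsqueeze(0)
    ids = torch.tensor([token_ids], dtype=torch.long)
    with torch.no_grad():
        hidden = model(input_ids=ids, attention_mask=mask).hidden_states[-1]
    mask = mask.unsqueeze(-1).float()
    # Mean-pool over non-padding positions
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Cosine similarity between two example fragments
emb_a = embed("ATTTTTTTTTTTCCCCCCCCCCCGGGGGGGGATCGATGC")
emb_b = embed("GGGGGGGGATCGATGCATTTTTTTTTTTCCCCCCCCCCC")
print(f"Cosine similarity: {F.cosine_similarity(emb_a, emb_b).item():.3f}")
```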

**Citation:**
For a detailed overview of leveraging large language models for metagenomic analysis, refer to our papers:
> Refahi, M.S., Sokhansanj, B.A., & Rosen, G.L. (2023). Leveraging Large Language Models for Metagenomic Analysis. *IEEE SPMB*.
>
> Refahi, M., Sokhansanj, B.A., Mell, J.C., Brown, J., Yoo, H., Hearne, G., & Rosen, G. (2024). Scorpio: Enhancing Embeddings to Improve Downstream Analysis of DNA Sequences. *bioRxiv*.