---
license: mit
pipeline_tag: mask-generation
tags:
- biology
- metagenomics
- Roberta
---
### Leveraging Large Language Models for Metagenomic Analysis
**Model Overview:**
This model is based on the RoBERTa architecture and follows the approach described in our paper "Leveraging Large Language Models for Metagenomic Analysis." It was trained for one epoch on NVIDIA V100 GPUs.
**Model Architecture:**
- **Base Model:** RoBERTa transformer architecture
- **Tokenizer:** Custom K-mer tokenizer with a k-mer length of 6 and overlapping tokens (see the sketch after this list)
- **Training:** Trained on a diverse dataset of 220 million 400 bp fragments from 18k genomes (Bacteria and Archaea)
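With overlapping tokenization, a window of length 6 slides one base at a time across the sequence, so adjacent k-mers share 5 bases. The snippet below is a minimal illustrative sketch of that decomposition only; the actual KmerTokenizer additionally maps each k-mer to a vocabulary ID and pads to `maxlen`.

```python
# Illustrative sketch only: overlapping 6-mers with stride 1.
# The real KmerTokenizer also converts k-mers to vocabulary IDs and pads to maxlen.
def overlapping_kmers(seq: str, k: int = 6):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

print(overlapping_kmers("ATCGATGC"))
# ['ATCGAT', 'TCGATG', 'CGATGC']
```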
**Steps to Use the Model:**
1. **Install KmerTokenizer:**
```sh
pip install git+https://github.com/MsAlEhR/KmerTokenizer.git
```
2. **Example Code:**
```python
from KmerTokenizer import KmerTokenizer
from transformers import AutoModel
import torch
# Example gene sequence
seq = "ATTTTTTTTTTTCCCCCCCCCCCGGGGGGGGATCGATGC"
# Initialize the tokenizer
tokenizer = KmerTokenizer(kmerlen=6, overlapping=True, maxlen=400)
tokenized_output = tokenizer.kmer_tokenize(seq)
pad_token_id = 2 # Set pad token ID
# Create attention mask (1 for tokens, 0 for padding)
attention_mask = torch.tensor([1 if token != pad_token_id else 0 for token in tokenized_output], dtype=torch.long).unsqueeze(0)
# Convert tokenized output to LongTensor and add batch dimension
inputs = torch.tensor([tokenized_output], dtype=torch.long)
# Load the pre-trained MetaBERTa (RoBERTa-based) model
model = AutoModel.from_pretrained("MsAlEhR/MetaBerta-400-fragments-18k-genome", output_hidden_states=True)
# Generate hidden states
outputs = model(input_ids=inputs, attention_mask=attention_mask)
# Get embeddings from the last hidden state
embeddings = outputs.hidden_states[-1]
# Expand attention mask to match the embedding dimensions
expanded_attention_mask = attention_mask.unsqueeze(-1)
# Compute mean sequence embeddings
mean_sequence_embeddings = torch.sum(expanded_attention_mask * embeddings, dim=1) / torch.sum(expanded_attention_mask, dim=1)
```
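The resulting `mean_sequence_embeddings` tensor has shape `(1, hidden_size)` and serves as a fixed-length representation of the fragment. As a hedged extension of the snippet above, assuming `kmer_tokenize` pads every sequence to `maxlen` (as the pad-token handling above implies), several fragments can be embedded in one batch:

```python
# Hedged batching sketch: assumes kmer_tokenize returns fixed-length (maxlen) ID lists.
seqs = [
    "ATTTTTTTTTTTCCCCCCCCCCCGGGGGGGGATCGATGC",
    "GGGGCCCCAAAATTTTGGGGCCCCAAAATTTT",
]
batch = torch.tensor([tokenizer.kmer_tokenize(s) for s in seqs], dtype=torch.long)
batch_mask = (batch != pad_token_id).long()

with torch.no_grad():  # inference only, no gradients needed
    out = model(input_ids=batch, attention_mask=batch_mask)

emb = out.hidden_states[-1]      # (batch, maxlen, hidden_size)
mask = batch_mask.unsqueeze(-1)  # (batch, maxlen, 1)
batch_embeddings = (mask * emb).sum(dim=1) / mask.sum(dim=1)  # (batch, hidden_size)
```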
**Citation:**
For a detailed overview of leveraging large language models for metagenomic analysis, refer to our papers:
> Refahi, M.S., Sokhansanj, B.A., & Rosen, G.L. (2023). Leveraging Large Language Models for Metagenomic Analysis. *IEEE SPMB*.
>
> Refahi, M., Sokhansanj, B.A., Mell, J.C., Brown, J., Yoo, H., Hearne, G., & Rosen, G. (2024). Scorpio: Enhancing Embeddings to Improve Downstream Analysis of DNA Sequences. *bioRxiv*.