
Leveraging Large Language Models for Metagenomic Analysis

Model Overview: This model uses the BigBird transformer architecture, optimized with the same approach applied to our RoBERTa-based models and adapted for long gene sequences. It is trained specifically on gene sequences, aims to uncover insights within metagenomic data, and is evaluated on tasks such as classification and sequence embedding.

Model Architecture:

  • Base Model: BigBird transformer architecture
  • Tokenizer: Custom k-mer tokenizer with a k-mer length of 6 and overlapping tokens (see the illustrative sketch after this list)
  • Training: Trained on a diverse dataset of 497 genes from 2000 bacterial and archaeal genomes
  • Embeddings: Generates sequence embeddings using both mean and max pooling of hidden states
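
For intuition, the snippet below sketches how overlapping k-mer tokenization with k = 6 splits a sequence. The naive_kmers helper is illustrative only and not part of the KmerTokenizer package; the actual tokenizer additionally maps each k-mer to a vocabulary ID and pads or truncates to a maximum length.

    def naive_kmers(seq, k=6):
        """Split a sequence into overlapping k-mers with stride 1."""
        return [seq[i:i + k] for i in range(len(seq) - k + 1)]

    print(naive_kmers("ATCGATGCAT", k=6))
    # ['ATCGAT', 'TCGATG', 'CGATGC', 'GATGCA', 'ATGCAT']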

Dataset: Details of the dataset will be shared in the supplementary materials of the paper. The dataset includes a comprehensive collection of gene sequences from various metagenomic sources.

Steps to Use the Model:

  1. Install KmerTokenizer: Install the KmerTokenizer package separately from its repository: KmerTokenizer Repository

  2. Example Code:

    from KmerTokenizer import KmerTokenizer
    from transformers import AutoModel
    import torch 
    
    # Example gene sequence
    seq_list = ["ATTTTTTTTTTTCCCCCCCCCCCGGGGGGGGATCGATGC"]
    
    # Initialize the tokenizer
    tokenizer = KmerTokenizer(kmerlen=6, overlapping=True, maxlen=4096)
    tokenized_output = tokenizer.kmer_tokenize(seq_list)
    
    # Convert tokenized output to tensor
    inputs = torch.tensor(tokenized_output)
    
    # Load the pre-trained BigBird model
    model = AutoModel.from_pretrained("MsAlEhR/MetaBERTa-bigbird-gene", output_hidden_states=True)
    
    # Generate hidden states (inference only, so no gradients are needed)
    with torch.no_grad():
        outputs = model(inputs)
    hidden_states = outputs.hidden_states  # tuple of per-layer outputs, each (batch, seq_len, hidden_dim)
    
    # Compute mean and max pooling of the last layer's hidden states over the sequence dimension
    embedding_mean = torch.mean(hidden_states[-1], dim=1)
    embedding_max = torch.max(hidden_states[-1], dim=1).values
    

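The pooled embeddings can be used directly for downstream tasks such as similarity search. The sketch below continues from the example above and reuses the tokenizer and model objects; the embed helper and the second sequence are illustrative, not part of the package.

    import torch.nn.functional as F

    def embed(seqs):
        """Mean-pool the last hidden layer into one vector per sequence."""
        ids = torch.tensor(tokenizer.kmer_tokenize(seqs))
        with torch.no_grad():
            out = model(ids)  # hidden states are returned because output_hidden_states=True
        return out.hidden_states[-1].mean(dim=1)  # shape: (batch, hidden_dim)

    emb_a = embed(["ATTTTTTTTTTTCCCCCCCCCCCGGGGGGGGATCGATGC"])
    emb_b = embed(["ATCGATGCATCGATCGTTTTCCCCGGGGATCGATGC"])
    similarity = F.cosine_similarity(emb_a, emb_b)  # one value per pair, in [-1, 1]
    print(similarity.item())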
Citation: For a detailed overview of leveraging large language models for metagenomic analysis, refer to our paper:

Refahi, M.S., Sokhansanj, B.A., & Rosen, G.L. (Year). Leveraging Large Language Models for Metagenomic Analysis. IEEE.
