---
license: cc-by-nc-sa-4.0
widget:
- text: ACCTGA<mask>TTCTGAGTC
tags:
- DNA
- biology
- genomics
- segmentation
---

# segment-nt

SegmentNT is a segmentation model that leverages the [Nucleotide Transformer](https://huggingface.co/InstaDeepAI/nucleotide-transformer-v2-500m-multi-species) (NT) DNA foundation model to predict the location of several types of genomic elements in a sequence at single-nucleotide resolution. It was trained on 14 different classes of human genomic elements in input sequences of up to 30kb. These include gene elements (protein-coding genes, lncRNAs, 5’UTR, 3’UTR, exon, intron, splice acceptor and donor sites) and regulatory elements (polyA signal, tissue-invariant and tissue-specific promoters and enhancers, and CTCF-bound sites).

**Developed by:** [InstaDeep](https://huggingface.co/InstaDeepAI)

### Model Sources

- **Repository:** [Nucleotide Transformer](https://github.com/instadeepai/nucleotide-transformer)
- **Paper:** [Segmenting the genome at single-nucleotide resolution with DNA foundation models](https://www.biorxiv.org/content/biorxiv/early/2024/03/15/2024.03.14.584712.full.pdf)

### How to use

Until its next release, the `transformers` library needs to be installed from source with the following command in order to use the models:

```bash
pip install --upgrade git+https://github.com/huggingface/transformers.git
```

A small snippet of code is given here to retrieve both logits and embeddings from a dummy DNA sequence.

⚠️ The maximum sequence length is set by default to the training length of 30,000 nucleotides, or 5001 tokens (accounting for the CLS token). However, SegmentNT-multi-species has been shown to generalize to sequences of up to 50,000 bp. If you need to run inference on sequences between 30kbp and 50kbp, make sure to set the `rescaling_factor` of the Rotary Embedding layer in the esm model to `num_dna_tokens_inference / max_num_tokens_nt`, where `num_dna_tokens_inference` is the number of tokens at inference (i.e. 6669 for a sequence of 40008 base pairs) and `max_num_tokens_nt` is the maximum number of tokens on which the backbone nucleotide-transformer was trained, i.e. `2048`.
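
As a rough illustration of the arithmetic above, the sketch below computes the rescaling factor for a given sequence length. The helper name `compute_rescaling_factor` is hypothetical and not part of the model's API. It assumes a sequence whose length is a multiple of 6 with no `N` characters, and it counts the CLS token, as in the 6669-token example.

```python
def compute_rescaling_factor(
    sequence_length_bp: int,
    max_num_tokens_nt: int = 2048,
    kmer_size: int = 6,
) -> float:
    """Return num_dna_tokens_inference / max_num_tokens_nt for one sequence.

    Assumption: the sequence tokenizes into full 6-mers only, and the token
    count includes the prepended CLS token (40008 bp -> 6668 + 1 = 6669).
    """
    if sequence_length_bp % kmer_size != 0:
        raise ValueError("This sketch only handles lengths that are multiples of 6.")
    num_dna_tokens_inference = sequence_length_bp // kmer_size + 1
    return num_dna_tokens_inference / max_num_tokens_nt


rescaling_factor = compute_rescaling_factor(40008)
print(f"Rescaling factor for 40008 bp: {rescaling_factor:.4f}")
```

How the value is then passed to the Rotary Embedding layer depends on the model code; the notebook referenced in this section covers that step.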

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/#fileId=https%3A//huggingface.co/InstaDeepAI/segment_nt/blob/main/inference_segment_nt.ipynb)

The `./inference_segment_nt.ipynb` notebook can be run in Google Colab by clicking on the icon. It shows how to handle inference both on sequence lengths that require changing the rescaling factor and on sequence lengths that do not. Running the notebook reproduces Fig.1 and Fig.3 from the SegmentNT paper.

```python
# Load model and tokenizer
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/segment_nt", trust_remote_code=True)
model = AutoModel.from_pretrained("InstaDeepAI/segment_nt", trust_remote_code=True)

# Choose the length to which the input sequences are padded. By default, the
# model max length is chosen, but feel free to decrease it as the time taken to
# obtain the embeddings increases significantly with it.
# The number of DNA tokens (excluding the prepended CLS token) needs to be divisible by
# 2 to the power of the number of downsampling blocks, i.e. 4.
max_length = 12 + 1

assert (max_length - 1) % 4 == 0, (
    "The number of DNA tokens (excluding the prepended CLS token) needs to be divisible by "
    "2 to the power of the number of downsampling blocks, i.e. 4.")

# Create dummy DNA sequences and tokenize them
sequences = ["ATTCCGATTCCGATTCCG", "ATTTCTCTCTCTCTCTGAGATCGATCGATCGAT"]
tokens = tokenizer.batch_encode_plus(sequences, return_tensors="pt", padding="max_length", max_length=max_length)["input_ids"]

# Infer, masking out the padding tokens
attention_mask = tokens != tokenizer.pad_token_id
outs = model(
    tokens,
    attention_mask=attention_mask,
    output_hidden_states=True
)

# Obtain the logits over the genomic features
logits = outs.logits.detach()
# Transform them into probabilities
probabilities = torch.nn.functional.softmax(logits, dim=-1)
print(f"Probabilities shape: {probabilities.shape}")

# Get probabilities associated with intron
idx_intron = model.config.features.index("intron")
probabilities_intron = probabilities[:,:,idx_intron]
print(f"Intron probabilities shape: {probabilities_intron.shape}")
```
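
The padding constraint in the snippet above can also be computed instead of hard-coded. The following is a minimal sketch with hypothetical helper names (`count_dna_tokens`, `smallest_valid_max_length`). It assumes that sequences contain no `N` characters and that nucleotides left over after the last full 6-mer are tokenized individually.

```python
NUM_DOWNSAMPLING_BLOCKS = 2  # number of downsampling blocks in the segmentation head
KMER_SIZE = 6                # the tokenizer works on 6-mers


def count_dna_tokens(sequence: str) -> int:
    """Count DNA tokens, excluding CLS.

    Assumption: full 6-mers are read from the left and any leftover
    nucleotides become one token each; sequences contain no 'N'.
    """
    return len(sequence) // KMER_SIZE + len(sequence) % KMER_SIZE


def smallest_valid_max_length(sequences: list[str]) -> int:
    """Smallest padded length (CLS included) that satisfies the divisibility rule."""
    divisor = 2 ** NUM_DOWNSAMPLING_BLOCKS
    longest = max(count_dna_tokens(seq) for seq in sequences)
    padded = -(-longest // divisor) * divisor  # round up to a multiple of the divisor
    return padded + 1  # add the CLS token


sequences = ["ATTCCGATTCCGATTCCG", "ATTTCTCTCTCTCTCTGAGATCGATCGATCGAT"]
max_length = smallest_valid_max_length(sequences)
print(f"max_length: {max_length}")
```

The result can be passed as `max_length` to the tokenizer call; padding to the smallest valid length keeps inference time down for short batches.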

## Training data

The **segment-nt** model was trained on all human chromosomes except for chromosomes 20 and 21, kept as the test set, and chromosome 22, used as the validation set. During training, sequences are randomly sampled from the genome together with their annotations. In contrast, the sequences in the validation and test sets are kept fixed by using a sliding window of length 30,000 over the chromosomes 20 and 21. The validation set was used to monitor training and for early stopping.
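
To make the fixed evaluation windows concrete, here is a small sketch of how such windows could be enumerated. It is not the authors' data pipeline: the stride (set equal to the window length, so windows do not overlap), the chromosome names, and the 100,000 bp toy chromosome length are all assumptions for illustration.

```python
from typing import Iterator, Optional, Tuple

# Split described above; the "chrN" naming is an assumption.
TEST_CHROMOSOMES = ("chr20", "chr21")
VALIDATION_CHROMOSOMES = ("chr22",)
WINDOW_LENGTH = 30_000


def sliding_windows(
    chromosome_length: int,
    window_length: int = WINDOW_LENGTH,
    stride: Optional[int] = None,
) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) coordinates of fixed windows along a chromosome.

    Assumption: the stride defaults to the window length (non-overlapping
    windows); the source does not state the stride that was actually used.
    """
    stride = window_length if stride is None else stride
    for start in range(0, chromosome_length - window_length + 1, stride):
        yield start, start + window_length


# Toy chromosome of 100,000 bp (hypothetical length)
windows = list(sliding_windows(100_000))
print(windows)
```

Because the windows depend only on the chromosome length and the stride, the same evaluation sequences are produced on every run, unlike the random sampling used for training.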

## Training procedure

### Preprocessing

The DNA sequences are tokenized using the Nucleotide Transformer Tokenizer, which splits sequences into 6-mer tokens as described in the [Tokenization](https://github.com/instadeepai/nucleotide-transformer#tokenization-abc) section of the associated repository. This tokenizer has a vocabulary size of 4105. The inputs of the model then take the form:

```
<CLS> <ACGTGT> <ACGTGC> <ACGGAC> <GACTAG> <TCAGCA>
```
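
The input format above can be reproduced with a few lines of plain Python. This is only an illustrative sketch, not the actual tokenizer: `tokenize_6mers` is a hypothetical helper, it ignores `N` characters, and it assumes that nucleotides left over after the last full 6-mer become individual tokens.

```python
def tokenize_6mers(sequence: str, kmer_size: int = 6) -> list[str]:
    """Split a DNA string into a CLS token followed by non-overlapping 6-mers.

    Assumption: leftover nucleotides (when the length is not a multiple of 6)
    are emitted one by one; sequences containing 'N' are not handled.
    """
    tokens = ["<CLS>"]
    full_length = len(sequence) // kmer_size * kmer_size
    for start in range(0, full_length, kmer_size):
        tokens.append(sequence[start:start + kmer_size])
    tokens.extend(sequence[full_length:])  # leftover single nucleotides
    return tokens


tokens = tokenize_6mers("ACGTGTACGTGCACGGACGACTAGTCAGCA")
formatted = " ".join(tok if tok == "<CLS>" else f"<{tok}>" for tok in tokens)
print(formatted)

# There are 4**6 = 4096 possible 6-mers over A, C, G, T, so the rest of the
# 4105-entry vocabulary is taken up by other tokens.
print(4 ** 6)
```

For real inputs, use the tokenizer loaded with `AutoTokenizer.from_pretrained`, which also maps tokens to their vocabulary ids.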

### Training

The model was trained on a DGX H100 node with 8 GPUs on a total of 23B tokens for 3 days. Training proceeded on sequences of 3kb, 10kb, 20kb and finally 30kb, each time with an effective batch size of 256 sequences.

### Architecture

The model is composed of the [nucleotide-transformer-v2-500m-multi-species](https://huggingface.co/InstaDeepAI/nucleotide-transformer-v2-500m-multi-species) encoder, from which we removed the language model head and replaced it with a 1-dimensional U-Net segmentation head [4] made of 2 downsampling convolutional blocks and 2 upsampling convolutional blocks. Each of these blocks is made of 2 convolutional layers with 1,024 and 2,048 kernels respectively. This additional segmentation head accounts for 53 million parameters, bringing the total number of parameters to 562M.
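
The two downsampling blocks explain why the number of DNA tokens must be divisible by 4. The sketch below traces the sequence length through the head under the assumption that each downsampling block halves the length and each upsampling block doubles it; `unet_lengths` is a hypothetical helper, not part of the model code.

```python
def unet_lengths(num_dna_tokens: int, num_blocks: int = 2) -> list[int]:
    """Trace the sequence length through the down- and upsampling blocks.

    Assumption: every downsampling block halves the length and every
    upsampling block doubles it, so the input length must be divisible
    by 2 ** num_blocks for the output to match the input.
    """
    divisor = 2 ** num_blocks
    if num_dna_tokens % divisor != 0:
        raise ValueError(f"Number of DNA tokens must be divisible by {divisor}.")
    lengths = [num_dna_tokens]
    for _ in range(num_blocks):  # downsampling path
        lengths.append(lengths[-1] // 2)
    for _ in range(num_blocks):  # upsampling path
        lengths.append(lengths[-1] * 2)
    return lengths


# 30,000 bp -> 5,000 DNA tokens (CLS excluded)
print(unet_lengths(5000))
```

With a length that is not divisible by 4, the halving steps would round and the upsampled output would no longer line up with the input tokens.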

### BibTeX entry and citation info

```bibtex
@article{de2024segmentnt,
  title={SegmentNT: annotating the genome at single-nucleotide resolution with DNA foundation models},
  author={de Almeida, Bernardo P and Dalla-Torre, Hugo and Richard, Guillaume and Blum, Christopher and Hexemer, Lorenz and Gelard, Maxence and Pandey, Priyanka and Laurent, Stefan and Laterre, Alexandre and Lang, Maren and others},
  journal={bioRxiv},
  pages={2024--03},
  year={2024},
  publisher={Cold Spring Harbor Laboratory}
}
```