# Genomic_context_bert

This model is a BERT model pre-trained from scratch on bacterial genomes.
## Model description
The model is based on the BERT-base architecture and was pre-trained with the following configuration: 12 hidden layers, a hidden size of 512, and 8 attention heads. Pre-training used self-supervised masked language modeling (MLM) as the objective, with a 20% token masking probability. The model was trained on approximately 30,000 bacterial genomes using 8 Tesla V100-SXM2-32GB GPUs over a 24-hour period. This configuration enables the model to learn contextual embeddings that capture information from the genomic neighborhood of genes, providing a foundation for downstream analyses.
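For reference, the architecture described above can be expressed as a `transformers.BertConfig`. This is a minimal sketch, not the exact pre-training script: values not stated in this card (vocabulary size, intermediate size) are placeholders, and the tokenizer is assumed to be the one published with this checkpoint.

```python
from transformers import (
    AutoTokenizer,
    BertConfig,
    BertForMaskedLM,
    DataCollatorForLanguageModeling,
)

# Architecture stated above: 12 hidden layers, hidden size 512, 8 attention heads.
config = BertConfig(
    num_hidden_layers=12,
    hidden_size=512,
    num_attention_heads=8,
    intermediate_size=2048,   # assumption: 4 x hidden_size, the usual BERT ratio
    vocab_size=30_000,        # assumption: set by the genomic-context vocabulary
)
model = BertForMaskedLM(config)

# MLM objective with a 20% token masking probability, as described above.
tokenizer = AutoTokenizer.from_pretrained("Dauka-transformers/interpro_bert_2")
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.2
)
```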
## Intended uses & limitations
The model is a BERT-base architecture pre-trained from scratch on approximately 30,000 bacterial genomes. Its primary intended use is to generate contextual embeddings of bacterial proteins based on the genomic neighborhood of the gene encoding each protein. These embeddings capture contextual information from the surrounding genomic sequences, which may reflect functional or biological signals.

The main limitation of this model is that it has been pre-trained exclusively on bacterial genomes and has no fine-tuned classification head. Consequently, it cannot perform tasks such as functional prediction or classification out of the box. Instead, it serves as a tool for generating contextual representations, which may yield useful functional insights in downstream applications when paired with additional training or analysis.
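The snippet below is a minimal sketch of extracting such contextual embeddings with the Transformers `AutoModel` API, using the checkpoint id `Dauka-transformers/interpro_bert_2`. The exact way a genomic neighborhood is encoded into input tokens is determined by the model's own tokenizer and is not documented here, so the `neighborhood` string is purely illustrative.

```python
import torch
from transformers import AutoTokenizer, AutoModel

repo_id = "Dauka-transformers/interpro_bert_2"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModel.from_pretrained(repo_id)
model.eval()

# Hypothetical space-separated token sequence representing the genes
# surrounding the protein of interest.
neighborhood = "tokA tokB tokC tokD tokE"
inputs = tokenizer(neighborhood, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Per-token contextual embeddings: (batch, sequence_length, hidden_size=512).
token_embeddings = outputs.last_hidden_state
# One simple way to get a single vector per neighborhood: mean-pool over tokens.
neighborhood_embedding = token_embeddings.mean(dim=1)
```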
## Training and evaluation data
More information needed
## Training procedure

### Training hyperparameters
The following hyperparameters were used during training (see the sketch after this list for how they map onto `TrainingArguments`):
- learning_rate: 0.0001
- train_batch_size: 256
- eval_batch_size: 128
- seed: 42
- distributed_type: multi-GPU
- num_devices: 8
- total_train_batch_size: 2048
- total_eval_batch_size: 1024
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 15
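As a hedged sketch, the values listed above map onto `transformers.TrainingArguments` roughly as follows. The output directory is an assumption (it is not stated in the card), and with 8 GPUs the per-device batch sizes of 256/128 yield the stated totals of 2048/1024.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="genomic_context_bert",  # assumption: not stated in the card
    learning_rate=1e-4,
    per_device_train_batch_size=256,
    per_device_eval_batch_size=128,
    num_train_epochs=15,
    lr_scheduler_type="linear",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    seed=42,
)
```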
### Training results
| Training Loss | Epoch | Step   | Validation Loss |
|:-------------:|:-----:|:------:|:---------------:|
| 1.2702        | 1.0   | 14395  | 1.1699          |
| 0.9079        | 2.0   | 28790  | 0.8665          |
| 0.7738        | 3.0   | 43185  | 0.7505          |
| 0.6959        | 4.0   | 57580  | 0.6820          |
| 0.6327        | 5.0   | 71975  | 0.6302          |
| 0.5899        | 6.0   | 86370  | 0.5956          |
| 0.5462        | 7.0   | 100765 | 0.5654          |
| 0.5155        | 8.0   | 115160 | 0.5395          |
| 0.4836        | 9.0   | 129555 | 0.5149          |
| 0.4633        | 10.0  | 143950 | 0.4984          |
| 0.441         | 11.0  | 158345 | 0.4774          |
| 0.4212        | 12.0  | 172740 | 0.4641          |
| 0.404         | 13.0  | 187135 | 0.4479          |
| 0.3883        | 14.0  | 201530 | 0.4401          |
| 0.3781        | 15.0  | 215925 | 0.4333          |
### Framework versions
- Transformers 4.39.2
- Pytorch 2.2.2+cu121
- Datasets 2.18.0
- Tokenizers 0.15.2