|
---
license: mit
language: protein
tags:
- protein language model
datasets:
- Uniref50
---
|
|
|
# DistilProtBert |
|
|
|
A distilled version of the [ProtBert-UniRef100](https://huggingface.co/Rostlab/prot_bert) model.

In addition to the cross-entropy and cosine teacher-student distillation losses, DistilProtBert was pretrained with a masked language modeling (MLM) objective. Note that it works only with uppercase amino acid letters.
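The card does not include the loss implementation itself; as a rough sketch (not the authors' code), a DistilBERT-style combined objective of this kind can be written as below. The `temperature` and `alpha_*` weights are illustrative placeholders, not values from the paper; see the Git repository for the actual training code.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits,
                      student_hidden, teacher_hidden, labels,
                      temperature=2.0, alpha_mlm=1.0, alpha_ce=1.0, alpha_cos=1.0):
    """DistilBERT-style combined objective (weights and temperature are placeholders)."""
    vocab = student_logits.size(-1)

    # 1) MLM cross-entropy on the student's own predictions
    #    (labels use -100 for unmasked positions, following Hugging Face conventions).
    mlm_loss = F.cross_entropy(student_logits.reshape(-1, vocab),
                               labels.reshape(-1), ignore_index=-100)

    # 2) Soft-target loss between student and teacher distributions at a
    #    distillation temperature (KL divergence, as in the DistilBERT reference code).
    ce_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # 3) Cosine loss aligning student and teacher hidden states.
    flat_student = student_hidden.reshape(-1, student_hidden.size(-1))
    flat_teacher = teacher_hidden.reshape(-1, teacher_hidden.size(-1))
    target = torch.ones(flat_student.size(0), device=flat_student.device)
    cos_loss = F.cosine_embedding_loss(flat_student, flat_teacher, target)

    return alpha_mlm * mlm_loss + alpha_ce * ce_loss + alpha_cos * cos_loss
```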
|
|
|
See the [Git](https://github.com/yarongef/DistilProtBert) repository for more details.
|
|
|
# Model details |
|
| **Model** | **# of parameters** | **# of hidden layers** | **Pretraining dataset** | **# of proteins** | **Pretraining hardware** |
|:--------------:|:-------------------:|:----------------------:|:-----------------------:|:------------------------------:|:------------------------:|
| ProtBert | 420M | 30 | UniRef100 | 216M | 512 16GB TPUs |
| DistilProtBert | 230M | 15 | UniRef50 | 43M | 5 V100 32GB GPUs |
|
|
|
## Intended uses & limitations |
|
|
|
The model can be used for protein feature extraction or fine-tuned on downstream tasks.
|
|
|
### How to use |
|
|
|
The model is used in the same way as ProtBert, with ProtBert's tokenizer.
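A minimal usage sketch with the Transformers library, following ProtBert's input convention (uppercase amino acids separated by spaces, with the rare residues U, Z, O, B mapped to X). The checkpoint identifier `yarongef/DistilProtBert` is assumed here; substitute the identifier under which the weights are actually hosted.

```python
import re
import torch
from transformers import BertModel, BertTokenizer

# Assumed checkpoint identifier; adjust to wherever the weights are hosted.
model_name = "yarongef/DistilProtBert"

tokenizer = BertTokenizer.from_pretrained(model_name, do_lower_case=False)
model = BertModel.from_pretrained(model_name)
model.eval()

# ProtBert-style input: uppercase amino acids separated by spaces,
# with the rare residues U, Z, O, B mapped to X.
sequence = "M K T A Y I A K Q R Q I S F V K S H F S R Q L E E R L G L I E V Q"
sequence = re.sub(r"[UZOB]", "X", sequence)

inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Per-residue embeddings, including the [CLS] and [SEP] special tokens.
embeddings = outputs.last_hidden_state
print(embeddings.shape)  # (1, number_of_tokens, hidden_size)
```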
|
|
|
## Training data |
|
|
|
DistilProtBert was pretrained on [UniRef50](https://www.uniprot.org/downloads), a dataset consisting of ~43 million protein sequences (only sequences between 20 and 512 amino acids long were used).
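For illustration only, a length filter matching the description above could look like the sketch below; Biopython is assumed purely for FASTA parsing, and the file name is hypothetical. The actual preprocessing code is not part of this card.

```python
from Bio import SeqIO  # Biopython is assumed here purely for FASTA parsing

def filter_by_length(fasta_path, min_len=20, max_len=512):
    """Yield records whose sequence length falls within [min_len, max_len]."""
    for record in SeqIO.parse(fasta_path, "fasta"):
        if min_len <= len(record.seq) <= max_len:
            yield record

# Hypothetical file name; count how many UniRef50 sequences pass the filter.
# kept = sum(1 for _ in filter_by_length("uniref50.fasta"))
```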
|
|
|
# Pretraining procedure |
|
|
|
Preprocessing was done using ProtBert's tokenizer. |
|
The masking procedure for each sequence followed the original BERT recipe (as described for [ProtBert](https://huggingface.co/Rostlab/prot_bert)).
|
|
|
The model was pretrained on a single DGX cluster for 3 epochs in total. The local batch size was 16, the optimizer was AdamW with a learning rate of 5e-5, and mixed-precision training was used.
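As a hedged sketch (not the authors' training script), this setup maps onto the Transformers `Trainer` API roughly as follows. The tokenized UniRef50 dataset and the distillation-specific losses are omitted, and the student is shown only as a 15-layer BERT configuration; the output directory is a hypothetical path.

```python
from transformers import (
    BertConfig,
    BertForMaskedLM,
    BertTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# ProtBert's tokenizer was used for preprocessing.
tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)

# A 15-layer student configuration derived from the teacher's config.
# (In practice the student would also be initialised from teacher weights; omitted here.)
student_config = BertConfig.from_pretrained("Rostlab/prot_bert", num_hidden_layers=15)
model = BertForMaskedLM(student_config)

# BERT-style dynamic masking: 15% of tokens, with the usual 80/10/10 replacement scheme.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir="distilprotbert-pretraining",  # hypothetical output path
    num_train_epochs=3,
    per_device_train_batch_size=16,  # the "local batch size" of 16
    learning_rate=5e-5,              # Trainer's default optimizer is AdamW
    fp16=True,                       # mixed precision
)

# `train_dataset` would be the tokenized UniRef50 sequences (not shown here):
# trainer = Trainer(model=model, args=training_args,
#                   data_collator=data_collator, train_dataset=train_dataset)
# trainer.train()
```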
|
|
|
## Evaluation results |
|
|
|
When fine-tuned on downstream tasks, this model achieves the following results: |
|
|
|
| Task/Dataset | Secondary structure (3-state) | Membrane |
|:-----:|:-----:|:-----:|
| CASP12 | 72 | |
| TS115 | 81 | |
| CB513 | 79 | |
| DeepLoc | | 86 |
|
|
|
Distinguishing between real proteins and their k-let shuffled versions (an illustrative shuffling sketch follows the tables below):
|
|
|
_Singlet_ |
|
|
|
| Model | AUC |
|:--------------:|:-------:|
| LSTM | 0.71 |
| ProtBert | 0.93 |
| DistilProtBert | 0.92 |
|
|
|
_Doublet_ |
|
|
|
| Model | AUC |
|:--------------:|:-------:|
| LSTM | 0.68 |
| ProtBert | 0.92 |
| DistilProtBert | 0.91 |
|
|
|
_Triplet_ |
|
|
|
| Model | AUC |
|:--------------:|:-------:|
| LSTM | 0.61 |
| ProtBert | 0.92 |
| DistilProtBert | 0.87 |
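For context on the task above: a singlet shuffle simply permutes the residues of a sequence (preserving only its amino acid composition), whereas doublet and triplet shuffles must preserve 2-mer and 3-mer counts and therefore require a dedicated k-let-preserving shuffling algorithm. The snippet below sketches the singlet case only, with a made-up example sequence; it is illustrative rather than the evaluation code from the repository.

```python
import random

def singlet_shuffle(sequence, seed=None):
    """Randomly permute the residues of a protein sequence (singlet shuffle).

    This preserves only the amino acid composition; doublet/triplet shuffles,
    which preserve 2-mer/3-mer counts, need a k-let-preserving algorithm instead.
    """
    rng = random.Random(seed)
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)

# Hypothetical example sequence (not from the evaluation set).
print(singlet_shuffle("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", seed=0))
```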