yarongef
/

DistilProtBert

protein language model


	---
	license: mit
	language: protein
	tags:
	- protein language model
	datasets:
	- Uniref50
	---

	# DistilProtBert model

	DistilProtBert is a distilled version of the [ProtBert](https://huggingface.co/Rostlab/prot_bert) protein language model.
	In addition to cross entropy and cosine teacher-student losses, DistilProtBert was pretrained with a masked language modeling (MLM) objective. It only works with amino acids written as capital letters.
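
To make the three loss terms concrete, here is a minimal pure-Python sketch for a single masked position. It is illustrative only: the function names, the temperature, and the equal default weights are assumptions, not the values or code used to train DistilProtBert.

```python
import math

def softmax(logits, temperature=1.0):
    # Numerically stable softmax with optional temperature scaling.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_cross_entropy(student_logits, teacher_logits, temperature=1.0):
    # Cross entropy between the teacher's and the student's output distributions.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))

def mlm_loss(student_logits, target_index):
    # Standard MLM term: negative log-likelihood of the true token.
    return -math.log(softmax(student_logits)[target_index])

def cosine_loss(student_hidden, teacher_hidden):
    # 1 - cosine similarity between student and teacher hidden vectors.
    dot = sum(a * b for a, b in zip(student_hidden, teacher_hidden))
    norm_s = math.sqrt(sum(a * a for a in student_hidden))
    norm_t = math.sqrt(sum(b * b for b in teacher_hidden))
    return 1.0 - dot / (norm_s * norm_t)

def distillation_loss(student_logits, teacher_logits, target_index,
                      student_hidden, teacher_hidden,
                      w_ce=1.0, w_mlm=1.0, w_cos=1.0, temperature=1.0):
    # Weighted sum of the three terms; the weights are hypothetical placeholders.
    return (w_ce * soft_cross_entropy(student_logits, teacher_logits, temperature)
            + w_mlm * mlm_loss(student_logits, target_index)
            + w_cos * cosine_loss(student_hidden, teacher_hidden))
```

In practice these terms would be computed on batched tensors in a deep learning framework; the scalar version above only shows how the pieces combine.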

	# Model description

	DistilProtBert was pretrained on millions of protein sequences.

	A few important differences between DistilProtBert and the original ProtBert are:
	1. The size of the model
	2. The size of the pretraining dataset
	3. Time & hardware used for pretraining

	## Intended uses & limitations

	The model can be used for protein feature extraction or fine-tuned on downstream tasks.

	### How to use

	The model can be used in the same way as ProtBert.
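
As a rough illustration of feature extraction, the sketch below follows the conventions documented for ProtBert: uppercase, space-separated residues, with the rare amino acids U, Z, O and B mapped to X. It assumes that `transformers` and `torch` are installed, that the weights load with `BertModel`, and that ProtBert's tokenizer from `Rostlab/prot_bert` is the right one to pair with this model. The helper names `prepare_sequence` and `embed` are hypothetical.

```python
import re

def prepare_sequence(seq: str) -> str:
    # Uppercase the sequence, map rare residues (U, Z, O, B) to X,
    # and separate residues with spaces, as ProtBert's tokenizer expects.
    seq = seq.strip().upper()
    seq = re.sub(r"[UZOB]", "X", seq)
    return " ".join(seq)

def embed(seq: str,
          model_id: str = "yarongef/DistilProtBert",
          tokenizer_id: str = "Rostlab/prot_bert"):
    # Imported lazily so the preprocessing helper works without these packages.
    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained(tokenizer_id, do_lower_case=False)
    model = BertModel.from_pretrained(model_id)
    model.eval()

    inputs = tokenizer(prepare_sequence(seq), return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Drop the [CLS] and [SEP] positions to keep one vector per residue.
    return outputs.last_hidden_state[0, 1:-1]

if __name__ == "__main__":
    features = embed("MKTAYIAKQRQISFVKSHFSRQ")
    print(features.shape)
```

Averaging the per-residue vectors over the sequence dimension is one common way to obtain a single fixed-size embedding per protein.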

	## Training data

	DistilProtBert was pretrained on [Uniref50](https://www.uniprot.org/downloads), a dataset consisting of ~43 million protein sequences (only sequences between 20 and 512 amino acids long were used).
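
The length filter can be sketched as follows. This is a minimal illustration, assuming both bounds are inclusive; the function name is hypothetical and the actual preprocessing script is not part of this card.

```python
def filter_by_length(sequences, min_len=20, max_len=512):
    # Keep only sequences whose length lies within [min_len, max_len].
    return [s for s in sequences if min_len <= len(s) <= max_len]

if __name__ == "__main__":
    examples = ["M" * 10, "M" * 20, "M" * 512, "M" * 513]
    kept = filter_by_length(examples)
    print([len(s) for s in kept])
```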

	# Pretraining procedure

	Preprocessing was done using ProtBert's tokenizer.
	The masking procedure for each sequence followed the original BERT (as described in the [ProtBert](https://huggingface.co/Rostlab/prot_bert) model card).
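
For readers unfamiliar with that scheme, the original BERT selects about 15% of the tokens for prediction; of those, 80% are replaced with `[MASK]`, 10% with a random token, and 10% are left unchanged. The sketch below illustrates the idea on a list of residues. It is a simplification: it selects each position independently rather than sampling a fixed number of positions, and the function name and 20-letter vocabulary are assumptions, not the pretraining code.

```python
import random

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")

def mask_tokens(tokens, mask_prob=0.15, rng=None):
    # Returns (masked_tokens, labels); labels hold the original token at
    # positions chosen for prediction and None everywhere else.
    if rng is None:
        rng = random.Random()
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = "[MASK]"                   # 80%: replace with mask token
        elif roll < 0.9:
            masked[i] = rng.choice(AMINO_ACIDS)    # 10%: replace with random residue
        # remaining 10%: keep the original token
    return masked, labels

if __name__ == "__main__":
    residues = list("MKTAYIAKQRQISFVKSHFSRQ")
    masked, labels = mask_tokens(residues, rng=random.Random(0))
    print(" ".join(masked))
```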