yarongef
/

DistilProtBert

	---
	license: mit
	language: protein
	tags:
	- protein language model
	datasets:
	- Uniref50
	---

	# DistilProtBert model

	A distilled version of the [ProtBert](https://huggingface.co/Rostlab/prot_bert) protein language model.
	DistilProtBert was pretrained with a masked language modeling (MLM) objective in addition to cross-entropy and cosine teacher-student losses. It only works with uppercase amino acid letters.
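
To make the three training signals named above more concrete, here is a minimal sketch for a single masked position, written in plain Python. The equal loss weights, the temperature, and all function names are assumptions made for illustration; the actual training configuration is not described in this card.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_cross_entropy(student_logits, teacher_logits, temperature=1.0):
    """Cross entropy of the student distribution against the teacher's soft targets."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))

def mlm_cross_entropy(student_logits, true_index):
    """Standard MLM loss: negative log-probability of the true (masked) residue."""
    return -math.log(softmax(student_logits)[true_index])

def cosine_loss(student_hidden, teacher_hidden):
    """1 - cosine similarity between student and teacher hidden vectors."""
    dot = sum(s * t for s, t in zip(student_hidden, teacher_hidden))
    norm_s = math.sqrt(sum(s * s for s in student_hidden))
    norm_t = math.sqrt(sum(t * t for t in teacher_hidden))
    return 1.0 - dot / (norm_s * norm_t)

def distillation_loss(student_logits, teacher_logits, true_index,
                      student_hidden, teacher_hidden,
                      weights=(1.0, 1.0, 1.0), temperature=1.0):
    """Weighted sum of the three losses (weights are hypothetical)."""
    w_ce, w_mlm, w_cos = weights
    return (w_ce * soft_cross_entropy(student_logits, teacher_logits, temperature)
            + w_mlm * mlm_cross_entropy(student_logits, true_index)
            + w_cos * cosine_loss(student_hidden, teacher_hidden))
```

In a real training loop these terms would be computed over batches of tensors rather than Python lists; the sketch only shows how the three terms relate to each other.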

	# Model description

	DistilProtBert was pretrained on millions of protein sequences.
	It was pretrained on raw protein sequences only, with no human labelling of any kind, which is why it can use large amounts of publicly available data. Inputs and labels were generated from the protein sequences by an automatic process.
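
As an illustration of how inputs and labels can be generated automatically from a raw sequence, the sketch below masks residues at random and keeps the original residue as the label. The 15% masking rate, the `[MASK]` token string, and the function name `mask_sequence` are assumptions borrowed from common BERT-style setups, not details confirmed by this card.

```python
import random

MASK_TOKEN = "[MASK]"  # assumed mask token string

def mask_sequence(sequence, mask_prob=0.15, seed=None):
    """Return (inputs, labels) for one protein sequence.

    Each residue is replaced by MASK_TOKEN with probability mask_prob.
    labels holds the original residue at masked positions and None elsewhere.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for residue in sequence.upper():
        if rng.random() < mask_prob:
            inputs.append(MASK_TOKEN)
            labels.append(residue)
        else:
            inputs.append(residue)
            labels.append(None)
    return inputs, labels
```

The model is then trained to predict the label at each masked position from the surrounding residues, so no manual annotation is needed.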

	A few important differences between DistilProtBert and the original ProtBert are:
	1. The size of the model
	2. The size of the pretraining dataset
	3. Time & hardware used for pretraining

	## Intended uses & limitations

	The model can be used for protein feature extraction or fine-tuned on downstream tasks.

	### How to use

	The model can be used in the same way as ProtBert.
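
A minimal feature-extraction sketch, following the ProtBert input convention (uppercase residues separated by spaces, with the rare residues U, Z, O and B mapped to X). It assumes the model is published on the Hub as `yarongef/DistilProtBert` and loads through the generic `AutoTokenizer` and `AutoModel` classes; the helper names `prepare_sequence` and `embed` are hypothetical.

```python
import re

MODEL_ID = "yarongef/DistilProtBert"  # assumed Hub id, taken from the repository name

def prepare_sequence(sequence):
    """Uppercase the sequence, map rare residues (U, Z, O, B) to X,
    and separate residues with spaces, as ProtBert expects."""
    cleaned = re.sub(r"[UZOB]", "X", sequence.strip().upper())
    return " ".join(cleaned)

def embed(sequence):
    """Return per-token embeddings for one protein sequence.

    torch and transformers are imported lazily so that prepare_sequence
    can be used without them installed.
    """
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, do_lower_case=False)
    model = AutoModel.from_pretrained(MODEL_ID)
    model.eval()

    inputs = tokenizer(prepare_sequence(sequence), return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Shape: (number of tokens including special tokens, hidden size)
    return outputs.last_hidden_state[0]
```

Calling `embed("MKTAYIAKQR")` downloads the weights on first use and returns one vector per token; averaging over the residue positions is one common way to obtain a single fixed-size protein representation.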

	## Training data

	DistilProtBert was pretrained on [UniRef50](https://www.uniprot.org/downloads), a dataset consisting of ~43 million protein sequences after length filtering (only sequences of 20 to 512 amino acids were used).
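
The length filter described above can be sketched as follows. Treating both bounds as inclusive is an assumption, and `filter_by_length` is a hypothetical helper rather than the original preprocessing code.

```python
MIN_LEN = 20   # shortest sequence kept (assumed inclusive)
MAX_LEN = 512  # longest sequence kept (assumed inclusive)

def filter_by_length(sequences, min_len=MIN_LEN, max_len=MAX_LEN):
    """Keep only sequences whose length lies within [min_len, max_len]."""
    return [seq for seq in sequences if min_len <= len(seq) <= max_len]
```

For example, passing sequences of length 19, 20, 512 and 513 keeps only the two middle ones.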