ESM-Scan

Calculate the fitness of single amino acid substitutions on proteins, using a zero-shot language model predictor

If you use this tool in your research, please cite:

Totaro, M.G. (2023). “ESM-Scan - a tool to guide amino acid substitutions.” bioRxiv. doi.org/10.1101/2023.12.12.571273
Meier, J. (2021). “Language Models Enable Zero-Shot Prediction of the Effects of Mutations on Protein Function.” bioRxiv (Cold Spring Harbor Laboratory), July. doi.org/10.1101/2021.07.09.450648

USAGE INSTRUCTIONS

Setup

No setup is required: fill the input boxes with the required data and click the Run button.
A list of examples can be found at the bottom of the page; click on one to autofill the fields.
If the server is not used for some time, it will go into standby.
Running a calculation resumes the tool from standby; the first run might take longer because of startup and model loading.

Input

write the full amino acid sequence of the protein to be analysed in the Sequence text box
wildcard characters (e.g. -X.B) can be inserted but, at the moment, the visualisation cannot handle them
write the substitutions to test in the Substitutions box
there are four running modes that can be used, depending on the input:
- single substitution or list thereof (in the form of R218K R218W): each listed substitution is scored
- residue position or list thereof: all possible substitutions at each position will be evaluated
- same-length sequence: the amino acid substitutions that differ from the input sequence will be evaluated, one by one
- any other input: a deep mutational scan of the full sequence will be performed
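The four modes above can be sketched as a small dispatcher. This is only an illustration of the rules as described, not the tool's own parser: the function name classify_input, the regular expressions, and the assumption that entries are separated by whitespace are all hypothetical.

```python
import re

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumed formats: a substitution such as R218K, or a bare position such as 218.
SUBSTITUTION = re.compile(rf"^[{AMINO_ACIDS}]\d+[{AMINO_ACIDS}]$")
POSITION = re.compile(r"^\d+$")


def classify_input(sequence: str, text: str) -> str:
    """Guess which running mode a Substitutions entry would select (hypothetical helper)."""
    tokens = text.split()
    if tokens and all(SUBSTITUTION.match(t) for t in tokens):
        return "substitutions"
    if tokens and all(POSITION.match(t) for t in tokens):
        return "positions"
    if len(tokens) == 1 and len(tokens[0]) == len(sequence):
        return "same-length sequence"
    # Anything else, including an empty box, falls back to the full scan.
    return "deep mutational scan"
```

For example, with the sequence MKTAYIAK, the entry "A4K Y5W" would be treated as a list of substitutions, "4 5" as a list of positions, and "MKTAWIAK" as a same-length sequence.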
the ESM model to use for the calculations can be chosen among those available on the Hugging Face Model Hub; esm2_t33_650M_UR50D offers the best trade-off between computational cost and accuracy*
the more accurate masked-marginals scoring strategy considers sequence context during inference, which increases the runtime significantly; if the wait is too long, you can untick the box to speed up the calculations, at the cost of accuracy
when running a deep mutational scan, it is recommended to use smaller models (8M, 35M, 150M parameters), since the runtime is significant, especially for longer sequences, and the server might be overloaded;
a 300-residue-long sequence might take over 30 min to calculate with larger models
in general, accuracy is influenced more by the scoring strategy than by the model size, so it is suggested to reduce the latter first when optimising for runtime;
the computational cost of the scoring strategy scales with the number of substitutions tested, while that of the model scales with the length of the wild-type sequence
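A minimal sketch of why the scoring strategy affects runtime, following the definitions in Meier et al. (2021): masked marginals mask the tested position before predicting, so every position needs its own model call, while wild-type marginals reuse a single call on the unmasked sequence. That the unticked option corresponds to wild-type marginals is an assumption here. The predict callable and toy_predict are hypothetical stand-ins for a real ESM forward pass.

```python
import math
from typing import Callable, Dict, List

MASK = "<mask>"
# One {amino acid: log-probability} dictionary per sequence position.
LogProbs = List[Dict[str, float]]
Predictor = Callable[[List[str]], LogProbs]


def masked_marginal_score(tokens: List[str], position: int, mutant: str,
                          predict: Predictor) -> float:
    """Log-odds of mutant vs wild type, predicted with the position masked."""
    masked = list(tokens)
    wild_type = masked[position]
    masked[position] = MASK  # the model must rely on context alone
    log_probs = predict(masked)[position]  # one model call per tested position
    return log_probs[mutant] - log_probs[wild_type]


def wild_type_marginal_score(tokens: List[str], position: int, mutant: str,
                             predict: Predictor) -> float:
    """Log-odds of mutant vs wild type from the unmasked wild-type sequence."""
    log_probs = predict(list(tokens))[position]  # result is reusable for all positions
    return log_probs[mutant] - log_probs[tokens[position]]


def toy_predict(tokens: List[str]) -> LogProbs:
    """Stand-in model (not ESM): uniform under a mask, biased to the visible residue."""
    alphabet = "ACDEFGHIKLMNPQRSTVWY"
    output = []
    for token in tokens:
        if token == MASK:
            output.append({aa: math.log(1 / 20) for aa in alphabet})
        else:
            output.append({aa: math.log(0.81 if aa == token else 0.01) for aa in alphabet})
    return output
```

Positions in this sketch are 0-based list indices. With the toy model, the unmasked variant is biased towards the residue it can see, which illustrates why masking is the more careful choice even though it costs one extra model call per position.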
it is possible to calculate the effect of multiple concurrent substitutions, but this has to be done manually, by changing the input sequence and running the calculation again
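The manual step of changing the input sequence can be scripted. The sketch below assumes substitutions are written as in R218K, with 1-based positions, and checks that the stated wild-type residue matches the sequence; the helper name apply_substitutions is hypothetical and not part of the tool.

```python
import re


def apply_substitutions(sequence: str, substitutions: str) -> str:
    """Return the sequence with every listed substitution applied (hypothetical helper)."""
    residues = list(sequence)
    for token in substitutions.split():
        match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", token)
        if match is None:
            raise ValueError(f"not a substitution: {token!r}")
        wild_type, mutant = match.group(1), match.group(3)
        position = int(match.group(2))  # assumed to be 1-based, as in R218K
        if not 1 <= position <= len(residues):
            raise ValueError(f"position out of range: {token!r}")
        if sequence[position - 1] != wild_type:
            raise ValueError(f"wild-type residue mismatch: {token!r}")
        residues[position - 1] = mutant
    return "".join(residues)
```

The returned string can be pasted into the Sequence text box as the new background, and a further substitution can then be scored against it.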

Output

Your results will be shown in a colour-coded table, except for the deep mutational scan, which will yield a heatmap. The output data can be downloaded from the box at the bottom.
File extensions are not supported by the server and need to be appended to the filenames after downloading:

CSV for tables
SVG for full-sequence deep mutational scan
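Appending the extension can be done by hand or with a short script. The sketch below uses only the Python standard library; the helper name add_extension and the file name in the usage comment are hypothetical.

```python
from pathlib import Path


def add_extension(path: Path, extension: str) -> Path:
    """Rename a downloaded file so that its name ends with the given extension."""
    if path.suffix.lower() == extension.lower():
        return path  # already has the extension, nothing to do
    target = path.with_name(path.name + extension)
    return path.rename(target)  # returns the new path (Python 3.8+)


# Usage, assuming a table was downloaded as "results" into the working directory:
# add_extension(Path("results"), ".csv")
```

Use ".csv" for tables and ".svg" for the heatmap of a full-sequence deep mutational scan.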