# **ESM-Scan**
Calculate the fitness of single amino acid substitutions on proteins, using a [zero-shot](https://doi.org/10.1101/2021.07.09.450648) [language model predictor](https://github.com/facebookresearch/esm)
## **Usage instructions**
### **Setup**
No setup is required: fill the input boxes with the required data and click the `Run` button.
A list of examples can be found at the bottom of the page; click on one to autofill the fields.
If the server is not used for some time, it goes into standby.
Running a calculation resumes the tool from standby, so the first run might take longer due to startup and model loading.
### **Input**
- write the full amino acid sequence of the protein to be analysed in the **Sequence** text box
wildcard characters (e.g. `-X.B`) can be inserted but, at the moment, the visualisation cannot handle them
- write the substitutions to test in the **Substitutions** box
there are four running modes that can be used, depending on the input:
+ *single substitution* or list thereof (in the form of `R218K R218W`): each listed substitution is scored
+ *residue position* or list thereof: all possible substitutions at each position are evaluated
+ *same-length sequence*: each position where the amino acid differs from the input sequence is evaluated as a substitution, one by one
+ *any other input*: a deep mutational scan of the full sequence is performed
- the ESM model to use for the calculations can be chosen from those available on the Hugging Face Model Hub;
`esm2_t33_650M_UR50D` offers the best cost-accuracy trade-off[*](https://doi.org/10.1126/science.ade2574)
- the `masked-marginals` scoring strategy considers sequence context at inference time, which makes it slower but more accurate;
in case of long runtimes, you can untick the box to speed up the calculations significantly, at the cost of accuracy
- when running a deep mutational scan, smaller models (8M, 35M, 150M parameters) are recommended, since the runtime is significant, especially for longer sequences, and the server might be overloaded;
a 300-residue sequence can take over 30 min with the larger models;
in general, accuracy depends strongly on the scoring strategy and less on the model size, so it is suggested to reduce the model size first when optimising for runtime;
the computational cost of the scoring strategy scales with the number of substitutions tested, while that of the model scales with the length of the wild-type sequence
- it is possible to calculate the effect of multiple concurrent substitutions, but this has to be done manually, by changing the input sequence and running the calculation again
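
To prepare a sequence that carries several substitutions at once, the manual edit described in the last point can be scripted. The following is a minimal sketch, not part of the tool: the helper name `apply_substitutions` is hypothetical, and it assumes that substitutions use the `R218K` form shown above with 1-based residue positions.

```python
import re

# Assumed format: wild-type residue, 1-based position, new residue (e.g. R218K).
SUBSTITUTION = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def apply_substitutions(sequence: str, substitutions: str) -> str:
    """Return a copy of `sequence` carrying every substitution in the list.

    `substitutions` is a whitespace-separated string such as "R218K A220G".
    Hypothetical helper for illustration only.
    """
    residues = list(sequence)
    for token in substitutions.split():
        match = SUBSTITUTION.match(token)
        if match is None:
            raise ValueError(f"not a substitution: {token!r}")
        wild_type, position, mutant = match.groups()
        index = int(position) - 1  # assumption: positions are 1-based
        if not 0 <= index < len(residues):
            raise ValueError(f"position out of range: {token!r}")
        # Check against the original sequence to catch numbering mistakes.
        if sequence[index] != wild_type:
            raise ValueError(
                f"{token!r} expects {wild_type} at position {position}, "
                f"found {sequence[index]}"
            )
        residues[index] = mutant
    return "".join(residues)


# Toy sequence, not a real protein.
print(apply_substitutions("MKTAYIAK", "K2R A4G"))  # MRTGYIAK
```

The returned string can be pasted into the **Sequence** box as the new background, and the calculation run again with the additional substitution of interest.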
### **Output**
Your results will be shown in a colour-coded table, except for the deep mutational scan, which yields a heatmap.
The output data can be downloaded from the box at the bottom.
The server does not support file extensions, so they need to be appended to the filenames after downloading:
- `CSV` for tables
- `SVG` for full-sequence deep mutational scan
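
If many files are downloaded, appending the extension can be scripted. This is a small sketch using only the Python standard library; the helper name `add_extension` and the path in the usage comment are hypothetical placeholders.

```python
from pathlib import Path


def add_extension(path, extension):
    """Rename a downloaded file by appending `extension` and return the new path."""
    source = Path(path)
    # Append rather than replace, since the downloaded name has no extension.
    target = source.with_name(source.name + "." + extension.lstrip("."))
    source.rename(target)
    return target


# Usage (hypothetical path, adjust to the actual downloaded file):
# add_extension("Downloads/downloaded_table", "csv")
```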