Noelia Ferruz committed
Commit a230244 · 1 Parent(s): 006ad59

Update README.md

Files changed (1): README.md +2 -4
README.md CHANGED
@@ -53,8 +53,7 @@ python run_clm.py --model_name_or_path nferruz/ProtGPT2 --train_file training.tx
 The HuggingFace script run_clm.py can be found here: https://github.com/huggingface/transformers/blob/master/examples/pytorch/language-modeling/run_clm.py
 
 ### **How to select the best sequences**
-We've observed that perplexity values correlate with AlphaFold2's plddt. This plot shows perplexity vs. pldtt values for each of the 10,000 sequences in the ProtGPT2-generated dataset (see https://huggingface.co/nferruz/ProtGPT2/blob/main/ppl-plddt.png)
-
+We've observed that perplexity values correlate with AlphaFold2's plddt.
 We recommend to compute perplexity for each sequence with the HuggingFace evaluate method `perplexity`:
 
 ```
@@ -64,8 +63,7 @@ results = perplexity.compute(predictions=predictions, model_id='nferruz/ProtGPT2
 ```
 
 Where `predictions` is a list containing the generated sequences.
-As a rule of thumb, sequences with perplexity values below 72 are more likely to have plddt values in line with natural sequences.
-
+We do not yet have a threshold as of what perplexity value gives a 'good' or 'bad' sequence, but given the fast inference times, the best is to sample many sequences, order them by perplexity, and select those with the lower values (the lower the better).
 
 
 ### **Training specs**
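
The selection workflow the updated text recommends (sample many sequences, order by perplexity, keep the lowest) can be sketched as below. This is an illustrative sketch, not part of the commit: the sequences and perplexity values are placeholders, and `rank_by_perplexity` is a hypothetical helper; in practice the values would come from the `evaluate.load("perplexity")` call shown in the README.

```python
# Sketch of the selection step: rank generated sequences by perplexity
# and keep the lowest-scoring ones (lower is better).
#
# The perplexities below are illustrative placeholders. In practice they
# would come from the HuggingFace evaluate snippet in the README, e.g.:
#   perplexity = evaluate.load("perplexity", module_type="metric")
#   results = perplexity.compute(predictions=sequences, model_id="nferruz/ProtGPT2")
#   perplexities = results["perplexities"]

def rank_by_perplexity(sequences, perplexities, top_k=3):
    """Return the top_k sequences with the lowest perplexity."""
    ranked = sorted(zip(perplexities, sequences), key=lambda pair: pair[0])
    return [seq for _, seq in ranked[:top_k]]

# Hypothetical generated sequences with placeholder perplexity values.
sequences = ["MKTAYIAK", "MLEPTVAA", "MGSSHHHH", "MADEEKLP"]
perplexities = [85.2, 41.7, 120.9, 63.4]

best = rank_by_perplexity(sequences, perplexities, top_k=2)
print(best)  # lowest-perplexity sequences first
```

Because inference is fast, generating a large batch and ranking the whole batch this way is cheap compared to folding every candidate with AlphaFold2.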