Noelia Ferruz committed
Commit a230244 · 1 Parent(s): 006ad59

Update README.md

Files changed (1): README.md +2 -4
README.md CHANGED
@@ -53,8 +53,7 @@ python run_clm.py --model_name_or_path nferruz/ProtGPT2 --train_file training.tx
 The HuggingFace script run_clm.py can be found here: https://github.com/huggingface/transformers/blob/master/examples/pytorch/language-modeling/run_clm.py
 
 ### **How to select the best sequences**
-We've observed that perplexity values correlate with AlphaFold2's plddt. This plot shows perplexity vs. pldtt values for each of the 10,000 sequences in the ProtGPT2-generated dataset (see https://huggingface.co/nferruz/ProtGPT2/blob/main/ppl-plddt.png)
-
+We've observed that perplexity values correlate with AlphaFold2's plddt.
 We recommend to compute perplexity for each sequence with the HuggingFace evaluate method `perplexity`:
 
 ```
@@ -64,8 +63,7 @@ results = perplexity.compute(predictions=predictions, model_id='nferruz/ProtGPT2
 ```
 
 Where `predictions` is a list containing the generated sequences.
-As a rule of thumb, sequences with perplexity values below 72 are more likely to have plddt values in line with natural sequences.
-
+We do not yet have a threshold as of what perplexity value gives a 'good' or 'bad' sequence, but given the fast inference times, the best is to sample many sequences, order them by perplexity, and select those with the lower values (the lower the better).
 
 
 ### **Training specs**
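
The selection workflow the updated text recommends (sample many sequences, order by perplexity, keep the lowest) can be sketched as below. This is an illustrative sketch, not part of the commit: the sequences and perplexity values are placeholders, and `rank_by_perplexity` is a hypothetical helper; in practice the values would come from the `evaluate.load("perplexity")` call shown in the README.

```python
# Sketch of the selection step: rank generated sequences by perplexity
# and keep the lowest-scoring ones (lower is better).
#
# The perplexities below are illustrative placeholders. In practice they
# would come from the HuggingFace evaluate snippet in the README, e.g.:
#   perplexity = evaluate.load("perplexity", module_type="metric")
#   results = perplexity.compute(predictions=sequences, model_id="nferruz/ProtGPT2")
#   perplexities = results["perplexities"]

def rank_by_perplexity(sequences, perplexities, top_k=3):
    """Return the top_k sequences with the lowest perplexity."""
    ranked = sorted(zip(perplexities, sequences), key=lambda pair: pair[0])
    return [seq for _, seq in ranked[:top_k]]

# Hypothetical generated sequences with placeholder perplexity values.
sequences = ["MKTAYIAK", "MLEPTVAA", "MGSSHHHH", "MADEEKLP"]
perplexities = [85.2, 41.7, 120.9, 63.4]

best = rank_by_perplexity(sequences, perplexities, top_k=2)
print(best)  # lowest-perplexity sequences first
```

Because inference is fast, generating a large batch and ranking the whole batch this way is cheap compared to folding every candidate with AlphaFold2.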