lhallee committed · verified
Commit e655f5a · 1 Parent(s): cd0641b

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +41 -21
README.md CHANGED
@@ -43,30 +43,50 @@ model = AutoModelForMaskedLM.from_pretrained('Synthyra/ESMplusplus_small', trust
  ```
 
  ## Embed entire datasets with no new code
- To embed a list of protein sequences **fast**, just call embed_dataset. Sequences are sorted to reduce padding tokens, so the initial progress bar estimation is usually much longer than the actual time.
+ To embed a list of protein sequences **fast**, just call embed_dataset. Sequences are sorted to reduce padding tokens, so the initial progress bar estimate is usually much longer than the actual time it will take.
+
+ Example:
  ```python
- embeddings = model.embed_dataset(
-     sequences=sequences, # list of protein strings
-     batch_size=16, # embedding batch size
-     max_len=2048, # truncate to max_len
-     full_embeddings=True, # return residue-wise embeddings
-     full_precision=False, # store as float32
-     pooling_type='mean', # use mean pooling if protein-wise embeddings
-     num_workers=0, # data loading num workers
-     sql=False, # return dictionary of sequences and embeddings
+ embedding_dict = model.embed_dataset(
+     sequences=[
+         'MALWMRLLPLLALLALWGPDPAAA', ... # list of protein sequences
+     ],
+     batch_size=2, # adjust for your GPU memory
+     max_len=512, # adjust for your needs
+     full_embeddings=False, # if True, no pooling is performed
+     embed_dtype=torch.float32, # cast to the dtype you want
+     pooling_type=['mean', 'cls'], # more than one pooling type will be concatenated together
+     num_workers=0, # if you have many CPU cores, we find num_workers=4 is fast for large datasets
+     sql=False, # if True, embeddings will be stored in a SQLite database
+     sql_db_path='embeddings.db',
+     save=True, # if True, embeddings will be saved as a .pth file
+     save_path='embeddings.pth',
  )
+ # embedding_dict maps each sequence to its embedding: torch tensors for .pth, numpy arrays for sql
+ ```
 
- _ = model.embed_dataset(
-     sequences=sequences, # list of protein strings
-     batch_size=16, # embedding batch size
-     max_len=2048, # truncate to max_len
-     full_embeddings=True, # return residue-wise embeddings
-     full_precision=False, # store as float32
-     pooling_type='mean', # use mean pooling if protein-wise embeddings
-     num_workers=0, # data loading num workers
-     sql=True, # store sequences in local SQL database
-     sql_db_path='embeddings.db', # path to .db file of choice
- )
+ ```
+ model.embed_dataset()
+ Args:
+     sequences: List of protein sequences
+     batch_size: Batch size for processing
+     max_len: Maximum sequence length
+     full_embeddings: Whether to return full residue-wise embeddings (True) or pooled embeddings (False)
+     pooling_type: Type of pooling ('mean' or 'cls'); a list applies multiple poolings and concatenates them
+     num_workers: Number of workers for data loading, 0 for the main process
+     embed_dtype: Data type to cast the embeddings to
+     save: Whether to save embeddings to a .pth file
+     save_path: Path to the .pth file
+     sql: Whether to store embeddings in a SQLite database (stored in float32)
+     sql_db_path: Path to the SQLite database
+
+ Returns:
+     Dictionary mapping sequences to embeddings, or None if sql=True
+
+ Note:
+     - If sql=True, embeddings can only be stored in float32
+     - sql is ideal if you need to stream a very large dataset for training in real time
+     - save=True is ideal if you can store the entire embedding dictionary in RAM
+     - If sql=True, it takes precedence over save
+     - If your SQL database or .pth file already exists, it will be scanned first for already embedded sequences
+     - Sequences are truncated to max_len and sorted by length in descending order for faster processing
  ```
 
  ## Fine-tuning with 🤗 peft
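
As a sanity check for the `save=True` path above, the sketch below reloads the saved dictionary with `torch.load`; it assumes `embeddings.pth` holds the same sequence-to-tensor mapping that `embed_dataset` returns, and the lookup key is just the example sequence from the snippet above.

```python
import torch

# Illustrative sketch: reload embeddings written by
# embed_dataset(save=True, save_path='embeddings.pth').
# Assumes the .pth file holds the same {sequence: tensor} dictionary that
# embed_dataset returns when save=True.
embedding_dict = torch.load('embeddings.pth', map_location='cpu')

seq = 'MALWMRLLPLLALLALWGPDPAAA'  # example sequence from the README
emb = embedding_dict.get(seq)
if emb is not None:
    # With full_embeddings=False and pooling_type=['mean', 'cls'], each value
    # should be a single vector of 2 x hidden_size (the two poolings concatenated).
    print(seq[:10], tuple(emb.shape))
```

For the `sql=True` path, embeddings are stored in float32 in the SQLite database; the table layout is not documented here, so inspect `embeddings.db` (for example with the `sqlite3` CLI) before wiring it into a streaming dataloader.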