lhallee committed · verified
Commit e655f5a · 1 Parent(s): cd0641b

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +41 -21
README.md CHANGED
@@ -43,30 +43,50 @@ model = AutoModelForMaskedLM.from_pretrained('Synthyra/ESMplusplus_small', trust
  ```
 
  ## Embed entire datasets with no new code
- To embed a list of protein sequences **fast**, just call embed_dataset. Sequences are sorted to reduce padding tokens, so the initial progress bar estimation is usually much longer than the actual time.
+ To embed a list of protein sequences **fast**, just call embed_dataset. Sequences are sorted to reduce padding tokens, so the initial progress bar estimate is usually much longer than the actual time it will take.
+
+ Example:
  ```python
- embeddings = model.embed_dataset(
-     sequences=sequences, # list of protein strings
-     batch_size=16, # embedding batch size
-     max_len=2048, # truncate to max_len
-     full_embeddings=True, # return residue-wise embeddings
-     full_precision=False, # store as float32
-     pooling_type='mean', # use mean pooling if protein-wise embeddings
-     num_workers=0, # data loading num workers
-     sql=False, # return dictionary of sequences and embeddings
+ embedding_dict = model.embed_dataset(
+     sequences=[
+         'MALWMRLLPLLALLALWGPDPAAA', ... # list of protein sequences
+     ],
+     batch_size=2, # adjust for your GPU memory
+     max_len=512, # adjust for your needs
+     full_embeddings=False, # if True, no pooling is performed
+     embed_dtype=torch.float32, # cast to the dtype you want
+     pooling_type=['mean', 'cls'], # more than one pooling type will be concatenated together
+     num_workers=0, # if you have many CPU cores, we find num_workers=4 is fast for large datasets
+     sql=False, # if True, embeddings will be stored in a SQLite database
+     sql_db_path='embeddings.db',
+     save=True, # if True, embeddings will be saved as a .pth file
+     save_path='embeddings.pth',
  )
+ # embedding_dict maps each sequence to its embedding: torch tensors for .pth, numpy arrays for sql
+ ```
 
- _ = model.embed_dataset(
-     sequences=sequences, # list of protein strings
-     batch_size=16, # embedding batch size
-     max_len=2048, # truncate to max_len
-     full_embeddings=True, # return residue-wise embeddings
-     full_precision=False, # store as float32
-     pooling_type='mean', # use mean pooling if protein-wise embeddings
-     num_workers=0, # data loading num workers
-     sql=True, # store sequences in local SQL database
-     sql_db_path='embeddings.db', # path to .db file of choice
- )
+ ```
+ model.embed_dataset()
+ Args:
+     sequences: List of protein sequences
+     batch_size: Batch size for processing
+     max_len: Maximum sequence length
+     full_embeddings: Whether to return full residue-wise embeddings (True) or pooled embeddings (False)
+     pooling_type: Type of pooling ('mean' or 'cls'); a list applies multiple poolings and concatenates them
+     num_workers: Number of workers for data loading, 0 for the main process
+     embed_dtype: Data type to cast the embeddings to
+     save: Whether to save embeddings to a .pth file
+     save_path: Path to the .pth file
+     sql: Whether to store embeddings in a SQLite database (stored in float32)
+     sql_db_path: Path to the SQLite database
+
+ Returns:
+     Dictionary mapping sequences to embeddings, or None if sql=True
+
+ Note:
+     - If sql=True, embeddings can only be stored in float32
+     - sql is ideal if you need to stream a very large dataset for training in real time
+     - save=True is ideal if you can store the entire embedding dictionary in RAM
+     - If sql=True, it takes precedence over save
+     - If your SQL database or .pth file already exists, it will be scanned first for already embedded sequences
+     - Sequences are truncated to max_len and sorted by length in descending order for faster processing
  ```
 
  ## Fine-tuning with 🤗 peft
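
As a sanity check for the `save=True` path above, the sketch below reloads the saved dictionary with `torch.load`; it assumes `embeddings.pth` holds the same sequence-to-tensor mapping that `embed_dataset` returns, and the lookup key is just the example sequence from the snippet above.

```python
import torch

# Illustrative sketch: reload embeddings written by
# embed_dataset(save=True, save_path='embeddings.pth').
# Assumes the .pth file holds the same {sequence: tensor} dictionary that
# embed_dataset returns when save=True.
embedding_dict = torch.load('embeddings.pth', map_location='cpu')

seq = 'MALWMRLLPLLALLALWGPDPAAA'  # example sequence from the README
emb = embedding_dict.get(seq)
if emb is not None:
    # With full_embeddings=False and pooling_type=['mean', 'cls'], each value
    # should be a single vector of 2 x hidden_size (the two poolings concatenated).
    print(seq[:10], tuple(emb.shape))
```

For the `sql=True` path, embeddings are stored in float32 in the SQLite database; the table layout is not documented here, so inspect `embeddings.db` (for example with the `sqlite3` CLI) before wiring it into a streaming dataloader.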