Upload README.md with huggingface_hub
README.md CHANGED
@@ -43,30 +43,50 @@ model = AutoModelForMaskedLM.from_pretrained('Synthyra/ESMplusplus_small', trust_remote_code=True)
```

## Embed entire datasets with no new code

To embed a list of protein sequences **fast**, just call `embed_dataset`. Sequences are sorted to reduce padding tokens, so the initial progress bar estimate is usually much longer than the actual time it will take.

Example:
```python
embedding_dict = model.embed_dataset(
    sequences=[
        'MALWMRLLPLLALLALWGPDPAAA', ...  # list of protein sequences
    ],
    batch_size=2,  # adjust for your GPU memory
    max_len=512,  # adjust for your needs
    full_embeddings=False,  # if True, no pooling is performed
    embed_dtype=torch.float32,  # cast embeddings to the dtype you want
    pooling_type=['mean', 'cls'],  # more than one pooling type will be concatenated together
    num_workers=0,  # if you have many CPU cores, we find that num_workers=4 is fast for large datasets
    sql=False,  # if True, embeddings will be stored in a SQLite database
    sql_db_path='embeddings.db',
    save=True,  # if True, embeddings will be saved as a .pth file
    save_path='embeddings.pth',
)
# embedding_dict maps each sequence to its embedding: torch tensors for .pth, numpy arrays for sql
```
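
As a minimal sketch of using the output (assuming the call above ran with `save=True`): the returned dictionary maps each input sequence string to its embedding tensor, so you can index it directly, and the `.pth` file written to `save_path` can be reloaded later with `torch.load`. The exact embedding shape depends on your pooling settings.
```python
import torch

# look up the pooled embedding for one of the input sequences
emb = embedding_dict['MALWMRLLPLLALLALWGPDPAAA']
print(emb.shape)  # depends on pooling_type; multiple poolings are concatenated along the feature dim

# reload the dictionary written by save=True / save_path
embedding_dict = torch.load('embeddings.pth')
```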

```
model.embed_dataset()
Args:
    sequences: List of protein sequences
    batch_size: Batch size for processing
    max_len: Maximum sequence length
    full_embeddings: Whether to return full residue-wise embeddings (True) or pooled embeddings (False)
    embed_dtype: Dtype to cast embeddings to
    pooling_type: Type of pooling ('mean' or 'cls')
    num_workers: Number of workers for data loading; 0 uses the main process
    sql: Whether to store embeddings in a SQLite database - stored in float32
    sql_db_path: Path to the SQLite database
    save: Whether to save embeddings to a .pth file
    save_path: Path to the .pth file

Returns:
    Dictionary mapping sequences to embeddings, or None if sql=True

Note:
    - If sql=True, embeddings can only be stored in float32
    - sql is ideal if you need to stream a very large dataset for training in real time
    - save=True is ideal if you can fit the entire embedding dictionary in RAM
    - If sql=True, the SQLite database is used regardless of the save setting
    - If your SQLite database or .pth file already exists, it will be scanned first for already embedded sequences
    - Sequences are truncated to max_len and sorted by length in descending order for faster processing
```
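
For the `sql=True` path, the notes above recommend streaming embeddings during training. Below is a minimal sketch of reading them back with Python's built-in sqlite3 module; the table and column names (`embeddings`, `sequence`, `embedding`) are assumptions for illustration, not the documented schema, so inspect the database produced by `embed_dataset` before relying on them. Embeddings are stored in float32, as noted above.
```python
import sqlite3
import numpy as np

# NOTE: table/column names below are assumed for illustration;
# check the actual schema written by embed_dataset (e.g. with `.schema` in the sqlite3 shell).
conn = sqlite3.connect('embeddings.db')
for seq, blob in conn.execute('SELECT sequence, embedding FROM embeddings'):
    emb = np.frombuffer(blob, dtype=np.float32)  # stored in float32 per the notes above
    ...  # feed emb into your training loop
conn.close()
```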

## Fine-tuning with 🤗 peft