Henry Kenlay
committed on
Upload README.md
README.md
ADDED
@@ -0,0 +1,80 @@
---
tags:
- antibody language model
- antibody
base_model: Exscientia/IgT5_unpaired
license: mit
---

# IgT5 model

IgT5 is a model pretrained on protein and antibody sequences using a masked language modeling (MLM) objective. It was introduced in the paper [Large scale paired antibody language models](https://arxiv.org/abs/2403.17889).

The model is finetuned from IgT5-unpaired (`Exscientia/IgT5_unpaired`) using paired antibody sequences from the paired subset of the Observed Antibody Space (OAS).

# Use

The encoder part of the model and the tokeniser can be loaded using the `transformers` library:

```python
from transformers import T5EncoderModel, T5Tokenizer

tokeniser = T5Tokenizer.from_pretrained("Exscientia/IgT5", do_lower_case=False)
model = T5EncoderModel.from_pretrained("Exscientia/IgT5")
```

The tokeniser is used to prepare batch inputs:

```python
# heavy chain sequences
sequences_heavy = [
    "VQLAQSGSELRKPGASVKVSCDTSGHSFTSNAIHWVRQAPGQGLEWMGWINTDTGTPTYAQGFTGRFVFSLDTSARTAYLQISSLKADDTAVFYCARERDYSDYFFDYWGQGTLVTVSS",
    "QVQLVESGGGVVQPGRSLRLSCAASGFTFSNYAMYWVRQAPGKGLEWVAVISYDGSNKYYADSVKGRFTISRDNSKNTLYLQMNSLRTEDTAVYYCASGSDYGDYLLVYWGQGTLVTVSS"
]

# light chain sequences
sequences_light = [
    "EVVMTQSPASLSVSPGERATLSCRARASLGISTDLAWYQQRPGQAPRLLIYGASTRATGIPARFSGSGSGTEFTLTISSLQSEDSAVYYCQQYSNWPLTFGGGTKVEIK",
    "ALTQPASVSGSPGQSITISCTGTSSDVGGYNYVSWYQQHPGKAPKLMIYDVSKRPSGVSNRFSGSKSGNTASLTISGLQSEDEADYYCNSLTSISTWVFGGGTKLTVL"
]

# The tokeniser expects input of the form ["V Q ... S S </s> E V ... I K", ...]
paired_sequences = []
for sequence_heavy, sequence_light in zip(sequences_heavy, sequences_light):
    paired_sequences.append(' '.join(sequence_heavy)+' </s> '+' '.join(sequence_light))

tokens = tokeniser.batch_encode_plus(
    paired_sequences,
    add_special_tokens=True,
    padding=True,  # pad to the longest sequence in the batch (replaces the deprecated pad_to_max_length=True)
    return_tensors="pt",
    return_special_tokens_mask=True
)
```
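
The pairing loop above can be wrapped in a small helper that also checks that both lists have the same length, since `zip` silently truncates to the shorter one. This is a minimal sketch: `format_paired_sequences` is a hypothetical name used for illustration and is not part of `transformers` or of the IgT5 release.

```python
def format_paired_sequences(heavy_chains, light_chains, separator="</s>"):
    """Return space-separated 'heavy </s> light' strings, one per antibody.

    Hypothetical helper for illustration; it only reproduces the string
    format built by the pairing loop above.
    """
    if len(heavy_chains) != len(light_chains):
        raise ValueError("heavy_chains and light_chains must have the same length")
    return [
        f"{' '.join(heavy)} {separator} {' '.join(light)}"
        for heavy, light in zip(heavy_chains, light_chains)
    ]


# Toy sequences, shortened for readability
paired = format_paired_sequences(["VQL", "QV"], ["EVV", "AL"])
print(paired)  # ['V Q L </s> E V V', 'Q V </s> A L']
```

The returned list can be passed to the tokeniser in place of `paired_sequences`.
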
Note that the tokeniser adds a `</s>` token at the end of each paired sequence and pads using the `<pad>` token. For example, a batch containing the sequences `V Q L </s> E V V` and `Q V </s> A L` will be tokenised to `V Q L </s> E V V </s>` and `Q V </s> A L </s> <pad> <pad>`.

Residue-level embeddings are generated by feeding the tokens through the model:

```python
output = model(
    input_ids=tokens['input_ids'],
    attention_mask=tokens['attention_mask']
)

residue_embeddings = output.last_hidden_state
```
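
Because each paired input is laid out as heavy chain residues, a `</s>` separator, light chain residues, and a final `</s>`, per-chain embeddings can be recovered by slicing along the sequence dimension. The following is a minimal sketch, assuming every residue and every `</s>` maps to exactly one token; `split_chain_slices` is a hypothetical helper name, not part of any library.

```python
def split_chain_slices(len_heavy, len_light):
    """Return (heavy_slice, light_slice) for one tokenised paired sequence.

    Assumed layout: heavy residues, one separator token, light residues,
    then the end-of-sequence token and any padding.
    """
    heavy = slice(0, len_heavy)
    light_start = len_heavy + 1  # skip the </s> separator
    light = slice(light_start, light_start + len_light)
    return heavy, light


# Toy example on the token layout of "V Q L </s> E V V </s>"
layout = ["V", "Q", "L", "</s>", "E", "V", "V", "</s>"]
heavy_slice, light_slice = split_chain_slices(len_heavy=3, len_light=3)
print(layout[heavy_slice], layout[light_slice])  # ['V', 'Q', 'L'] ['E', 'V', 'V']
```

Under the same assumption, `residue_embeddings[i, heavy_slice]` and `residue_embeddings[i, light_slice]` would select the heavy and light chain embeddings of the `i`-th antibody in the batch.
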
To obtain a single representation per sequence, the residue embeddings can be averaged with the special tokens masked out:

```python
import torch

# mask special tokens before summing over embeddings
residue_embeddings[tokens["special_tokens_mask"] == 1] = 0
sequence_embeddings_sum = residue_embeddings.sum(1)

# average embedding by dividing sum by sequence lengths
sequence_lengths = torch.sum(tokens["special_tokens_mask"] == 0, dim=1)
sequence_embeddings = sequence_embeddings_sum / sequence_lengths.unsqueeze(1)
```
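
The same averaging can also be written without modifying `residue_embeddings` in place, which is convenient if the per-residue embeddings are needed afterwards. This is a minimal sketch on dummy tensors; `mean_pool` is a hypothetical helper name, and the shapes used here are illustrative only.

```python
import torch


def mean_pool(residue_embeddings, special_tokens_mask):
    """Average embeddings over positions where special_tokens_mask == 0."""
    # (batch, length, 1) weights: 1.0 for residues, 0.0 for special tokens
    keep = (special_tokens_mask == 0).unsqueeze(-1).to(residue_embeddings.dtype)
    summed = (residue_embeddings * keep).sum(dim=1)
    counts = keep.sum(dim=1).clamp(min=1)  # guard against division by zero
    return summed / counts


# Dummy batch: 2 sequences, 4 positions, embedding size 3
embeddings = torch.arange(24, dtype=torch.float32).reshape(2, 4, 3)
mask = torch.tensor([[0, 0, 0, 1], [0, 0, 1, 1]])
pooled = mean_pool(embeddings, mask)
print(pooled.shape)  # torch.Size([2, 3])
```

With the tensors from the earlier steps, this would be called as `mean_pool(residue_embeddings, tokens["special_tokens_mask"])`.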