Update README.md
README.md CHANGED
@@ -7722,7 +7722,7 @@ model-index:
 
 `udever-bloom-560m` is finetuned from [bigscience/bloom-560m](https://huggingface.co/bigscience/bloom-560m) via [BitFit](https://aclanthology.org/2022.acl-short.1/) on MS MARCO Passage Ranking, SNLI and MultiNLI data.
 It is a universal embedding model across tasks, natural and programming languages.
-(From
+(From a technical view, `udever` is essentially `sgpt-bloom` with some minor improvements.)
 
 <div align=center><img width="338" height="259" src="https://user-images.githubusercontent.com/26690193/277643721-cdb7f227-cae5-40e1-b6e1-a201bde00339.png" /></div>
 
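To illustrate the BitFit recipe referenced above: a minimal sketch of bias-only finetuning, assuming the base model named in the card; the optimizer and learning rate are illustrative assumptions, not the authors' training setup.

```python
import torch
from transformers import BloomModel

# BitFit: freeze every parameter except the bias terms, then train as usual.
model = BloomModel.from_pretrained('bigscience/bloom-560m')
for name, param in model.named_parameters():
    param.requires_grad = name.endswith('bias')

# Only the (tiny) set of bias parameters reaches the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)  # lr is illustrative
```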
@@ -7742,6 +7742,7 @@ It is a universal embedding model across tasks, natural and programming languages.
 
 - **Repository:** [github.com/izhx/uni-rep](https://github.com/izhx/uni-rep)
 - **Paper:** [Language Models are Universal Embedders](https://arxiv.org/pdf/2310.08232.pdf)
+- **Training Date:** 2023-06
 
 
 
@@ -7750,6 +7751,37 @@ It is a universal embedding model across tasks, natural and programming languages.
 Use the code below to get started with the model.
 
 ```python
+import torch
+from transformers import AutoTokenizer, BloomModel
+
+tokenizer = AutoTokenizer.from_pretrained('izhx/udever-bloom-560m')
+model = BloomModel.from_pretrained('izhx/udever-bloom-560m')
+
+# Boundary tokens: queries are wrapped in [BOQ]...[EOQ], documents in [BOD]...[EOD].
+boq, eoq, bod, eod = '[BOQ]', '[EOQ]', '[BOD]', '[EOD]'
+eoq_id, eod_id = tokenizer.convert_tokens_to_ids([eoq, eod])
+
+# Left padding keeps the EOS token in the last position of every row.
+if tokenizer.padding_side != 'left':
+    print('!!!', tokenizer.padding_side)
+    tokenizer.padding_side = 'left'
+
+
+def encode(texts: list, is_query: bool = True, max_length=300):
+    bos = boq if is_query else bod
+    eos_id = eoq_id if is_query else eod_id
+    texts = [bos + t for t in texts]
+    # Reserve one position so the appended EOS token fits within max_length.
+    encoding = tokenizer(
+        texts, truncation=True, max_length=max_length - 1, padding=True
+    )
+    for ids, mask in zip(encoding['input_ids'], encoding['attention_mask']):
+        ids.append(eos_id)
+        mask.append(1)
+    inputs = tokenizer.pad(encoding, return_tensors='pt')
+    with torch.inference_mode():
+        outputs = model(**inputs)
+        # Last-token pooling: the hidden state at the final (EOS) position.
+        embeds = outputs.last_hidden_state[:, -1]
+    return embeds
+
+
+encode(['I am Bert', 'You are Elmo'])
 
 ```
 
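A hypothetical retrieval example built on the snippet above, assuming it has already been run; the query and document strings are invented, and cosine similarity is an assumption, since the card does not prescribe a scoring metric.

```python
import torch.nn.functional as F

# Embed one query and two candidate documents with the snippet's encode().
query = encode(['What is BitFit?'], is_query=True)
docs = encode(
    ['BitFit tunes only the bias terms of a pretrained model.',
     'BLOOM is a multilingual language model.'],
    is_query=False,
)
# Rank documents by cosine similarity to the query; higher is better.
scores = F.cosine_similarity(query, docs)  # shape: (2,)
print(scores.argsort(descending=True))     # best-matching document first
```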
@@ -7767,7 +7799,7 @@ Use the code below to get started with the model.
 
 <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
 
-#### Preprocessing
+#### Preprocessing
 
 MS MARCO hard negatives are those provided by the [sentence-transformers MS MARCO training script](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/ms_marco/train_bi-encoder_mnrl.py#L86).
 Negatives for SNLI and MultiNLI are randomly sampled.
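To make the SNLI/MultiNLI step concrete: a minimal sketch of random negative sampling; the function and data are invented for illustration and this is not the authors' preprocessing code.

```python
import random

def sample_negatives(pairs, seed=42):
    """pairs: list of (premise, entailed_hypothesis) tuples."""
    assert len(pairs) > 1
    rng = random.Random(seed)
    triples = []
    for i, (premise, positive) in enumerate(pairs):
        # Pick a hypothesis from an unrelated pair as the random negative.
        j = rng.randrange(len(pairs))
        while j == i:  # avoid the pair's own hypothesis
            j = rng.randrange(len(pairs))
        triples.append((premise, positive, pairs[j][1]))
    return triples

triples = sample_negatives([
    ('A man plays guitar.', 'A person makes music.'),
    ('A dog runs outside.', 'An animal is outdoors.'),
    ('Kids eat lunch.', 'Children are eating.'),
])
```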