File size: 5,406 Bytes
be4fdf4 1cb1b74 be4fdf4 6f68f4d be4fdf4 1cb1b74 be4fdf4 1cb1b74 be4fdf4 1cb1b74 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 |
---
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
---
# whaleloops/phrase-bert
This is the official repository for the EMNLP 2021 long paper [Phrase-BERT: Improved Phrase Embeddings from BERT with an Application to Corpus Exploration](https://arxiv.org/abs/2109.06304). We provide [code](https://github.com/sf-wa-326/phrase-bert-topic-model) for training and evaluating Phrase-BERT in addition to the datasets used in the paper.
## Usage (Sentence-Transformers)
Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:
```
pip install -U sentence-transformers
```
Our model is tested on pytorch=1.9.0, tranformers=4.8.1, sentence-tranformers = 2.1.0 TODO
Then you can use the model like this:
```python
from sentence_transformers import SentenceTransformer
phrase_list = [ 'play an active role', 'participate actively', 'active lifestyle']
model = SentenceTransformer('whaleloops/phrase-bert')
phrase_embs = model.encode( phrase_list )
[p1, p2, p3] = phrase_embs
```
As in sentence-BERT, the default output is a list of numpy arrays:
````
for phrase, embedding in zip(phrase_list, phrase_embs):
print("Phrase:", phrase)
print("Embedding:", embedding)
print("")
````
An example of computing the dot product of phrase embeddings:
````
import numpy as np
print(f'The dot product between phrase 1 and 2 is: {np.dot(p1, p2)}')
print(f'The dot product between phrase 1 and 3 is: {np.dot(p1, p3)}')
print(f'The dot product between phrase 2 and 3 is: {np.dot(p2, p3)}')
````
An example of computing cosine similarity of phrase embeddings:
````
import torch
from torch import nn
cos_sim = nn.CosineSimilarity(dim=0)
print(f'The cosine similarity between phrase 1 and 2 is: {cos_sim( torch.tensor(p1), torch.tensor(p2))}')
print(f'The cosine similarity between phrase 1 and 3 is: {cos_sim( torch.tensor(p1), torch.tensor(p3))}')
print(f'The cosine similarity between phrase 2 and 3 is: {cos_sim( torch.tensor(p2), torch.tensor(p3))}')
````
The output should look like:
````
The dot product between phrase 1 and 2 is: 218.43600463867188
The dot product between phrase 1 and 3 is: 165.48483276367188
The dot product between phrase 2 and 3 is: 160.51708984375
The cosine similarity between phrase 1 and 2 is: 0.8142536282539368
The cosine similarity between phrase 1 and 3 is: 0.6130303144454956
The cosine similarity between phrase 2 and 3 is: 0.584893524646759
````
## Evaluation
Given the lack of a unified phrase embedding evaluation benchmark, we collect the following five phrase semantics evaluation tasks, which are described further in our paper:
* Turney [[Download](https://storage.googleapis.com/phrase-bert/turney/data.txt) ]
* BiRD [[Download](https://storage.googleapis.com/phrase-bert/bird/data.txt)]
* PPDB [[Download](https://storage.googleapis.com/phrase-bert/ppdb/examples.json)]
* PPDB-filtered [[Download](https://storage.googleapis.com/phrase-bert/ppdb_exact/examples.json)]
* PAWS-short [[Download Train-split](https://storage.googleapis.com/phrase-bert/paws_short/train_examples.json) ] [[Download Dev-split](https://storage.googleapis.com/phrase-bert/paws_short/dev_examples.json) ] [[Download Test-split](https://storage.googleapis.com/phrase-bert/paws_short/test_examples.json) ]
Change `config/model_path.py` with the model path according to your directories and
* For evaluation on Turney, run `python eval_turney.py`
* For evaluation on BiRD, run `python eval_bird.py`
* for evaluation on PPDB / PPDB-filtered / PAWS-short, run `eval_ppdb_paws.py` with:
````
nohup python -u eval_ppdb_paws.py \
--full_run_mode \
--task <task-name> \
--data_dir <input-data-dir> \
--result_dir <result-storage-dr> \
>./output.txt 2>&1 &
````
## Train your own Phrase-BERT
If you would like to go beyond using the pre-trained Phrase-BERT model, you may train your own Phrase-BERT using data from the domain you are interested in. Please refer to
`phrase-bert/phrase_bert_finetune.py`
The datasets we used to fine-tune Phrase-BERT are here: [training data csv file](https://storage.googleapis.com/phrase-bert/phrase-bert-ft-data/pooled_context_para_triples_p%3D0.8_train.csv) and [validation data csv file](https://storage.googleapis.com/phrase-bert/phrase-bert-ft-data/pooled_context_para_triples_p%3D0.8_valid.csv).
To re-produce the trained Phrase-BERT, please run:
export INPUT_DATA_PATH=<directory-of-phrasebert-finetuning-data>
export TRAIN_DATA_FILE=<training-data-filename.csv>
export VALID_DATA_FILE=<validation-data-filename.csv>
export INPUT_MODEL_PATH=bert-base-nli-stsb-mean-tokens
export OUTPUT_MODEL_PATH=<directory-of-saved-model>
python -u phrase_bert_finetune.py \
--input_data_path $INPUT_DATA_PATH \
--train_data_file $TRAIN_DATA_FILE \
--valid_data_file $VALID_DATA_FILE \
--input_model_path $INPUT_MODEL_PATH \
--output_model_path $OUTPUT_MODEL_PATH
## Citation:
Please cite us if you find this useful:
````
@inproceedings{phrasebertwang2021,
author={Shufan Wang and Laure Thompson and Mohit Iyyer},
Booktitle = {Empirical Methods in Natural Language Processing},
Year = "2021",
Title={Phrase-BERT: Improved Phrase Embeddings from BERT with an Application to Corpus Exploration}
}
````
|