---
license: mit
language:
- en
tags:
- sentence-embedding
- sentence-similarity
- transformers
- feature-extraction
pipeline_tag: sentence-similarity
---
# Phi-2-Text-Embedding-cft
## Description
This is a fine-tuned version of [Phi-2](https://huggingface.co/microsoft/phi-2) for text embedding tasks. The model is fine-tuned on NLI datasets using contrastive fine-tuning with LoRA. The paper can be found [here](https://arxiv.org/abs/2408.00690).
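For reference, the contrastive objective used is the InfoNCE loss over in-batch positives and negatives. The snippet below is a minimal sketch of that loss for illustration only; the function name and tensor shapes are assumptions, while the temperature of 0.05 matches the training details listed further down this card.
```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor, pos_emb: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """Minimal InfoNCE sketch: the i-th row of pos_emb is the positive for the
    i-th query; every other row in the batch serves as an in-batch negative."""
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(pos_emb, dim=-1)
    logits = q @ p.T / temperature                      # (batch, batch) cosine similarities
    labels = torch.arange(q.size(0), device=q.device)   # positives sit on the diagonal
    return F.cross_entropy(logits, labels)
```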
## Base Model
[Phi-2](https://huggingface.co/microsoft/phi-2)
## Usage
1. Clone Phi-2 repository
```bash
git clone https://huggingface.co/microsoft/phi-2
```
2. Change the tokenizer setting in `tokenizer_config.json` so that an EOS token is appended to every input (the embedding is taken from the hidden state of this last token)
```json
"add_eos_token": true
```
3. Use the model
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import numpy as np

class PhiSentenceEmbedding:
    def __init__(self, model_path='microsoft/phi-2', adapter_path=None):
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.model = AutoModelForCausalLM.from_pretrained(model_path,
                                                          torch_dtype=torch.bfloat16,
                                                          device_map='cuda',
                                                          trust_remote_code=True)
        if adapter_path is not None:
            # Load fine-tuned LoRA adapter
            self.model.load_adapter(adapter_path)

    def get_last_hidden_state(self, text):
        # Embed a sentence as the final-layer hidden state of its last (EOS) token
        inputs = self.tokenizer(text, return_tensors="pt").to('cuda')
        with torch.no_grad():
            out = self.model(**inputs, output_hidden_states=True).hidden_states[-1][0, -1, :]
        return out.squeeze().float().cpu().numpy()

    def encode(self, sentences: list[str], **kwargs) -> list[np.ndarray]:
        """Returns a list of embeddings for the given sentences.

        Args:
            sentences: List of sentences to encode

        Returns:
            List of embeddings for the given sentences
        """
        return [self.get_last_hidden_state(s) for s in sentences]

phi_sentence_embedding = PhiSentenceEmbedding('<your-cloned-base-model-path>', 'trapoom555/Phi-2-Text-Embedding-cft')

example_sentences = ["I don't like apples", "I like apples"]
encoded_sentences = phi_sentence_embedding.encode(example_sentences)

print(encoded_sentences)
```
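The embeddings returned by `encode` are plain NumPy vectors, so sentence similarity can be computed directly, e.g. with cosine similarity. The snippet below continues from the example above and is only an illustration; it is not part of the original usage code.
```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity between two 1-D embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

sim = cosine_similarity(encoded_sentences[0], encoded_sentences[1])
print(f"Cosine similarity: {sim:.4f}")
```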
## Training Details
| **Setting**             | **Value**         |
|-------------------------|-------------------|
| Loss | InfoNCE |
| Batch Size | 60 |
| InfoNCE Temperature | 0.05 |
| Learning Rate | 5e-05 |
| Warmup Steps | 100 |
| Learning Rate Scheduler | CosineAnnealingLR |
| LoRA Rank | 8 |
| LoRA Alpha | 32 |
| LoRA Dropout | 0.1 |
| Training Precision | bf16 |
| Max Epoch | 1 |
| GPU | RTX3090 |
| Num GPUs | 4 |
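As a rough illustration of how the LoRA hyperparameters in the table map onto a `peft` configuration (a sketch only: the `target_modules` shown here are an assumption, and the exact modules are defined in the training repository linked below):
```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("microsoft/phi-2",
                                            torch_dtype=torch.bfloat16,
                                            trust_remote_code=True)

# r, lora_alpha and lora_dropout follow the table above; target_modules is illustrative.
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "dense"],
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
```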
## Training Scripts
The training scripts for this model can be found in this [GitHub repository](https://github.com/trapoom555/Language-Model-STS-CFT/tree/main).
## Checkpoints
We provide checkpoints every 500 training steps, which can be found [here](https://huggingface.co/trapoom555/Phi-2-Text-Embedding-cft-checkpoints).
## Evaluation Results
| **Benchmarks** | **Before cft** | **After cft** |
|----------------|----------------|---------------|
| STS12 | 23.04 | 61.62 |
| STS13 | 20.79 | 71.87 |
| STS14 | 17.06 | 60.46 |
| STS15 | 24.56 | 71.18 |
| STS16 | 48.68 | 74.77 |
| STS17 | 41.43 | 80.20 |
| STSBenchmark | 37.87 | 79.46 |
| BIOSSES        | 28.04          | 64.06         |
| SICK-R | 48.40 | 74.37 |
| **Overall** | **32.21** | **70.89** |
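These STS benchmarks are conventionally scored as the Spearman correlation (×100) between the model's cosine similarities and human similarity judgements. Below is a minimal sketch of that metric, assuming an `embedder` with the `encode` interface shown above and placeholder `sentence_pairs` / `gold_scores` lists.
```python
import numpy as np
from scipy.stats import spearmanr

def sts_spearman(embedder, sentence_pairs, gold_scores):
    """Spearman correlation (x100) between cosine similarities and gold scores."""
    sims = []
    for s1, s2 in sentence_pairs:
        e1, e2 = embedder.encode([s1, s2])
        sims.append(np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2)))
    rho, _ = spearmanr(sims, gold_scores)
    return rho * 100
```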
## Contributors
Trapoom Ukarapol, Zhicheng Lee, Amy Xin
## Footnotes
This work is the final project of the Natural Language Processing Spring 2024 course at Tsinghua University 🟣. We would like to express our sincere gratitude to this course!