---
license: mit
language:
- en
tags:
- sentence-embedding
- sentence-similarity
- transformers
- feature-extraction
pipeline_tag: sentence-similarity
---
# Phi-2-Text-Embedding-cft
## Description
This is a fine-tuned version of [Phi-2](https://huggingface.co/microsoft/phi-2) for text embedding tasks. The model is fine-tuned on NLI datasets using contrastive fine-tuning with LoRA. The paper can be found [here](https://arxiv.org/abs/2408.00690).
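For reference, the contrastive objective used is the InfoNCE loss over in-batch positives and negatives. The snippet below is a minimal sketch of that loss for illustration only; the function name and tensor shapes are assumptions, while the temperature of 0.05 matches the training details listed further down this card.
```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor, pos_emb: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """Minimal InfoNCE sketch: the i-th row of pos_emb is the positive for the
    i-th query; every other row in the batch serves as an in-batch negative."""
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(pos_emb, dim=-1)
    logits = q @ p.T / temperature                      # (batch, batch) cosine similarities
    labels = torch.arange(q.size(0), device=q.device)   # positives sit on the diagonal
    return F.cross_entropy(logits, labels)
```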
## Base Model
[Phi-2](https://huggingface.co/microsoft/phi-2)
## Usage
1. Clone Phi-2 repository
```bash
git clone https://huggingface.co/microsoft/phi-2
```
2. Change the tokenizer setting in `tokenizer_config.json` so that an EOS token is appended to every input (the embedding is taken from the hidden state of this last token)
```json
"add_eos_token": true
```
3. Use the model
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import numpy as np

class PhiSentenceEmbedding:
    def __init__(self, model_path='microsoft/phi-2', adapter_path=None):
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.model = AutoModelForCausalLM.from_pretrained(model_path,
                                                          torch_dtype=torch.bfloat16,
                                                          device_map='cuda',
                                                          trust_remote_code=True)
        if adapter_path is not None:
            # Load fine-tuned LoRA adapter
            self.model.load_adapter(adapter_path)

    def get_last_hidden_state(self, text):
        # Embed a sentence as the final-layer hidden state of its last (EOS) token
        inputs = self.tokenizer(text, return_tensors="pt").to('cuda')
        with torch.no_grad():
            out = self.model(**inputs, output_hidden_states=True).hidden_states[-1][0, -1, :]
        return out.squeeze().float().cpu().numpy()

    def encode(self, sentences: list[str], **kwargs) -> list[np.ndarray]:
        """Returns a list of embeddings for the given sentences.

        Args:
            sentences: List of sentences to encode

        Returns:
            List of embeddings for the given sentences
        """
        return [self.get_last_hidden_state(s) for s in sentences]

phi_sentence_embedding = PhiSentenceEmbedding('<your-cloned-base-model-path>', 'trapoom555/Phi-2-Text-Embedding-cft')

example_sentences = ["I don't like apples", "I like apples"]
encoded_sentences = phi_sentence_embedding.encode(example_sentences)

print(encoded_sentences)
```
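The embeddings returned by `encode` are plain NumPy vectors, so sentence similarity can be computed directly, e.g. with cosine similarity. The snippet below continues from the example above and is only an illustration; it is not part of the original usage code.
```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity between two 1-D embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

sim = cosine_similarity(encoded_sentences[0], encoded_sentences[1])
print(f"Cosine similarity: {sim:.4f}")
```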
## Training Details
| **Setting**             | **Value**         |
|-------------------------|-------------------|
| Loss | InfoNCE |
| Batch Size | 60 |
| InfoNCE Temperature | 0.05 |
| Learning Rate | 5e-05 |
| Warmup Steps | 100 |
| Learning Rate Scheduler | CosineAnnealingLR |
| LoRA Rank | 8 |
| LoRA Alpha | 32 |
| LoRA Dropout | 0.1 |
| Training Precision | bf16 |
| Max Epoch | 1 |
| GPU | RTX3090 |
| Num GPUs | 4 |
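As a rough illustration of how the LoRA hyperparameters in the table map onto a `peft` configuration (a sketch only: the `target_modules` shown here are an assumption, and the exact modules are defined in the training repository linked below):
```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("microsoft/phi-2",
                                            torch_dtype=torch.bfloat16,
                                            trust_remote_code=True)

# r, lora_alpha and lora_dropout follow the table above; target_modules is illustrative.
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "dense"],
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
```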
## Training Scripts
The training scripts for this model can be found in this [GitHub repository](https://github.com/trapoom555/Language-Model-STS-CFT/tree/main).
## Checkpoints
We provide checkpoints every 500 training steps, which can be found [here](https://huggingface.co/trapoom555/Phi-2-Text-Embedding-cft-checkpoints).
## Evaluation Results
| **Benchmarks** | **Before cft** | **After cft** |
|----------------|----------------|---------------|
| STS12 | 23.04 | 61.62 |
| STS13 | 20.79 | 71.87 |
| STS14 | 17.06 | 60.46 |
| STS15 | 24.56 | 71.18 |
| STS16 | 48.68 | 74.77 |
| STS17 | 41.43 | 80.20 |
| STSBenchmark | 37.87 | 79.46 |
| BIOSSES        | 28.04          | 64.06         |
| SICK-R | 48.40 | 74.37 |
| **Overall** | **32.21** | **70.89** |
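These STS benchmarks are conventionally scored as the Spearman correlation (×100) between the model's cosine similarities and human similarity judgements. Below is a minimal sketch of that metric, assuming an `embedder` with the `encode` interface shown above and placeholder `sentence_pairs` / `gold_scores` lists.
```python
import numpy as np
from scipy.stats import spearmanr

def sts_spearman(embedder, sentence_pairs, gold_scores):
    """Spearman correlation (x100) between cosine similarities and gold scores."""
    sims = []
    for s1, s2 in sentence_pairs:
        e1, e2 = embedder.encode([s1, s2])
        sims.append(np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2)))
    rho, _ = spearmanr(sims, gold_scores)
    return rho * 100
```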
## Contributors
Trapoom Ukarapol, Zhicheng Lee, Amy Xin
## Footnotes
This work is the final project of the Natural Language Processing Spring 2024 course at Tsinghua University 🟣. We would like to express our sincere gratitude to this course!