metadata

license: mit
language:
  - en
tags:
  - sentence-embedding
  - sentence-similarity
  - transformers
  - feature-extraction
pipeline_tag: sentence-similarity

Phi-2-Text-Embedding-cft

Description

This model is a version of Phi-2 fine-tuned for text embedding tasks. It was trained with contrastive fine-tuning and LoRA on NLI datasets. The paper can be found here.

Base Model

Phi-2

Usage

Clone Phi-2 repository

git clone https://huggingface.co/microsoft/phi-2

Change a tokenizer setting in tokenizer_config.json

"add_eos_token": true
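The same setting can also be changed programmatically. The following is a minimal sketch, assuming the repository was cloned into a local directory; the helper name `enable_eos_token` and the example path `phi-2` are illustrative and not part of the original instructions.

```python
import json
from pathlib import Path


def enable_eos_token(model_dir: str) -> dict:
    """Set "add_eos_token": true in <model_dir>/tokenizer_config.json."""
    config_path = Path(model_dir) / "tokenizer_config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    config["add_eos_token"] = True
    # Write the file back, keeping every other setting unchanged.
    config_path.write_text(
        json.dumps(config, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )
    return config


# Example call (hypothetical path: the directory created by `git clone`):
# enable_eos_token("phi-2")
```

Editing the file by hand works equally well; the script is only convenient when the setup has to be repeated.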

Use the model

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import numpy as np

class PhiSentenceEmbedding:
    def __init__(self, model_path='microsoft/phi-2', adapter_path=None):
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.model = AutoModelForCausalLM.from_pretrained(model_path, 
                                                          torch_dtype=torch.bfloat16,
                                                          device_map='cuda',
                                                          trust_remote_code=True)
        if adapter_path is not None:
            # Load the fine-tuned LoRA adapter on top of the base model
            self.model.load_adapter(adapter_path)

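    def get_last_hidden_state(self, text):
        inputs = self.tokenizer(text, return_tensors="pt").to('cuda')
        with torch.no_grad():
            # Last layer's hidden state at the final token (the EOS token when add_eos_token is true)
            out = self.model(**inputs, output_hidden_states=True).hidden_states[-1][0, -1, :]
        # Cast bf16 to float32 before converting, since NumPy has no bfloat16 dtype
        return out.float().cpu().numpy()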

    def encode(self, sentences: list[str], **kwargs) -> list[np.ndarray]:
        """
        Returns a list of embeddings for the given sentences.
        
        Args:
            sentences: List of sentences to encode

        Returns:
            List of embeddings for the given sentences
        """

        out = []

        for s in sentences:
            out.append(self.get_last_hidden_state(s))

        return out

# Replace the placeholder with the path to your cloned base model
phi_sentence_embedding = PhiSentenceEmbedding('<your-cloned-base-model-path>', 'trapoom555/Phi-2-Text-Embedding-cft')

example_sentences = ["I don't like apples", "I like apples"]

encoded_sentences = phi_sentence_embedding.encode(example_sentences)

print(encoded_sentences)
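To compare two of the returned embeddings, a similarity measure is needed. The sketch below uses cosine similarity, a common choice for sentence embeddings; this model card does not state which measure the authors used, so treat it as an assumption. The function name `cosine_similarity` is illustrative.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two 1-D embedding vectors."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# With the embeddings from the example above:
# score = cosine_similarity(encoded_sentences[0], encoded_sentences[1])
```

The result lies in [-1, 1], with higher values indicating more similar sentences.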

Training Details

Training Details	Value
Loss	InfoNCE
Batch Size	60
InfoNCE Temperature	0.05
Learning Rate	5e-05
Warmup Steps	100
Learning Rate Scheduler	CosineAnnealingLR
LoRA Rank	8
LoRA Alpha	32
LoRA Dropout	0.1
Training Precision	bf16
Max Epoch	1
GPU	RTX 3090
Num GPUs	4
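
For readers unfamiliar with the loss listed above, the following is a minimal NumPy sketch of InfoNCE with a temperature of 0.05. It assumes in-batch negatives and cosine similarity, which is a common formulation but is not confirmed by this model card; the actual training script may differ. The function name `info_nce_loss` is illustrative.

```python
import numpy as np


def info_nce_loss(anchors: np.ndarray, positives: np.ndarray,
                  temperature: float = 0.05) -> float:
    """InfoNCE with in-batch negatives (assumed formulation).

    anchors, positives: arrays of shape (batch, dim). Row i of `positives`
    is the positive for row i of `anchors`; all other rows act as negatives.
    """
    # L2-normalize so that dot products are cosine similarities (assumption).
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (a @ p.T) / temperature                      # shape (batch, batch)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy with the matching pair (the diagonal) as the target.
    return float(-np.mean(np.diag(log_probs)))
```

A lower temperature sharpens the softmax, so hard negatives contribute more strongly to the loss.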

Training Scripts

The training script for this model is available in this GitHub repository.

Checkpoints

We provide checkpoints saved every 500 training steps, which can be found here.

Evaluation Results

Benchmarks	Before cft	After cft
STS12	23.04	61.62
STS13	20.79	71.87
STS14	17.06	60.46
STS15	24.56	71.18
STS16	48.68	74.77
STS17	41.43	80.20
STSBenchmark	37.87	79.46
BIOSSES	28.04	64.06
SICK-R	48.40	74.37
Overall	32.21	70.89
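
The Overall row matches the unweighted mean of the nine benchmark scores. The short check below reproduces it from the numbers in the table; the variable names are illustrative.

```python
from statistics import mean

# Scores copied from the table above: (before cft, after cft).
scores = {
    "STS12": (23.04, 61.62),
    "STS13": (20.79, 71.87),
    "STS14": (17.06, 60.46),
    "STS15": (24.56, 71.18),
    "STS16": (48.68, 74.77),
    "STS17": (41.43, 80.20),
    "STSBenchmark": (37.87, 79.46),
    "BIOSSES": (28.04, 64.06),
    "SICK-R": (48.40, 74.37),
}

before = round(mean(s[0] for s in scores.values()), 2)
after = round(mean(s[1] for s in scores.values()), 2)
print(before, after)  # 32.21 70.89
```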

Contributors

Trapoom Ukarapol, Zhicheng Lee, Amy Xin

Footnotes

This work is the final project of the Natural Language Processing Spring 2024 course at Tsinghua University 🟣. We would like to express our sincere gratitude to the course!