MarcusLoren's picture
Update README.md
9a36897 verified
metadata
license: mit
pipeline_tag: text-classification

This is a fine-tuned version of BAAI/bge-reranker-base.

I created a dataset of 89k items, I used a LLM to extract and label relevant information from job adverts description and used the unlabeled data as the negatives. I used the same type of method to extract labels from the LLM by scraping different company websites so I can get labels for e.g product, service and upcoming offerings.

I fine-tuned the reranker by using the labels from the LLM and then inserting "Example of" in the start of the query to provide the model more context and intent of the query.

Fine-tuned querys:

Example of education
Example of certification
Example of qualifications
Example of work experience
Example of hard skills
Example of soft skills 
Example of benefits
Example of breadtext
Example of company culture
Example of product
Example of service
Example of upcoming offerings
Example of job duties

It works pretty well, not 100% since that might require more data but I could only train it for max 12hr due to Kaggle's sessions restrictions so I couldn't trainer it on a larger dataset.

image/png

You can load the model as you usually load the FlagReranker model:

from FlagEmbedding import FlagReranker
reranker = FlagReranker('BAAI/bge-reranker-large', use_fp16=True) # Setting use_fp16 to True speeds up computation with a slight performance degradation

score = reranker.compute_score(["Example of education", "Requires bachelor degree"])
print(score) 

Or using tranformer:

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-reranker-large')
model = AutoModelForSequenceClassification.from_pretrained('BAAI/bge-reranker-large')
model.eval()

pairs = [["Example of education", "Requires bachelor degree"]]
with torch.no_grad():
    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512)
    scores = model(**inputs, return_dict=True).logits.view(-1, ).float()
    print(scores)

Along with the model files I also uploaded a ONNX file which you load in .NET (C#) using the below. For tokenizing the input i used BlingFire since they have the sentence piece model for xlm_roberta_base, the only special thing you'll need is to add the special tokens that indicate the start and EOF. ONNX model: https://huggingface.co/MarcusLoren/Reranker-job-description/blob/main/Reranker.onnx

   public class RankerInput
   {  
       public long[] input_ids { get; set; } 
       public long[] attention_mask { get; set; }
   }
       public class RankedOutput
   {
       public float[] logits { get; set; }
   }
 _mlContext = new MLContext();
 
 var onnxModelPath = "Reranker.onnx";
 var dataView = _mlContext.Data.LoadFromEnumerable(new List<RankerInput>());
 var pipeline = _mlContext.Transforms.ApplyOnnxModel(
     modelFile: onnxModelPath,
     gpuDeviceId: 0,
     outputColumnNames: new[] { nameof(RankedOutput.logits) },
     inputColumnNames: new[] { nameof(RankerInput.input_ids), nameof(RankerInput.attention_mask) });
 rankerModel = pipeline.Fit(dataView);
 var predictionEngine = _mlContext.Model.CreatePredictionEngine<RankerInput, RankedOutput>(rankerModel);

 var tokened_input = Tokenize(["Example of education", "Requires bachelor degree"])
 
 var pred = predictionEngine.Predict(tokened_input)
 var score = pred.logits[0];  // e.g 0.99
  
  private RankerInput Tokenize(string[] pair)
  {   
        List<long> input_ids =
        [
                0,
                .. TokenizeText(pair[0]),
                2,
                2,
                .. TokenizeText(pair[1]),
                2,
            ];

        var attention_mask = Enumerable.Repeat((long)1, input_ids.Count).ToArray();
        return new RankerInput() { input_ids = input_ids.ToArray(), attention_mask = attention_mask }; 
}
 
 var TokenizerModel = BlingFireUtils.LoadModel(@".\xlm_roberta_base.bin");
 public int[] TokenizeText(string text)
 {
     List<int> tokenized = new();
     foreach (var chunk in text.Split(' ').Chunk(80)) {
 
         int[] labelIds = new int[128]; 
         byte[] inBytes = Encoding.UTF8.GetBytes(string.Join(" ", chunk));
         var outputCount = BlingFireUtils2.TextToIds(TokenizerModel, inBytes, inBytes.Length, labelIds, labelIds.Length, 0);
         Array.Resize(ref labelIds, outputCount);
         tokenized.AddRange(labelIds);
     }
     return tokenized.ToArray();
 }

license: mit