How to Fine-Tune Custom Embedding Models Using AutoTrain

Community Article Published May 30, 2024

AutoTrain is a no-code tool for training or fine-tuning many state-of-the-art models, including Sentence Transformer models, on your own datasets. Here’s a simple guide to get you started on fine-tuning custom embedding models with AutoTrain.

Types of Sentence Transformer Fine-Tuning in AutoTrain

AutoTrain supports various types of sentence transformer fine-tuning tasks:

  1. pair: Dataset with two sentences: anchor and positive.
  2. pair_class: Dataset with two sentences: premise and hypothesis with a target label.
  3. pair_score: Dataset with two sentences: sentence1 and sentence2 with a target score.
  4. triplet: Dataset with three sentences: anchor, positive, and negative.
  5. qa: Dataset with two sentences: query and answer.

Data Format

AutoTrain accepts data in CSV or JSONL format. You can also use a dataset from the Hugging Face Hub. Let’s look at the format for each task.

  • pair:

    | anchor                          | positive                        |
    |---------------------------------|---------------------------------|
    | hello                           | hi                              |
    | how are you                     | I am fine                       |
    | What is your name?              | My name is Abhishek             |
    | Which is the best programming language? | Python                |
    
  • pair_class:

    | premise                         | hypothesis                      | label |
    |---------------------------------|---------------------------------|-------|
    | hello                           | hi                              | 1     |
    | how are you                     | I am fine                       | 0     |
    | What is your name?              | My name is Abhishek             | 1     |
    | Which is the best programming language? | Python                | 1     |
    
  • pair_score:

    | sentence1                       | sentence2                       | score |
    |---------------------------------|---------------------------------|-------|
    | hello                           | hi                              | 0.8   |
    | how are you                     | I am fine                       | 0.2   |
    | What is your name?              | My name is Abhishek             | 0.9   |
    | Which is the best programming language? | Python                | 0.7   |
    
  • triplet:

    | anchor                          | positive                        | negative                        |
    |---------------------------------|---------------------------------|---------------------------------|
    | hello                           | hi                              | bye                             |
    | how are you                     | I am fine                       | I am not fine                   |
    | What is your name?              | My name is Abhishek             | What's it to you?               |
    | Which is the best programming language? | Python                | Javascript                      |
    
  • qa:

    | query                           | answer                          |
    |---------------------------------|---------------------------------|
    | hello                           | hi                              |
    | how are you                     | I am fine                       |
    | What is your name?              | My name is Abhishek             |
    | Which is the best programming language? | Python                |
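
As a rough illustration of the two file formats, the sketch below serializes a couple of the triplet rows from the tables above as CSV and as JSONL using only the Python standard library. The helper names (`rows_to_csv`, `rows_to_jsonl`, `save`) and the file names in the usage note are made up for this example and are not part of AutoTrain.

```python
import csv
import io
import json

# Toy triplet rows copied from the table above.
ROWS = [
    {"anchor": "hello", "positive": "hi", "negative": "bye"},
    {"anchor": "how are you", "positive": "I am fine", "negative": "I am not fine"},
]


def rows_to_csv(rows):
    """Return CSV text with a header row built from the first row's keys."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def rows_to_jsonl(rows):
    """Return JSONL text: one JSON object per line."""
    return "".join(json.dumps(row, ensure_ascii=False) + "\n" for row in rows)


def save(text, path):
    """Write text to disk without translating the line endings it already has."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        f.write(text)
```

Calling `save(rows_to_csv(ROWS), "train.csv")` or `save(rows_to_jsonl(ROWS), "train.jsonl")` would produce a file whose columns match the triplet layout. The same pattern should carry over to the other tasks by changing the keys.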
    

Column Mapping

Column mapping is crucial for AutoTrain to understand the role of each column in your dataset. Here’s how you can map columns for each task:

| Task       | Column Mapping |
|------------|----------------|
| pair       | {"sentence1_column": "anchor", "sentence2_column": "positive"} |
| pair_class | {"sentence1_column": "premise", "sentence2_column": "hypothesis", "target_column": "label"} |
| pair_score | {"sentence1_column": "sentence1", "sentence2_column": "sentence2", "target_column": "score"} |
| triplet    | {"sentence1_column": "anchor", "sentence2_column": "positive", "sentence3_column": "negative"} |
| qa         | {"sentence1_column": "query", "sentence2_column": "answer"} |

Tips for Accurate Mapping

  • Verify Column Names: Ensure that the names used in the mapping dictionary match those in your dataset.
  • Update Mappings for New Datasets: Each dataset might need unique mappings based on its structure.
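
One way to act on the first tip is to compare the mapping against the dataset header before starting a run. The following is a minimal sketch for CSV data, using only the standard library. `read_csv_header` and `missing_columns` are hypothetical helpers written for this example, not AutoTrain functions.

```python
import csv
import io


def read_csv_header(text):
    """Return the first row of CSV text as a list of column names."""
    return next(csv.reader(io.StringIO(text)))


def missing_columns(header, column_mapping):
    """Return the mapped column names that do not appear in the header."""
    present = set(header)
    return sorted(name for name in column_mapping.values() if name not in present)


# Example: a triplet mapping checked against a small in-memory CSV.
sample = "anchor,positive,negative\nhello,hi,bye\n"
mapping = {
    "sentence1_column": "anchor",
    "sentence2_column": "positive",
    "sentence3_column": "negative",
}
problems = missing_columns(read_csv_header(sample), mapping)
if problems:
    print("Columns missing from the dataset:", problems)
```

An empty result means every column named in the mapping exists in the file. A non-empty result lists the names to fix, either in the mapping or in the dataset.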

With properly formatted data and the right column mapping in place, you are ready to fine-tune a custom embedding model with AutoTrain.

Fine-Tuning Using CLI with AutoTrain

Once you have your data and column mappings set up, you can proceed to fine-tune your Sentence Transformer model using the AutoTrain CLI. Below is an example configuration for a triplet task.

Configuration File (config.yml)

Create a config.yml file with the following content:

task: sentence-transformers:triplet
base_model: microsoft/mpnet-base
project_name: autotrain-st-triplet
log: tensorboard
backend: local

data:
  path: sentence-transformers/all-nli
  train_split: triplet:train
  valid_split: triplet:dev
  column_mapping:
    sentence1_column: anchor
    sentence2_column: positive
    sentence3_column: negative

params:
  max_seq_length: 512
  epochs: 5
  batch_size: 8
  lr: 2e-5
  optimizer: adamw_torch
  scheduler: linear
  gradient_accumulation: 1
  mixed_precision: fp16

hub:
  username: ${HF_USERNAME}
  token: ${HF_TOKEN}
  push_to_hub: true

Steps to Fine-Tune Using CLI

  1. Install AutoTrain-Advanced: Ensure you have autotrain-advanced installed. You can install it via pip:

    pip install autotrain-advanced
    
  2. Prepare Configuration: Save the above YAML configuration as config.yml.

  3. Run Training: Execute the following command to start the fine-tuning process:

    autotrain --config config.yml
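
If you prefer to launch the run from a Python script, the steps above can be sketched as follows. This assumes the `${HF_USERNAME}` and `${HF_TOKEN}` placeholders in config.yml are filled in from environment variables of the same names, which is an assumption worth checking against your AutoTrain version. The helper names are made up for this example, and the command is the same `autotrain --config config.yml` shown above.

```python
import os
import subprocess

# Assumption: the config placeholders are resolved from these environment variables.
REQUIRED_ENV = ("HF_USERNAME", "HF_TOKEN")


def missing_env(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV if not env.get(name)]


def build_command(config_path="config.yml"):
    """Build the CLI invocation as an argument list."""
    return ["autotrain", "--config", config_path]


def run_training(config_path="config.yml"):
    """Check the environment, then start training and raise if the CLI fails."""
    missing = missing_env()
    if missing:
        raise RuntimeError("Set these environment variables first: " + ", ".join(missing))
    subprocess.run(build_command(config_path), check=True)
```

Calling `run_training()` from the directory that contains config.yml would behave like running the command by hand, with an early error if the credentials are not set.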
    

Configuration for Different Tasks

Here’s how the config.yml file will change for different tasks:

Pair Task

task: sentence-transformers:pair

data:
  column_mapping:
    sentence1_column: anchor
    sentence2_column: positive

Pair Class Task

task: sentence-transformers:pair_class

data:
  column_mapping:
    sentence1_column: premise
    sentence2_column: hypothesis
    target_column: label

Pair Score Task

task: sentence-transformers:pair_score

data:
  column_mapping:
    sentence1_column: sentence1
    sentence2_column: sentence2
    target_column: score

QA Task

task: sentence-transformers:qa

data:
  column_mapping:
    sentence1_column: query
    sentence2_column: answer

Please note that each task needs a different type of dataset, and the column mapping will change depending on the column names available in your dataset. Here's an example dataset with most types of trainers above.
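
To keep the per-task differences in one place, the fragments above can also be generated from a small lookup table. This is only a sketch: `COLUMN_MAPPINGS` and `config_fragment` are names invented for this example, and the column names are the ones used in this article, so adjust them to match your own dataset.

```python
# Column mappings per task, using the column names from this article.
COLUMN_MAPPINGS = {
    "pair": {"sentence1_column": "anchor", "sentence2_column": "positive"},
    "pair_class": {
        "sentence1_column": "premise",
        "sentence2_column": "hypothesis",
        "target_column": "label",
    },
    "pair_score": {
        "sentence1_column": "sentence1",
        "sentence2_column": "sentence2",
        "target_column": "score",
    },
    "triplet": {
        "sentence1_column": "anchor",
        "sentence2_column": "positive",
        "sentence3_column": "negative",
    },
    "qa": {"sentence1_column": "query", "sentence2_column": "answer"},
}


def config_fragment(task):
    """Return the task line and column mapping as a YAML fragment."""
    mapping = COLUMN_MAPPINGS[task]  # raises KeyError for an unknown task
    lines = [f"task: sentence-transformers:{task}", "", "data:", "  column_mapping:"]
    lines += [f"    {key}: {value}" for key, value in mapping.items()]
    return "\n".join(lines)


print(config_fragment("pair_score"))
```

The printed fragment only covers the keys that differ between tasks. The remaining keys (base model, data path, splits, params, hub) still come from the full config.yml.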

Train on Hugging Face Spaces via AutoTrain UI

If you want to train on Hugging Face Spaces using the AutoTrain UI, you can click here, attach the appropriate hardware, create the Space, and choose the correct task. The resulting UI will look like:

[Image: screenshot of the AutoTrain UI]

Clicking on Start Training will start the training process :)

P.S. You can do the same on Google Colab! Check out the GitHub repo to learn more.

Summary

By following these steps and ensuring proper data formatting and column mapping, you can effectively fine-tune custom embedding models using AutoTrain, whether through the web interface or the command-line interface. With a straightforward configuration file, you can leverage AutoTrain to train powerful Sentence Transformer models tailored to your specific needs.

Happy training!