Introduction

This model is based on dunzhang/stella_en_1.5B_v5 and google/siglip-so400m-patch14-384.

It can encode both text and images.

The training code, data, and report will be released soon.

The core training code will be integrated into the rag-retrieval library (https://github.com/NLPJCL/RAG-Retrieval) in the near future. (Stars are welcome!)

This work was done in my spare time, so please be patient.

Here's a short introduction to the training method:

The core idea of jasper and stella is distillation: let the student model learn the teacher models' vectors. The training process of jasper has four stages:

Stages 1 & 2: Distill from teacher vectors. For the jasper model, the teacher models are nvidia/NV-Embed-v2 and dunzhang/stella_en_1.5B_v5 (Stage 1 and Stage 2 freeze different parameters).
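
For illustration, here is a minimal sketch of what such a vector-distillation objective can look like (the cosine loss and the helper name distill_loss are assumptions for this example, not the released training code):

import torch
import torch.nn.functional as F

def distill_loss(student_vecs: torch.Tensor, teacher_vecs: torch.Tensor) -> torch.Tensor:
    # Assumed objective: maximize cosine similarity between each student vector
    # and the corresponding teacher vector. Both inputs are (batch, dim); if the
    # student and teacher dimensions differ, a projection would be needed first.
    student = F.normalize(student_vecs, dim=-1)
    teacher = F.normalize(teacher_vecs, dim=-1)
    return (1.0 - (student * teacher).sum(dim=-1)).mean()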

Stage 3: MRL (Matryoshka Representation Learning) training. I made some modifications to MRL to enable training on unsupervised text.
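
The exact modification is not spelled out here, so the sketch below is only my reading: keep distilling from the teacher at several truncated embedding sizes, which requires no labels and therefore works on unsupervised text. Matching in-batch similarity matrices (rather than raw vectors) is an assumption that sidesteps the student/teacher dimension mismatch, and the sizes in dims are illustrative.

import torch
import torch.nn.functional as F

def mrl_distill_loss(student_vecs, teacher_vecs, dims=(256, 512, 1024)):
    # Matryoshka-style distillation on unlabeled text: the first d components
    # of each student vector are trained as a standalone d-dim embedding by
    # matching the teacher's in-batch cosine-similarity matrix.
    teacher = F.normalize(teacher_vecs, dim=-1)
    teacher_sim = teacher @ teacher.T
    loss = 0.0
    for d in dims:
        student = F.normalize(student_vecs[:, :d], dim=-1)
        loss = loss + F.mse_loss(student @ student.T, teacher_sim)
    return loss / len(dims)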

Stage 4: Alignment between jasper's token embeddings of an image's detailed caption and the vision embeddings from google/siglip-so400m-patch14-384.

I use an AdaptiveAvgPool2d to adjust the number and dimension of the vision tokens; this method needs no additional parameters.
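
A minimal sketch of that parameter-free resizing (the target shape of 196 tokens by 1024 dims is an illustrative assumption):

import torch

def pool_vision_tokens(vision_tokens: torch.Tensor, out_tokens: int, out_dim: int) -> torch.Tensor:
    # vision_tokens: (batch, num_tokens, dim). AdaptiveAvgPool2d treats the last
    # two axes as a 2-D map and average-pools it to (out_tokens, out_dim),
    # introducing no learnable parameters.
    return torch.nn.AdaptiveAvgPool2d((out_tokens, out_dim))(vision_tokens)

# siglip-so400m-patch14-384 produces 729 patch tokens with hidden size 1152
tokens = torch.randn(2, 729, 1152)
print(pool_vision_tokens(tokens, 196, 1024).shape)  # torch.Size([2, 196, 1024])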

The point of distillation is to get better results from smaller models, or to serve as a form of pre-training; it is not about topping leaderboards. In fact, I have reached first place on MTEB (both Chinese and English), but I will not release those two models: as noted above, doing so would be meaningless, and they generalize poorly.

Usage

import torch
from sentence_transformers import SentenceTransformer


DOC1 = """
Blue light is scattered in all directions by the tiny molecules of air in Earth's atmosphere. 
Blue is scattered more than other colors because it travels as shorter, smaller waves. This is why we see a blue sky most of the time. 
Closer to the horizon, the sky fades to a lighter blue or white.
"""
DOC2 = """
When choosing colors, you can consider the following factors:
Color theory: Understand how colors work together and how they can evoke different reactions. 
Color psychology: Consider how colors affect emotions, behaviors, and responses. 
Brand identity: Colors can convey meaning and information about a brand. 
Mood: Consider the mood you want to create. For example, brighter colors can feel cheerful, while cooler colors can be calming.
Space: Consider the size of the space and the amount of natural light it receives. Dark colors can make a room feel smaller, while light colors can make it feel larger.
Color wheel: Use the color wheel to identify primary, secondary, and tertiary colors. 
Color combinations: Decide how to best complement your preferred color with others. 
Color palette: Limit your color palette to a main color and one or two additional colors. 
60-30-10 rule: Use a primary color 60% of the time, a secondary color 30% of the time, and an accent color 10% of the time
"""
if __name__ == "__main__":
    # load model
    use_gpu = False
    model_name = "infgrad/jasper_en_vision_language_v1"
    model = SentenceTransformer(
        model_name,
        trust_remote_code=True,
        device="cpu" if not use_gpu else "cuda",
        model_kwargs={
            "torch_dtype": torch.bfloat16 if use_gpu else torch.float32,
            "attn_implementation": "sdpa"
        },
        # vector_dim must be one of 12288, 1024, 512, or 256; 1024 is recommended
        # set is_text_encoder=True if you only encode text (no images)
        config_kwargs={"is_text_encoder": False, "vector_dim": 1024},
    )
    # We can reduce the max_seq_length from the default of 2048 for faster encoding
    model.max_seq_length = 1024

    # data
    q_list = [
        "Why the sky is blue?",
        "how to choose suitable color",
    ]
    doc_list = [
        DOC1,
        [{"type": "image_path", "content": "./assets/img1.png"}, {"type": "text", "content": "Hope this image helps!"}],
        DOC2,
        [{"type": "image_path", "content": "./assets/img2.png"}],
    ]
    q_vecs = model.encode(q_list, prompt_name="s2p_query")
    doc_vecs = model.encode(doc_list)

    # calculate similarity
    similarities = model.similarity(q_vecs, doc_vecs)
    print(similarities)
    # the output is:
    # tensor([[0.7775, 0.7594, 0.2429, 0.2187],
    #         [0.3226, 0.3054, 0.7421, 0.5484]])

Evaluation on MTEB

Script: ./scripts/evaluate_en_mteb/run_evaluate_mteb.py

License

This model should not be used for any commercial purpose!
