We are very proud to introduce jinaai/jina-clip-v1, aka "jina-embeddings-multimodal".
OpenAI CLIP openai/clip-vit-base-patch32 does a good job of aligning the text and image modalities, so users can build cross-modal text-image retrieval or image classification on top of it. However, due to its training data and recipe, it cannot:
1. model longer text inputs (77-token constraint);
2. produce strong text-only representations (the CLIP text tower is weak for text search).
jina-clip-v1 addresses both issues:
1. Stronger cross-modal performance than OpenAI CLIP: 2% and 6% improvements in cross-modal retrieval recall@5.
2. The text tower of JinaCLIP is a strong text encoder, reaching the same performance as jinaai/jina-embeddings-v2-base-en: a 165% improvement in MTEB [BEIR] recall@5.
3. The image tower of JinaCLIP also shows strong performance in image-to-image search (CBIR): a 12% recall improvement on the CIFAR-100 test set.
If you are working on MuRAG (multimodal retrieval-augmented generation), try it out!
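Here is a minimal cross-modal retrieval sketch to get started. It assumes the encode_text/encode_image helpers the model exposes via trust_remote_code and uses placeholder image URLs; check the model card for the exact API.

```python
# Minimal sketch: cross-modal text-image retrieval with jina-clip-v1.
# Assumes the custom encode_text / encode_image helpers loaded via
# trust_remote_code; image URLs below are placeholders for illustration.
import numpy as np
from transformers import AutoModel

model = AutoModel.from_pretrained("jinaai/jina-clip-v1", trust_remote_code=True)

queries = ["a photo of a golden retriever", "a cargo ship at sea"]
image_urls = [
    "https://example.com/dog.jpg",
    "https://example.com/ship.jpg",
]

text_emb = np.asarray(model.encode_text(queries))      # shape: (2, dim)
image_emb = np.asarray(model.encode_image(image_urls))  # shape: (2, dim)

# Cosine similarity: L2-normalize both sides, then take dot products.
text_emb /= np.linalg.norm(text_emb, axis=1, keepdims=True)
image_emb /= np.linalg.norm(image_emb, axis=1, keepdims=True)
scores = text_emb @ image_emb.T

# Index of the best-matching image for each text query.
print(scores.argmax(axis=1))
```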
In a vector search setup, we normally combine a fast embedding model with an accurate but slow reranker model.
The newly released @jinaai rerankers are small and almost as accurate as our base reranker. This means that, within a given time budget, they can score more candidate documents from the embedding model and have a better chance of feeding the LLM the correct context for RAG generation.
These models are available on Hugging Face and have been integrated into the latest SentenceTransformers 2.7.0. Check them out!
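To illustrate the two-stage setup, here is a minimal retrieve-then-rerank sketch using the Sentence Transformers integration. The specific reranker checkpoint (jinaai/jina-reranker-v1-turbo-en) and the toy corpus are assumptions for illustration; any of the Jina rerankers should slot in the same way.

```python
# Minimal retrieve-then-rerank sketch with Sentence Transformers >= 2.7.0.
# Stage 1: fast bi-encoder retrieval; Stage 2: slower, more accurate
# cross-encoder rescoring of the shortlisted candidates.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

embedder = SentenceTransformer("jinaai/jina-embeddings-v2-base-en", trust_remote_code=True)
reranker = CrossEncoder("jinaai/jina-reranker-v1-turbo-en", trust_remote_code=True)

docs = [
    "JinaCLIP aligns text and image embeddings in one space.",
    "Rerankers score query-document pairs with a cross-encoder.",
    "CIFAR-100 is a small image classification benchmark.",
]
query = "How does a reranker improve RAG retrieval?"

# Stage 1: embed the corpus and the query, retrieve top candidates.
doc_emb = embedder.encode(docs, convert_to_tensor=True)
query_emb = embedder.encode(query, convert_to_tensor=True)
hits = util.semantic_search(query_emb, doc_emb, top_k=3)[0]

# Stage 2: rescore the candidates with the reranker and sort by score.
pairs = [(query, docs[hit["corpus_id"]]) for hit in hits]
scores = reranker.predict(pairs)
reranked = sorted(zip(scores, pairs), key=lambda x: x[0], reverse=True)

# The top document is what you would feed the LLM as context.
print(reranked[0][1][1])
```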