license: cc-by-nc-4.0
tags:
- feature-extraction
- sentence-similarity
- mteb
language:
- multilingual
- af
- am
- ar
- as
- az
- be
- bg
- bn
- br
- bs
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- id
- is
- it
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lo
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- 'no'
- om
- or
- pa
- pl
- ps
- pt
- ro
- ru
- sa
- sd
- si
- sk
- sl
- so
- sq
- sr
- su
- sv
- sw
- ta
- te
- th
- tl
- tr
- ug
- uk
- ur
- uz
- vi
- xh
- yi
- zh
inference: false
library_name: transformers
The embedding set trained by Jina AI.
Jina Embedding V3: A Multilingual Multi-Task Embedding Model
Quick Start
The easiest way to start using jina-embeddings-v3 is to use Jina AI's Embedding API.
Intended Usage & Model Info
jina-embeddings-v3 is a multilingual multi-task text embedding model designed for a variety of NLP applications.
Based on the XLM-RoBERTa architecture,
this model supports Rotary Position Embeddings (RoPE) to handle long sequences up to 8192 tokens.
Additionally, it features LoRA adapters to generate task-specific embeddings efficiently.
Key Features:
- Extended Sequence Length: Supports up to 8192 tokens with RoPE.
- Task-Specific Embedding: Customize embeddings through the task_type argument with the following options:
  - retrieval.query: Query encoding for asymmetric retrieval tasks
  - retrieval.passage: Passage encoding for asymmetric retrieval tasks
  - separation: For clustering and re-ranking applications
  - classification: For classification tasks
  - text-matching: For measuring textual similarity
- Matryoshka Embeddings: Supports flexible embedding sizes (32, 64, 128, 256, 512, 768, 1024), so you can truncate embeddings to fit your application.
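To illustrate what truncating a Matryoshka embedding involves, here is a minimal sketch that keeps the leading components of a vector and rescales the result to unit length. The helper name truncate_embedding and the placeholder vector are hypothetical and are not part of the model's API. Whether the model's own truncation re-normalizes is not stated here, so the rescaling step is an assumption.

```python
import math

# Sizes listed under Key Features above.
MATRYOSHKA_DIMS = (32, 64, 128, 256, 512, 768, 1024)


def truncate_embedding(vector, dim):
    """Keep the first `dim` components and rescale to unit L2 norm.

    Hypothetical helper for illustration; not part of the model's API.
    """
    if dim not in MATRYOSHKA_DIMS:
        raise ValueError(f"dim must be one of {MATRYOSHKA_DIMS}, got {dim}")
    if dim > len(vector):
        raise ValueError("dim exceeds the embedding length")
    head = list(vector[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # all-zero prefix: nothing to rescale
    return [x / norm for x in head]


# Placeholder values standing in for a real 1024-dimensional model output.
full = [math.sin(i + 1) for i in range(1024)]
small = truncate_embedding(full, 64)
print(len(small))
```

Smaller sizes reduce storage and speed up similarity search, usually at some cost in accuracy, so the right size depends on the application.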
Model Lineage:
jina-embeddings-v3 builds upon the FacebookAI/xlm-roberta-large model, which was originally trained on 100 languages.
We extended its capabilities with an extra pretraining phase on the CulturaX dataset,
then contrastively fine-tuned it on 30 languages for enhanced performance in both monolingual and cross-lingual setups.
Supported Languages:
While the base model supports 100 languages, we've focused our tuning efforts on the following 30 languages to maximize performance: Arabic, Bengali, Chinese, Danish, Dutch, English, Finnish, French, Georgian, German, Greek, Hindi, Indonesian, Italian, Japanese, Korean, Latvian, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Spanish, Swedish, Thai, Turkish, Ukrainian, Urdu, and Vietnamese.
Data & Parameters
The data and training details are described in the technical report (coming soon).
Usage
- The easiest way to start using jina-embeddings-v3 is to use Jina AI's Embedding API.
- Alternatively, you can use the model directly via the transformers package.
!pip install transformers einops flash_attn
from transformers import AutoModel
# Initialize the model
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v3', trust_remote_code=True)
# Example sentences in English, German, Spanish, Chinese, and Japanese
sentences = [
"Organic skincare for sensitive skin with aloe vera and chamomile.",
"New makeup trends focus on bold colors and innovative techniques",
"Bio-Hautpflege für empfindliche Haut mit Aloe Vera und Kamille",
"Neue Make-up-Trends setzen auf kräftige Farben und innovative Techniken",
"Cuidado de la piel orgánico para piel sensible con aloe vera y manzanilla",
"Las nuevas tendencias de maquillaje se centran en colores vivos y técnicas innovadoras",
"针对敏感肌专门设计的天然有机护肤产品",
"新的化妆趋势注重鲜艳的颜色和创新的技巧",
"敏感肌のために特別に設計された天然有機スキンケア製品",
"新しいメイクのトレンドは鮮やかな色と革新的な技術に焦点を当てています",
]
# Encode sentences; task_type must be one of the options listed under Key Features
embeddings = model.encode(sentences, truncate_dim=1024, task_type='text-matching')
# Compute the similarity between the first two sentences
print(embeddings[0] @ embeddings[1].T)
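The dot product above equals cosine similarity only when the embeddings have unit length. As a hedged sketch that does not depend on that assumption, the following normalizes explicitly and ranks candidates against a query. The function names and the toy 3-dimensional vectors are hypothetical stand-ins, not model outputs.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # similarity is undefined for zero vectors; report 0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, candidates):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query, c)) for i, c in enumerate(candidates)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional vectors standing in for real embeddings.
query = [1.0, 0.0, 0.0]
candidates = [[0.0, 1.0, 0.0], [2.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
ranking = rank_by_similarity(query, candidates)
print([i for i, _ in ranking])  # prints [1, 2, 0]
```

With real embeddings, the same ranking can be applied to each row returned by model.encode.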
Performance
TODO UPDATE THIS
Contact
Join our Discord community and chat with other community members about ideas.
Citation
If you find jina-embeddings-v3 useful in your research, please cite the following paper: