---
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
language:
- es
- en
inference: false
license: apache-2.0
---
The text embedding set trained by Jina AI.
## Quick Start The easiest way to starting using `jina-embeddings-v2-base-de` is to use Jina AI's [Embedding API](https://jina.ai/embeddings/). ## Intended Usage & Model Info `jina-embeddings-v2-base-es` is a Spanish/English bilingual text **embedding model** supporting **8192 sequence length**. It is based on a BERT architecture (JinaBERT) that supports the symmetric bidirectional variant of [ALiBi](https://arxiv.org/abs/2108.12409) to allow longer sequence length. We have designed it for high performance in mono-lingual & cross-lingual applications and trained it specifically to support mixed Spanish-English input without bias. Additionally, we provide the following embedding models: `jina-embeddings-v2-base-es` es un modelo (embedding) de texto bilingüe Inglés/Español que admite una longitud de secuencia de 8192. Se basa en la arquitectura BERT (JinaBERT) que incorpora la variante bi-direccional simétrica de [ALiBi](https://arxiv.org/abs/2108.12409) para permitir una mayor longitud de secuencia. Hemos diseñado este modelo para un alto rendimiento en aplicaciones monolingües y bilingües, y está entrenando específicamente para admitir entradas mixtas de español e inglés sin sesgo. Adicionalmente, proporcionamos los siguientes modelos (embeddings): - [`jina-embeddings-v2-small-en`](https://huggingface.co/jinaai/jina-embeddings-v2-small-en): 33 million parameters. - [`jina-embeddings-v2-base-en`](https://huggingface.co/jinaai/jina-embeddings-v2-base-en): 137 million parameters. - [`jina-embeddings-v2-base-zh`](): Chinese-English Bilingual embeddings (soon). - [`jina-embeddings-v2-base-de`](): German-English Bilingual embeddings (soon). - [`jina-embeddings-v2-base-es`](): Spanish-English Bilingual embeddings (soon) **(you are here)**. ## Data & Parameters Jina Embeddings V2 [technical report](https://arxiv.org/abs/2310.19923) ## Usage **### Why mean pooling? `mean poooling` takes all token embeddings from model output and averaging them at sentence/paragraph level. It has been proved to be the most effective way to produce high-quality sentence embeddings. We offer an `encode` function to deal with this. However, if you would like to do it without using the default `encode` function: ```python import torch import torch.nn.functional as F from transformers import AutoTokenizer, AutoModel def mean_pooling(model_output, attention_mask): token_embeddings = model_output[0] input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float() return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9) sentences = ['How is the weather today?', 'What is the current weather like today?'] tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-base-es') model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-es', trust_remote_code=True) encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt') with torch.no_grad(): model_output = model(**encoded_input) embeddings = mean_pooling(model_output, encoded_input['attention_mask']) embeddings = F.normalize(embeddings, p=2, dim=1) ```