At @jinaai, we've recently launched a new model: jinaai/jina-colbert-v1-en. In this post, I'd like to give you a quick introduction to ColBERT: the multi-vector, late-interaction retriever.
As you may already know, we've been developing embedding models such as jinaai/jina-embeddings-v2-base-en for some time. These models, often called 'dense retrievers', generate a single representation for each document.
Embedding models like Jina-v2 have the advantage of quick integration with vector databases and good performance within a specific domain.
Within a specific domain, embedding models can perform very well because they have "seen similar distributions" during training. The flip side is that they may perform only "okay" on out-of-domain tasks and require fine-tuning.
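To make the "single representation" idea concrete, here is a minimal sketch of dense retrieval with toy numbers (the arrays stand in for real model outputs, and 2 dimensions replace the actual 768): all token embeddings are pooled into one vector per text, and relevance is a single cosine similarity between the two pooled vectors.

```python
import numpy as np

# Toy stand-ins for model outputs; a dense retriever collapses all token
# embeddings into ONE vector per text (here via mean pooling).
query_token_embs = np.array([[0.1, 0.9], [0.8, 0.2]])            # 2 query tokens
doc_token_embs = np.array([[0.2, 0.8], [0.7, 0.3], [0.5, 0.5]])  # 3 doc tokens

def mean_pool(token_embs: np.ndarray) -> np.ndarray:
    """Collapse token embeddings into a single dense vector."""
    return token_embs.mean(axis=0)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two dense vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

score = cosine(mean_pool(query_token_embs), mean_pool(doc_token_embs))
print(score)
```

Note that all token-level detail is gone by the time the similarity is computed: only one vector per text survives, which is exactly what ColBERT changes.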
Now, let's delve into multi-vector search and late-interaction models. The idea is quite simple:
1. During training, a linear projection reduces each token embedding's dimensionality from 768 to 128 to save storage.
2. At scoring time, given one query and one document, match each query token embedding against every token embedding in the document and keep the maximum similarity score. Repeat this for every token in the query, from the first to the last, then sum up all the maximum similarity scores.
This process is called multi-vector search because if your query has 5 tokens, you're keeping 5 token embeddings of 128 dimensions each (5 × 128 numbers) rather than one vector. The "max similarity sum-up" procedure is termed late interaction (MaxSim in the ColBERT paper).
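The MaxSim scoring described above can be sketched in a few lines (toy 2-dimensional embeddings here instead of ColBERT's 128, chosen for readability; real implementations also normalize the vectors):

```python
import numpy as np

# Per-token embeddings, one row per token (dim 2 for readability).
query_embs = np.array([[1.0, 0.0], [0.0, 1.0]])            # 2 query tokens
doc_embs = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])  # 3 doc tokens

def maxsim_score(q: np.ndarray, d: np.ndarray) -> float:
    # (num_query_tokens, num_doc_tokens) matrix of dot-product similarities.
    sim = q @ d.T
    # For each query token, keep only its best-matching document token,
    # then sum these maxima over all query tokens.
    return float(sim.max(axis=1).sum())

print(maxsim_score(query_embs, doc_embs))
```

Because the query-document interaction happens only at this final cheap step (after both sides are encoded independently), document embeddings can be precomputed and indexed offline, which is what makes the "late" in late interaction practical.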
Multi-vector & Late interaction retrievers have the advantage of:
1. Strong out-of-domain performance, since they match at a token-level granularity.
2. Explainability: you can inspect the token-level matches and understand why a score is higher or lower.