spacemanidol committed • commit 1f09ade (parent: 639bf59)
Update README.md

README.md CHANGED
@@ -5603,8 +5603,7 @@ model-index:
       value: 85.30624598674467
 license: apache-2.0
 ---
-
-<h1 align="center">Snowflake's Arctic-embed-s</h1>
+<h1 align="center">Snowflake's Arctic-embed-m</h1>
 <h4 align="center">
    <p>
        <a href=#news>News</a> |
@@ -5639,10 +5638,10 @@ The models are trained by leveraging existing open-source text representation mo
 
 | Name                                                                    | MTEB Retrieval Score (NDCG @ 10) | Parameters (Millions) | Embedding Dimension |
 | ----------------------------------------------------------------------- | -------------------------------- | --------------------- | ------------------- |
-| [arctic-embed-…
+| [arctic-embed-xs](https://huggingface.co/Snowflake/arctic-embed-xs/) | 50.15 | 22 | 384 |
 | [arctic-embed-s](https://huggingface.co/Snowflake/arctic-embed-s/) | 51.98 | 33 | 384 |
-| [arctic-embed-…
-| [arctic-embed-…
+| [arctic-embed-m](https://huggingface.co/Snowflake/arctic-embed-m/) | 54.90 | 110 | 768 |
+| [arctic-embed-m-long](https://huggingface.co/Snowflake/arctic-embed-m-long/) | 54.83 | 137 | 768 |
 | [arctic-embed-l](https://huggingface.co/Snowflake/arctic-embed-l/) | 55.98 | 335 | 1024 |
 
 
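Every comparison table in this card ranks models by MTEB Retrieval Score (NDCG @ 10). For readers unfamiliar with the metric, here is a minimal sketch of a single-query NDCG @ 10 computation; the helper below is illustrative only and is neither the card's nor MTEB's implementation.

``` py
import math

def ndcg_at_10(relevances):
    # `relevances` lists the graded relevance of the returned documents,
    # best-ranked first. DCG discounts the gain at 1-based rank r by
    # log2(r + 1); dividing by the ideal ordering's DCG normalizes to [0, 1].
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:10]))
    ideal = sorted(relevances, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:10]))
    return dcg / idcg if idcg > 0 else 0.0

# The only relevant document ranked third scores 1/log2(5), about 0.43.
print(ndcg_at_10([0, 0, 1, 0]))
```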
@@ -5651,32 +5650,32 @@ Aside from being great open-source models, the largest model, [arctic-embed-l](h
 
 | Model Name                                                        | MTEB Retrieval Score (NDCG @ 10) |
 | ------------------------------------------------------------------ | -------------------------------- |
-| [arctic-embed-…
+| [arctic-embed-l](https://huggingface.co/Snowflake/arctic-embed-l/) | 55.98 |
 | Google-gecko-text-embedding | 55.7 |
 | text-embedding-3-large | 55.44 |
 | Cohere-embed-english-v3.0 | 55.00 |
 | bge-large-en-v1.5 | 54.29 |
 
 
-### […
+### [Arctic-embed-xs](https://huggingface.co/Snowflake/arctic-embed-xs)
 
 
-This tiny model packs quite the punch…
+This tiny model packs quite the punch. Based on the [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model with only 22m parameters and 384 dimensions, this model should meet even the strictest latency/TCO budgets. Despite its size, its retrieval accuracy is closer to that of models with 100m parameters.
 
 
 | Model Name                                                          | MTEB Retrieval Score (NDCG @ 10) |
 | ------------------------------------------------------------------- | -------------------------------- |
-| [arctic-embed-…
+| [arctic-embed-xs](https://huggingface.co/Snowflake/arctic-embed-xs/) | 50.15 |
 | GIST-all-MiniLM-L6-v2 | 45.12 |
 | gte-tiny | 44.92 |
 | all-MiniLM-L6-v2 | 41.95 |
 | bge-micro-v2 | 42.56 |
 
 
-### Arctic-embed-…
+### [Arctic-embed-s](https://huggingface.co/Snowflake/arctic-embed-s)
 
 
-Based on the [all-MiniLM-L12-v2](https://huggingface.co/intfloat/e5-base-unsupervised)…
+Based on the [all-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2) model, this small model does not trade off retrieval accuracy for its small size. With only 33m parameters and 384 dimensions, this model should easily allow scaling to large datasets.
 
 
 | Model Name                                                        | MTEB Retrieval Score (NDCG @ 10) |
@@ -5688,37 +5687,36 @@ Based on the [all-MiniLM-L12-v2](https://huggingface.co/intfloat/e5-base-unsuper
 | e5-small-v2 | 49.04 |
 
 
-### […
+### [Arctic-embed-m](https://huggingface.co/Snowflake/arctic-embed-m/)
 
 
-Based on the […
+Based on the [intfloat/e5-base-unsupervised](https://huggingface.co/intfloat/e5-base-unsupervised) model, this medium model is the workhorse that provides the best retrieval performance without slowing down inference.
 
 
 | Model Name                                                        | MTEB Retrieval Score (NDCG @ 10) |
 | ------------------------------------------------------------------ | -------------------------------- |
-| [arctic-embed-…
+| [arctic-embed-m](https://huggingface.co/Snowflake/arctic-embed-m/) | 54.90 |
 | bge-base-en-v1.5 | 53.25 |
-| nomic-embed-text-v1.5 | 53.…
+| nomic-embed-text-v1.5 | 53.25 |
 | GIST-Embedding-v0 | 52.31 |
 | gte-base | 52.31 |
 
-
-### Arctic-embed-m…
+### [arctic-embed-m-long](https://huggingface.co/Snowflake/arctic-embed-m-long/)
 
 
-Based on the […
+Based on the [nomic-embed-text-v1](https://huggingface.co/nomic-ai/nomic-embed-text-v1) model, this long-context variant of our medium-sized model is perfect for workloads constrained by the regular 512-token context of our other models. Without the use of RPE, this model supports up to 2048 tokens. With RPE, it can scale to 8192!
 
 
 | Model Name                                                        | MTEB Retrieval Score (NDCG @ 10) |
 | ------------------------------------------------------------------ | -------------------------------- |
-| [arctic-embed-…
-…
-| nomic-embed-text-v1…
-…
-| gte-base | 52.31 |
+| [arctic-embed-m-long](https://huggingface.co/Snowflake/arctic-embed-m-long/) | 54.83 |
+| nomic-embed-text-v1.5 | 53.01 |
+| nomic-embed-text-v1 | 52.81 |
+
 
 
-…
+
+### [arctic-embed-l](https://huggingface.co/Snowflake/arctic-embed-l/)
 
 
 Based on the [intfloat/e5-large-unsupervised](https://huggingface.co/intfloat/e5-large-unsupervised) model, this large model delivers the strongest retrieval accuracy of the family.
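The arctic-embed-m-long entry above notes 2048 tokens without RPE and up to 8192 with it. The quick-start note that this commit removes (see the hunk near the end of this diff) showed the corresponding initialization; a minimal sketch of it follows, with the tokenizer line added for completeness.

``` py
from transformers import AutoModel, AutoTokenizer

# Long-context initialization from the removed quick-start note:
# rotary position embedding (RPE) scaling raises the usable context
# from 2048 to 8192 tokens. trust_remote_code is required because the
# model ships custom modeling code on the Hub.
tokenizer = AutoTokenizer.from_pretrained('Snowflake/arctic-embed-m-long')
model = AutoModel.from_pretrained(
    'Snowflake/arctic-embed-m-long',
    trust_remote_code=True,
    rotary_scaling_factor=2,
)
model.eval()
```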
@@ -5726,7 +5724,7 @@ Based on the [intfloat/e5-large-unsupervised](https://huggingface.co/intfloat/e5
 
 | Model Name                                                        | MTEB Retrieval Score (NDCG @ 10) |
 | ------------------------------------------------------------------ | -------------------------------- |
-| [arctic-embed-…
+| [arctic-embed-l](https://huggingface.co/Snowflake/arctic-embed-l/) | 55.98 |
 | UAE-Large-V1 | 54.66 |
 | bge-large-en-v1.5 | 54.29 |
 | mxbai-embed-large-v1 | 54.39 |
@@ -5747,7 +5745,7 @@ You can use the transformers package to use an arctic-embed model, as shown belo
 import torch
 from transformers import AutoModel, AutoTokenizer
 
-tokenizer = AutoTokenizer.from_pretrained('Snowflake/arctic-embed-')
+tokenizer = AutoTokenizer.from_pretrained('Snowflake/arctic-embed-s')
 model = AutoModel.from_pretrained('Snowflake/arctic-embed-s', add_pooling_layer=False)
 model.eval()
 
@@ -5779,15 +5777,6 @@ for query, query_scores in zip(queries, scores):
         print(score, document)
 ```
 
-
-If you use the long context model with more than 2048 tokens, ensure that you initialize the model like below instead. This will use [RPE](https://arxiv.org/abs/2104.09864) to allow up to 8192 tokens.
-
-
-``` py
-model = AutoModel.from_pretrained('Snowflake/arctic-embed-m-long', trust_remote_code=True, rotary_scaling_factor=2)
-```
-
-
 ## FAQ
 
 
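The quick-start hunks in this diff show only the head of the usage example (imports and model loading) and its tail (the score printout). For context, here is a minimal end-to-end sketch of how those fragments fit together; the query prefix string, the sample queries and documents, and the CLS-token pooling are assumptions drawn from how this model family is typically used, not verbatim card content.

``` py
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('Snowflake/arctic-embed-s')
model = AutoModel.from_pretrained('Snowflake/arctic-embed-s', add_pooling_layer=False)
model.eval()

# Assumed convention: queries get a retrieval prefix, documents do not.
query_prefix = 'Represent this sentence for searching relevant passages: '
queries = ['what is snowflake?']
documents = ['The Data Cloud!', 'Mexico City of Course!']

query_tokens = tokenizer([query_prefix + q for q in queries], padding=True,
                         truncation=True, return_tensors='pt', max_length=512)
document_tokens = tokenizer(documents, padding=True, truncation=True,
                            return_tensors='pt', max_length=512)

with torch.no_grad():
    # CLS-token embeddings, L2-normalized so a dot product is cosine similarity.
    query_embeddings = torch.nn.functional.normalize(model(**query_tokens)[0][:, 0])
    document_embeddings = torch.nn.functional.normalize(model(**document_tokens)[0][:, 0])

scores = query_embeddings @ document_embeddings.T
for query, query_scores in zip(queries, scores):
    doc_score_pairs = sorted(zip(documents, query_scores.tolist()),
                             key=lambda pair: pair[1], reverse=True)
    print('Query:', query)
    for document, score in doc_score_pairs:
        print(score, document)
```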
@@ -5815,4 +5804,4 @@ We thank our modeling engineers, Danmei Xu, Luke Merrick, Gaurav Nuti, and Danie
 We thank our leadership, Himabindu Pucha, Kelvin So, Vivek Raghunathan, and Sridhar Ramaswamy, for supporting this work.
 We also thank the open-source community for producing the great models we could build on top of and making these releases possible.
 Finally, we thank the researchers who created BEIR and MTEB benchmarks.
-It is largely thanks to their tireless work to define what better looks like that we could improve model performance.
+It is largely thanks to their tireless work to define what better looks like that we could improve model performance.