spacemanidol committed on
Commit
afdcd80
1 Parent(s): 12b9958

Update README.md

Files changed (1)
  1. README.md +25 -25
README.md CHANGED
@@ -6,9 +6,9 @@ tags:
6
  - sentence-similarity
7
  - mteb
8
  - arctic
9
- - arctic-embed
10
  model-index:
11
- - name: snowflake-arctic-embed-xs
12
  results:
13
  - task:
14
  type: Classification
@@ -2822,16 +2822,16 @@ model-index:
2822
  ## News
2823
 
2824
 
2825
- 04/16/2024: Released the **Arctic-embed** family of text embedding models. The releases are state-of-the-art for retrieval quality at each of their respective size profiles. [Technical Report]() is coming shortly. For more details, please refer to our GitHub: [Arctic-Text-Embed](https://github.com/Snowflake-Labs/arctic-embed).
2826
 
2827
 
2828
  ## Models
2829
 
2830
 
2831
- Arctic-Embed is a suite of text embedding models that focuses on creating high-quality retrieval models optimized for performance.
2832
 
2833
 
2834
- The `arctic-embedding` models achieve **state-of-the-art performance on the MTEB/BEIR leaderboard** for each of their size variants. Evaluation is performed using these [scripts](https://github.com/Snowflake-Labs/arctic-embed/tree/main/src). As shown below, each class of model size achieves SOTA retrieval accuracy compared to other top models.
2835
 
2836
 
2837
  The models are trained by leveraging existing open-source text representation models, such as bert-base-uncased, and are trained in a multi-stage pipeline to optimize their retrieval performance. First, the models are trained with large batches of query-document pairs where negatives are derived in-batch. This pretraining stage leverages about 400m samples from a mix of public datasets and proprietary web search data. Following pretraining, the models are further optimized with long training on a smaller dataset (about 1m samples) of triplets of query, positive document, and negative document derived from hard negative mining. Mining of the negatives and data curation are crucial to retrieval accuracy. A detailed technical report will be available shortly.
@@ -2839,26 +2839,26 @@ The models are trained by leveraging existing open-source text representation mo
2839
 
2840
  | Name | MTEB Retrieval Score (NDCG @ 10) | Parameters (Millions) | Embedding Dimension |
2841
  | ----------------------------------------------------------------------- | -------------------------------- | --------------------- | ------------------- |
2842
- | [arctic-embed-xs](https://huggingface.co/Snowflake/arctic-embed-xs/) | 50.15 | 22 | 384 |
2843
- | [arctic-embed-s](https://huggingface.co/Snowflake/arctic-embed-s/) | 51.98 | 33 | 384 |
2844
- | [arctic-embed-m](https://huggingface.co/Snowflake/arctic-embed-m/) | 54.90 | 110 | 768 |
2845
- | [arctic-embed-m-long](https://huggingface.co/Snowflake/arctic-embed-m-long/) | 54.83 | 137 | 768 |
2846
- | [arctic-embed-l](https://huggingface.co/Snowflake/arctic-embed-l/) | 55.98 | 335 | 1024 |
2847
 
2848
 
2849
- Aside from being great open-source models, the largest model, [arctic-embed-l](https://huggingface.co/Snowflake/arctic-embed-l/), can serve as a natural replacement for closed-source embedding, as shown below.
2850
 
2851
 
2852
  | Model Name | MTEB Retrieval Score (NDCG @ 10) |
2853
  | ------------------------------------------------------------------ | -------------------------------- |
2854
- | [arctic-embed-l](https://huggingface.co/Snowflake/arctic-embed-l/) | 55.98 |
2855
  | Google-gecko-text-embedding | 55.7 |
2856
  | text-embedding-3-large | 55.44 |
2857
  | Cohere-embed-english-v3.0 | 55.00 |
2858
  | bge-large-en-v1.5 | 54.29 |
2859
 
2860
 
2861
- ### [Arctic-embed-xs](https://huggingface.co/Snowflake/arctic-embed-xs)
2862
 
2863
 
2864
  This tiny model packs quite the punch. Based on the [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model with only 22m parameters and 384 dimensions, this model should meet even the strictest latency/TCO budgets. Despite its size, its retrieval accuracy is closer to that of models with 100m parameters.
@@ -2866,14 +2866,14 @@ This tiny model packs quite the punch. Based on the [all-MiniLM-L6-v2](https://h
2866
 
2867
  | Model Name | MTEB Retrieval Score (NDCG @ 10) |
2868
  | ------------------------------------------------------------------- | -------------------------------- |
2869
- | [arctic-embed-xs](https://huggingface.co/Snowflake/arctic-embed-xs/) | 50.15 |
2870
  | GIST-all-MiniLM-L6-v2 | 45.12 |
2871
  | gte-tiny | 44.92 |
2872
  | all-MiniLM-L6-v2 | 41.95 |
2873
  | bge-micro-v2 | 42.56 |
2874
 
2875
 
2876
- ### [Arctic-embed-s](https://huggingface.co/Snowflake/arctic-embed-s)
2877
 
2878
 
2879
  Based on the [intfloat/e5-small-unsupervised](https://huggingface.co/intfloat/e5-small-unsupervised) model, this small model does not trade off retrieval accuracy for its small size. With only 33m parameters and 384 dimensions, this model should easily allow scaling to large datasets.
@@ -2881,14 +2881,14 @@ Based on the [intfloat/e5-small-unsupervised](https://huggingface.co/intfloat/e5-
2881
 
2882
  | Model Name | MTEB Retrieval Score (NDCG @ 10) |
2883
  | ------------------------------------------------------------------ | -------------------------------- |
2884
- | [arctic-embed-s](https://huggingface.co/Snowflake/arctic-embed-s/) | 51.98 |
2885
  | bge-small-en-v1.5 | 51.68 |
2886
  | Cohere-embed-english-light-v3.0 | 51.34 |
2887
  | text-embedding-3-small | 51.08 |
2888
  | e5-small-v2 | 49.04 |
2889
 
2890
 
2891
- ### [Arctic-embed-m](https://huggingface.co/Snowflake/arctic-embed-m/)
2892
 
2893
 
2894
  Based on the [intfloat/e5-base-unsupervised](https://huggingface.co/intfloat/e5-base-unsupervised) model, this medium model is the workhorse that provides the best retrieval performance without slowing down inference.
@@ -2896,13 +2896,13 @@ Based on the [intfloat/e5-base-unsupervised](https://huggingface.co/intfloat/e5-
2896
 
2897
  | Model Name | MTEB Retrieval Score (NDCG @ 10) |
2898
  | ------------------------------------------------------------------ | -------------------------------- |
2899
- | [arctic-embed-m](https://huggingface.co/Snowflake/arctic-embed-m/) | 54.90 |
2900
  | bge-base-en-v1.5 | 53.25 |
2901
  | nomic-embed-text-v1.5 | 53.25 |
2902
  | GIST-Embedding-v0 | 52.31 |
2903
  | gte-base | 52.31 |
2904
 
2905
- ### [arctic-embed-m-long](https://huggingface.co/Snowflake/arctic-embed-m-long/)
2906
 
2907
 
2908
  Based on the [nomic-embed-text-v1-unsupervised](https://huggingface.co/nomic-ai/nomic-embed-text-v1-unsupervised) model, this long-context variant of our medium-sized model is perfect for workloads that can be constrained by the regular 512 token context of our other models. Without the use of RPE, this model supports up to 2048 tokens. With RPE, it can scale to 8192!
@@ -2910,14 +2910,14 @@ Based on the [nomic-embed-text-v1-unsupervised](https://huggingface.co/nomic-ai/
2910
 
2911
  | Model Name | MTEB Retrieval Score (NDCG @ 10) |
2912
  | ------------------------------------------------------------------ | -------------------------------- |
2913
- | [arctic-embed-m-long](https://huggingface.co/Snowflake/arctic-embed-m-long/) | 54.83 |
2914
  | nomic-embed-text-v1.5 | 53.01 |
2915
  | nomic-embed-text-v1 | 52.81 |
2916
 
2917
 
2918
 
2919
 
2920
- ### [arctic-embed-l](https://huggingface.co/Snowflake/arctic-embed-l/)
2921
 
2922
 
2923
  Based on the [intfloat/e5-large-unsupervised](https://huggingface.co/intfloat/e5-large-unsupervised) model, this large model is the most accurate in the family, at the cost of the largest size (335m parameters, 1024 dimensions).
@@ -2925,7 +2925,7 @@ Based on the [intfloat/e5-large-unsupervised](https://huggingface.co/intfloat/e5
2925
 
2926
  | Model Name | MTEB Retrieval Score (NDCG @ 10) |
2927
  | ------------------------------------------------------------------ | -------------------------------- |
2928
- | [arctic-embed-l](https://huggingface.co/Snowflake/arctic-embed-l/) | 55.98 |
2929
  | UAE-Large-V1 | 54.66 |
2930
  | bge-large-en-v1.5 | 54.29 |
2931
  | mxbai-embed-large-v1 | 54.39 |
@@ -2938,7 +2938,7 @@ Based on the [intfloat/e5-large-unsupervised](https://huggingface.co/intfloat/e5
2938
  ### Using Huggingface transformers
2939
 
2940
 
2941
- You can use the transformers package to use an arctic-embed model, as shown below. For optimal retrieval quality, use the CLS token to embed each text portion and use the query prefix below (just on the query).
2942
 
2943
 
2944
 
@@ -2946,8 +2946,8 @@ You can use the transformers package to use an arctic-embed model, as shown belo
2946
  import torch
2947
  from transformers import AutoModel, AutoTokenizer
2948
 
2949
- tokenizer = AutoTokenizer.from_pretrained('Snowflake/arctic-embed-xs')
2950
- model = AutoModel.from_pretrained('Snowflake/arctic-embed-xs', add_pooling_layer=False)
2951
  model.eval()
2952
 
2953
  query_prefix = 'Represent this sentence for searching relevant passages: '
 
6
  - sentence-similarity
7
  - mteb
8
  - arctic
9
+ - snowflake-arctic-embed
10
  model-index:
11
+ - name: snowflake-snowflake-arctic-embed-xs
12
  results:
13
  - task:
14
  type: Classification
 
2822
  ## News
2823
 
2824
 
2825
+ 04/16/2024: Released the **snowflake-arctic-embed** family of text embedding models. The releases are state-of-the-art for retrieval quality at each of their respective size profiles. [Technical Report]() is coming shortly. For more details, please refer to our GitHub: [Arctic-Text-Embed](https://github.com/Snowflake-Labs/snowflake-arctic-embed).
2826
 
2827
 
2828
  ## Models
2829
 
2830
 
2831
+ snowflake-arctic-embed is a suite of text embedding models that focuses on creating high-quality retrieval models optimized for performance.
2832
 
2833
 
2834
+ The `snowflake-arctic-embedding` models achieve **state-of-the-art performance on the MTEB/BEIR leaderboard** for each of their size variants. Evaluation is performed using these [scripts](https://github.com/Snowflake-Labs/snowflake-arctic-embed/tree/main/src). As shown below, each class of model size achieves SOTA retrieval accuracy compared to other top models.
2835
 
2836
 
2837
  The models are trained by leveraging existing open-source text representation models, such as bert-base-uncased, and are trained in a multi-stage pipeline to optimize their retrieval performance. First, the models are trained with large batches of query-document pairs where negatives are derived in-batch. This pretraining stage leverages about 400m samples from a mix of public datasets and proprietary web search data. Following pretraining, the models are further optimized with long training on a smaller dataset (about 1m samples) of triplets of query, positive document, and negative document derived from hard negative mining. Mining of the negatives and data curation are crucial to retrieval accuracy. A detailed technical report will be available shortly.
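
The first training stage described above (query-document pairs with negatives taken from the same batch) can be illustrated with a small sketch. The README does not name the loss function, so the InfoNCE-style cross-entropy below, the `temperature` value, and the function name `info_nce_in_batch` are all assumptions made for illustration, not the authors' training code.

```python
import math

def info_nce_in_batch(query_vecs, doc_vecs, temperature=0.05):
    """Contrastive loss with in-batch negatives (illustrative sketch).

    doc_vecs[i] is treated as the positive for query_vecs[i]; every other
    document in the batch acts as a negative for that query.
    The temperature value is an assumption, not taken from the README.
    """
    assert len(query_vecs) == len(doc_vecs)
    losses = []
    for i, q in enumerate(query_vecs):
        # Similarity of this query to every document in the batch.
        logits = [sum(a * b for a, b in zip(q, d)) / temperature for d in doc_vecs]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching document as the target class.
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)

# Toy batch: each query points in the same direction as its own document.
queries = [[1.0, 0.0], [0.0, 1.0]]
documents = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce_in_batch(queries, documents)
shuffled_loss = info_nce_in_batch(queries, documents[::-1])
```

With correctly paired vectors the loss is close to zero, and it grows when the pairs are shuffled. This is why larger batches help in this setup: each query is contrasted against more negatives at no extra labeling cost.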
 
2839
 
2840
  | Name | MTEB Retrieval Score (NDCG @ 10) | Parameters (Millions) | Embedding Dimension |
2841
  | ----------------------------------------------------------------------- | -------------------------------- | --------------------- | ------------------- |
2842
+ | [snowflake-arctic-embed-xs](https://huggingface.co/Snowflake/snowflake-arctic-embed-xs/) | 50.15 | 22 | 384 |
2843
+ | [snowflake-arctic-embed-s](https://huggingface.co/Snowflake/snowflake-arctic-embed-s/) | 51.98 | 33 | 384 |
2844
+ | [snowflake-arctic-embed-m](https://huggingface.co/Snowflake/snowflake-arctic-embed-m/) | 54.90 | 110 | 768 |
2845
+ | [snowflake-arctic-embed-m-long](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-long/) | 54.83 | 137 | 768 |
2846
+ | [snowflake-arctic-embed-l](https://huggingface.co/Snowflake/snowflake-arctic-embed-l/) | 55.98 | 335 | 1024 |
2847
 
2848
 
2849
+ Aside from being great open-source models, the largest model, [snowflake-arctic-embed-l](https://huggingface.co/Snowflake/snowflake-arctic-embed-l/), can serve as a natural replacement for closed-source embedding, as shown below.
2850
 
2851
 
2852
  | Model Name | MTEB Retrieval Score (NDCG @ 10) |
2853
  | ------------------------------------------------------------------ | -------------------------------- |
2854
+ | [snowflake-arctic-embed-l](https://huggingface.co/Snowflake/snowflake-arctic-embed-l/) | 55.98 |
2855
  | Google-gecko-text-embedding | 55.7 |
2856
  | text-embedding-3-large | 55.44 |
2857
  | Cohere-embed-english-v3.0 | 55.00 |
2858
  | bge-large-en-v1.5 | 54.29 |
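
For readers unfamiliar with the metric in these tables, here is a minimal sketch of NDCG@10 for a single query. It assumes linear gain and graded relevance labels; the function names are hypothetical, and the actual numbers in the tables come from the evaluation scripts linked in the README, which average over many queries and datasets and appear to report the result on a 0-100 scale.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k relevance labels (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, judged_relevances, k=10):
    """NDCG@k for one query.

    ranked_relevances: relevance labels of the retrieved documents, in rank order.
    judged_relevances: relevance labels of all judged documents for the query,
                       used to build the ideal ranking.
    """
    ideal = dcg_at_k(sorted(judged_relevances, reverse=True), k)
    if ideal == 0:
        return 0.0
    return dcg_at_k(ranked_relevances, k) / ideal

# One relevant document, retrieved at rank 2 instead of rank 1.
score_rank_2 = ndcg_at_k([0, 1, 0], [1])
```

A relevant document at rank 1 scores 1.0, while the same document at rank 2 scores about 0.63, so the metric rewards placing relevant documents as high as possible within the top 10.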
2859
 
2860
 
2861
+ ### [snowflake-arctic-embed-xs](https://huggingface.co/Snowflake/snowflake-arctic-embed-xs)
2862
 
2863
 
2864
  This tiny model packs quite the punch. Based on the [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model with only 22m parameters and 384 dimensions, this model should meet even the strictest latency/TCO budgets. Despite its size, its retrieval accuracy is closer to that of models with 100m parameters.
 
2866
 
2867
  | Model Name | MTEB Retrieval Score (NDCG @ 10) |
2868
  | ------------------------------------------------------------------- | -------------------------------- |
2869
+ | [snowflake-arctic-embed-xs](https://huggingface.co/Snowflake/snowflake-arctic-embed-xs/) | 50.15 |
2870
  | GIST-all-MiniLM-L6-v2 | 45.12 |
2871
  | gte-tiny | 44.92 |
2872
  | all-MiniLM-L6-v2 | 41.95 |
2873
  | bge-micro-v2 | 42.56 |
2874
 
2875
 
2876
+ ### [snowflake-arctic-embed-s](https://huggingface.co/Snowflake/snowflake-arctic-embed-s)
2877
 
2878
 
2879
  Based on the [intfloat/e5-small-unsupervised](https://huggingface.co/intfloat/e5-small-unsupervised) model, this small model does not trade off retrieval accuracy for its small size. With only 33m parameters and 384 dimensions, this model should easily allow scaling to large datasets.
 
2881
 
2882
  | Model Name | MTEB Retrieval Score (NDCG @ 10) |
2883
  | ------------------------------------------------------------------ | -------------------------------- |
2884
+ | [snowflake-arctic-embed-s](https://huggingface.co/Snowflake/snowflake-arctic-embed-s/) | 51.98 |
2885
  | bge-small-en-v1.5 | 51.68 |
2886
  | Cohere-embed-english-light-v3.0 | 51.34 |
2887
  | text-embedding-3-small | 51.08 |
2888
  | e5-small-v2 | 49.04 |
2889
 
2890
 
2891
+ ### [snowflake-arctic-embed-m](https://huggingface.co/Snowflake/snowflake-arctic-embed-m/)
2892
 
2893
 
2894
  Based on the [intfloat/e5-base-unsupervised](https://huggingface.co/intfloat/e5-base-unsupervised) model, this medium model is the workhorse that provides the best retrieval performance without slowing down inference.
 
2896
 
2897
  | Model Name | MTEB Retrieval Score (NDCG @ 10) |
2898
  | ------------------------------------------------------------------ | -------------------------------- |
2899
+ | [snowflake-arctic-embed-m](https://huggingface.co/Snowflake/snowflake-arctic-embed-m/) | 54.90 |
2900
  | bge-base-en-v1.5 | 53.25 |
2901
  | nomic-embed-text-v1.5 | 53.25 |
2902
  | GIST-Embedding-v0 | 52.31 |
2903
  | gte-base | 52.31 |
2904
 
2905
+ ### [snowflake-arctic-embed-m-long](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-long/)
2906
 
2907
 
2908
  Based on the [nomic-embed-text-v1-unsupervised](https://huggingface.co/nomic-ai/nomic-embed-text-v1-unsupervised) model, this long-context variant of our medium-sized model is perfect for workloads that can be constrained by the regular 512 token context of our other models. Without the use of RPE, this model supports up to 2048 tokens. With RPE, it can scale to 8192!
 
2910
 
2911
  | Model Name | MTEB Retrieval Score (NDCG @ 10) |
2912
  | ------------------------------------------------------------------ | -------------------------------- |
2913
+ | [snowflake-arctic-embed-m-long](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-long/) | 54.83 |
2914
  | nomic-embed-text-v1.5 | 53.01 |
2915
  | nomic-embed-text-v1 | 52.81 |
2916
 
2917
 
2918
 
2919
 
2920
+ ### [snowflake-arctic-embed-l](https://huggingface.co/Snowflake/snowflake-arctic-embed-l/)
2921
 
2922
 
2923
  Based on the [intfloat/e5-large-unsupervised](https://huggingface.co/intfloat/e5-large-unsupervised) model, this large model is the most accurate in the family, at the cost of the largest size (335m parameters, 1024 dimensions).
 
2925
 
2926
  | Model Name | MTEB Retrieval Score (NDCG @ 10) |
2927
  | ------------------------------------------------------------------ | -------------------------------- |
2928
+ | [snowflake-arctic-embed-l](https://huggingface.co/Snowflake/snowflake-arctic-embed-l/) | 55.98 |
2929
  | UAE-Large-V1 | 54.66 |
2930
  | bge-large-en-v1.5 | 54.29 |
2931
  | mxbai-embed-large-v1 | 54.39 |
 
2938
  ### Using Huggingface transformers
2939
 
2940
 
2941
+ You can use the transformers package to use a snowflake-arctic-embed model, as shown below. For optimal retrieval quality, use the CLS token to embed each text portion and use the query prefix below (just on the query).
2942
 
2943
 
2944
 
 
2946
  import torch
2947
  from transformers import AutoModel, AutoTokenizer
2948
 
2949
+ tokenizer = AutoTokenizer.from_pretrained('Snowflake/snowflake-arctic-embed-xs')
2950
+ model = AutoModel.from_pretrained('Snowflake/snowflake-arctic-embed-xs', add_pooling_layer=False)
2951
  model.eval()
2952
 
2953
  query_prefix = 'Represent this sentence for searching relevant passages: '
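
The hunk ends at the query prefix, so the remaining steps are not visible in this diff. The sketch below illustrates the two instructions from the usage paragraph (prefix only the queries, embed each text with its CLS token) on plain Python lists instead of real model output. The helper names, the L2 normalization, and the dot-product scoring are assumptions added for illustration and are not taken from the README.

```python
import math

QUERY_PREFIX = 'Represent this sentence for searching relevant passages: '

def prepare_inputs(queries, documents):
    """Add the prefix to queries only; documents are embedded unchanged."""
    return [QUERY_PREFIX + q for q in queries], list(documents)

def cls_pool(token_embeddings):
    """CLS pooling: keep the first token's vector of each sequence.

    token_embeddings has shape [batch][seq_len][dim], standing in for the
    last hidden state a transformer encoder would return.
    """
    return [sequence[0] for sequence in token_embeddings]

def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def score(query_vecs, doc_vecs):
    """Cosine similarity matrix (assumed scoring): rows are queries, columns are documents."""
    q = [l2_normalize(v) for v in query_vecs]
    d = [l2_normalize(v) for v in doc_vecs]
    return [[sum(a * b for a, b in zip(qv, dv)) for dv in d] for qv in q]

# Toy example with made-up 2-dimensional "hidden states".
prefixed_queries, docs = prepare_inputs(['what is snowflake?'], ['Snowflake is a data cloud.'])
query_hidden = [[[3.0, 4.0], [9.0, 9.0]]]                        # one query, two tokens
doc_hidden = [[[6.0, 8.0], [1.0, 1.0]], [[-4.0, 3.0], [2.0, 2.0]]]  # two documents
scores = score(cls_pool(query_hidden), cls_pool(doc_hidden))
```

With the real model, the nested lists would be replaced by the last hidden state returned by `model(**tokens)[0]`, and the first position along the sequence axis would be taken in the same way.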