zpn committed a6571e5 (verified) · 1 parent: 0759316

Update README.md

  1. README.md +62 -6
README.md CHANGED
@@ -2663,14 +2663,70 @@ Training data to train the models is released in its entirety. For more details,
2663
 
2664
  ## Usage
2665
 
2666
- Note `nomic-embed-text` *requires* prefixes! We support the prefixes `[search_query, search_document, classification, clustering]`.
2667
- For retrieval applications, you should prepend `search_document` for all your documents and `search_query` for your queries.
2668
 
2669
- For example, you are building a RAG application over the top of Wikipedia. You would embed all Wikipedia articles with the prefix `search_document`
2670
- and any questions you ask with `search_query`. For example:
2671
  ```python
2672
- queries = ["search_query: who is the first president of the united states?", "search_query: when was babe ruth born?"]
2673
- documents = ["search_document: <article about US Presidents>", "search_document: <article about Babe Ruth>"]
2674
  ```
2675
 
2676
  ### Sentence Transformers
 
2663
 
2664
  ## Usage
2665
 
2666
+ **Important**: the text prompt *must* include a *task instruction prefix*, instructing the model which task is being performed.
2667
+
2668
+ For example, if you are implementing a RAG application, you embed your documents as `search_document: <text here>` and embed your user queries as `search_query: <text here>`.
2669
+
2670
+ ## Task instruction prefixes
2671
+
2672
+ ### `search_document`
2673
+
2674
+ #### Purpose: embed texts as documents from a dataset
2675
+
2676
+ This prefix is used for embedding texts as documents, for example as documents for a RAG index.
2677
+
2678
+ ```python
2679
+ from sentence_transformers import SentenceTransformer
2680
+
2681
+ model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)
2682
+ sentences = ['search_document: TSNE is a dimensionality reduction algorithm created by Laurens van der Maaten']
2683
+ embeddings = model.encode(sentences)
2684
+ print(embeddings)
2685
+ ```
2686
+
2687
+ ### `search_query`
2688
+
2689
+ #### Purpose: embed texts as questions to answer
2690
+
2691
+ This prefix is used for embedding texts as questions that documents from a dataset could answer, for example as queries submitted to a RAG application.
2692
 
2693
  ```python
2694
+ from sentence_transformers import SentenceTransformer
2695
+
2696
+ model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)
2697
+ sentences = ['search_query: Who is Laurens van der Maaten?']
2698
+ embeddings = model.encode(sentences)
2699
+ print(embeddings)
2700
+ ```
2701
+
2702
+ ### `clustering`
2703
+
2704
+ #### Purpose: embed texts to group them into clusters
2705
+
2706
+ This prefix is used for embedding texts in order to group them into clusters, discover common topics, or remove semantic duplicates.
2707
+
2708
+ ```python
2709
+ from sentence_transformers import SentenceTransformer
2710
+
2711
+ model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)
2712
+ sentences = ['clustering: the quick brown fox']
2713
+ embeddings = model.encode(sentences)
2714
+ print(embeddings)
2715
+ ```
2716
+
2717
+ ### `classification`
2718
+
2719
+ #### Purpose: embed texts to classify them
2720
+
2721
+ This prefix is used for embedding texts into vectors that will be used as features for a classification model.
2722
+
2723
+ ```python
2724
+ from sentence_transformers import SentenceTransformer
2725
+
2726
+ model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)
2727
+ sentences = ['classification: the quick brown fox']
2728
+ embeddings = model.encode(sentences)
2729
+ print(embeddings)
2730
  ```
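
Editor's note (not part of the commit): the four examples above all build the prefixed string by hand. A minimal sketch of a helper that does this and rejects unknown prefixes might look like the following. The name `add_task_prefix` and the `TASK_PREFIXES` tuple are hypothetical, introduced here for illustration only; they are not part of `sentence_transformers` or of the model repository.

```python
from typing import Iterable, List

# The four task instruction prefixes named in the README diff above.
TASK_PREFIXES = ("search_document", "search_query", "clustering", "classification")


def add_task_prefix(task: str, texts: Iterable[str]) -> List[str]:
    """Prepend '<task>: ' to each text (hypothetical helper, not a library API)."""
    if task not in TASK_PREFIXES:
        raise ValueError(
            f"unknown task prefix {task!r}; expected one of {TASK_PREFIXES}"
        )
    return [f"{task}: {text}" for text in texts]


# Build the inputs for a retrieval setup: documents and queries get different prefixes.
documents = add_task_prefix("search_document", ["TSNE is a dimensionality reduction algorithm"])
queries = add_task_prefix("search_query", ["What is TSNE?"])
print(documents)
print(queries)
```

The returned lists can then be passed to `model.encode(...)` exactly as in the examples above. Validating the prefix up front is a design choice for this sketch, so that a misspelled task name fails with an error rather than being embedded as ordinary text.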
2731
 
2732
  ### Sentence Transformers