Integrate Sentence Transformers, prevent manual tokenizer EOS

#1
by tomaarsen - opened
1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
+{
+  "word_embedding_dimension": 1024,
+  "pooling_mode_cls_token": false,
+  "pooling_mode_mean_tokens": true,
+  "pooling_mode_max_tokens": false,
+  "pooling_mode_mean_sqrt_len_tokens": false,
+  "pooling_mode_weightedmean_tokens": false,
+  "pooling_mode_lasttoken": false,
+  "include_prompt": true
+}
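
This pooling config drives the `1_Pooling` module of the Sentence Transformers pipeline: mean pooling over the 1024-dimensional token embeddings, with prompt tokens included in the average. A minimal sketch of the module it describes, constructed by hand purely for illustration (not part of this PR):

```python
from sentence_transformers import models

# Mean pooling over 1024-dim token embeddings, keeping prompt tokens in the
# average, mirroring the settings in 1_Pooling/config.json.
pooling = models.Pooling(
    word_embedding_dimension=1024,
    pooling_mode_mean_tokens=True,
    include_prompt=True,
)
print(pooling.get_pooling_mode_str())  # -> "mean"
```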
README.md CHANGED
@@ -23,6 +23,8 @@ language:
 - yo
 pipeline_tag: sentence-similarity
 library_name: transformers
+tags:
+- sentence-transformers
 ---
 
 # DRAMA-large (0.3B): Diverse Augmentation from Large Language Models to Smaller Dense Retrievers
@@ -36,7 +38,10 @@ Please check our [paper](https://arxiv.org/abs/2502.18460) for the detials.
 
 ## Usage
 
-Below is an example using `drama-large` to encode query and document examples from the MIRACL dataset:
+Below is an example using `drama-large` to encode query and document examples from the MIRACL dataset, using either Transformers or Sentence Transformers:
+
+### Transformers
+
 ```python
 import torch
 from transformers import AutoTokenizer, AutoModel
@@ -62,10 +67,8 @@ doc_embs = model.encode_documents(tokenizer, documents)
 scores = query_embs @ doc_embs.T
 print(scores.tolist())
 # Expected output: [[0.5429, 0.1109], [0.1317, 0.6074]]
-
 ```
 
-
 > The `trust_remote_code` will use our customized `drama_modeling.py` with two details:
 >- We use bi-directional attention instead of uni-directional attention
 >- We add `"Query: "` as prefix for query text. (No prefix added to document)
@@ -81,6 +84,58 @@ print(scores.tolist())
 # Expected output: [[0.6239, 0.2294], [0.2604, 0.6942]]
 ```
 
+### Sentence Transformers
+
+```python
+from sentence_transformers import SentenceTransformer
+
+queries = [
+    'What percentage of the Earth\'s atmosphere is oxygen?',
+    '意大利首都是哪里?',
+]
+documents = [
+    "The amount of oxygen in the atmosphere has fluctuated over the last 600 million years, reaching a peak of 35% during the Carboniferous period, significantly higher than today's 21%.",
+    "羅馬是欧洲国家意大利首都和罗马首都广域市的首府及意大利全国的政治、经济、文化和交通中心,位于意大利半島中部的台伯河下游平原地,建城初期在七座小山丘上,故又名“七丘之城”。按城市范围内的人口计算,罗马是意大利人口最多的城市,也是欧盟人口第三多的城市。",
+]
+
+model = SentenceTransformer("facebook/drama-large", trust_remote_code=True)
+
+query_embs = model.encode(queries, prompt_name="query")
+doc_embs = model.encode(documents)
+
+scores = model.similarity(query_embs, doc_embs)
+print(scores.tolist())
+# Expected output: [[0.5429, 0.1109], [0.1317, 0.6074]]
+```
+
+>- The `trust_remote_code` will use our customized `drama_modeling.py` which uses bi-directional attention instead of uni-directional attention.
+>- For queries, you have to use `prompt_name="query"` to select the [prompt called "query"](config_sentence_transformers.json), or `prompt="Query: "` to specify the prompt string manually.
+
+DRAMA models are trained using Matryoshka Representation Learning ([MRL](https://github.com/RAIVNLab/MRL)) to support flexible dimensionality. Both queries and documents can be encoded into smaller dimensions, such as 256, using the following:
+
+```python
+from sentence_transformers import SentenceTransformer
+
+queries = [
+    'What percentage of the Earth\'s atmosphere is oxygen?',
+    '意大利首都是哪里?',
+]
+documents = [
+    "The amount of oxygen in the atmosphere has fluctuated over the last 600 million years, reaching a peak of 35% during the Carboniferous period, significantly higher than today's 21%.",
+    "羅馬是欧洲国家意大利首都和罗马首都广域市的首府及意大利全国的政治、经济、文化和交通中心,位于意大利半島中部的台伯河下游平原地,建城初期在七座小山丘上,故又名“七丘之城”。按城市范围内的人口计算,罗马是意大利人口最多的城市,也是欧盟人口第三多的城市。",
+]
+
+model = SentenceTransformer("facebook/drama-large", truncate_dim=256, trust_remote_code=True)
+
+query_embs = model.encode(queries, prompt_name="query")
+doc_embs = model.encode(documents)
+
+scores = model.similarity(query_embs, doc_embs)
+print(scores.tolist())
+# Expected output: [[0.6239, 0.2294], [0.2604, 0.6942]]
+```
+
+
 ## Evaluation
 
 The model has been evaluated on multiple retrieval benchmarks, including [BEIR](https://github.com/beir-cellar/beir), [MIRACL](https://github.com/project-miracl/miracl), [MLDR](https://huggingface.co/datasets/Shitao/MLDR), and several multilingual retrieval tasks in [MTEB](https://github.com/embeddings-benchmark/mteb).
config_sentence_transformers.json ADDED
@@ -0,0 +1,12 @@
+{
+  "__version__": {
+    "sentence_transformers": "3.4.0",
+    "transformers": "4.48.3",
+    "pytorch": "2.5.0+cu121"
+  },
+  "prompts": {
+    "query": "Query: "
+  },
+  "default_prompt_name": null,
+  "similarity_fn_name": "cosine"
+}
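
The `prompts` entry is what `prompt_name="query"` resolves to at encode time, and `similarity_fn_name` selects cosine similarity for `model.similarity`. A small sketch of the equivalence the README notes describe (illustrative only, assuming the files from this PR are in place):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("facebook/drama-large", trust_remote_code=True)

# Both calls prepend "Query: " before tokenization, so the embeddings should
# match; documents are encoded without any prompt.
via_name = model.encode(["What is the capital of Italy?"], prompt_name="query")
via_string = model.encode(["What is the capital of Italy?"], prompt="Query: ")
```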
modeling_drama.py CHANGED
@@ -72,27 +72,16 @@ class DramaModel(LlamaModel):
         max_seq_len = self.max_seq_len
         tokenized = tokenizer(
             texts,
-            padding=False,
+            padding=True,
             truncation=True,
-            max_length=max_seq_len - 1,
-            return_attention_mask=False,
-            return_token_type_ids=False,
-            add_special_tokens=True
-        )
-        tokenized['input_ids'] = [
-            t + [tokenizer.eos_token_id] for t in tokenized['input_ids']
-        ]
-        tokenized = tokenizer.pad(
-            tokenized,
-            padding=True,
-            return_attention_mask=True,
+            max_length=max_seq_len,
             return_tensors='pt',
         ).to(self.device)
         return tokenized
 
-    def forward(self, input_ids, attention_mask, dim, *args, **kwargs):
+    def encode(self, input_ids, attention_mask, dim, *args, **kwargs):
         """
-        Forward pass through the model.
+        Pass through the model and compute normalized embeddings.
 
         Args:
             input_ids (torch.Tensor): Input token IDs.
@@ -102,7 +91,7 @@ class DramaModel(LlamaModel):
         Returns:
             torch.Tensor: Normalized output embeddings.
         """
-        outputs = super().forward(
+        outputs = self.forward(
             input_ids, attention_mask, *args, **kwargs
         )
         embeddings = self._average_pool(
@@ -141,7 +130,7 @@ class DramaModel(LlamaModel):
             raise ValueError(f"dim must be in range [1, {self.hidden_size}].")
         queries = [self.query_prefix + query for query in queries]
         tokenized_queries = self._tokenize(tokenizer, queries, max_seq_len)
-        embeddings = self(**tokenized_queries, dim=dim)
+        embeddings = self.encode(**tokenized_queries, dim=dim)
        return embeddings
 
     def encode_documents(
@@ -172,5 +161,6 @@ class DramaModel(LlamaModel):
         if dim is not None and (dim < 1 or dim > self.hidden_size):
             raise ValueError(f"dim must be in range [1, {self.hidden_size}].")
         tokenized_documents = self._tokenize(tokenizer, documents, max_seq_len)
-        embeddings = self(**tokenized_documents, dim=dim)
+        embeddings = self.encode(**tokenized_documents, dim=dim)
         return embeddings
+
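
The net effect of this change is that `_tokenize` no longer truncates to `max_seq_len - 1` and appends `tokenizer.eos_token_id` by hand; the updated tokenizer files are expected to add the EOS token themselves. A quick sanity check one could run after merging (illustrative, not part of the diff):

```python
from transformers import AutoTokenizer

# With the updated tokenizer, EOS should already be appended during
# tokenization, so the modeling code no longer needs to add it manually.
tokenizer = AutoTokenizer.from_pretrained("facebook/drama-large")
ids = tokenizer("Query: What is the capital of Italy?")["input_ids"]
print(ids[-1] == tokenizer.eos_token_id)  # expected: True
```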
modules.json ADDED
@@ -0,0 +1,20 @@
+[
+  {
+    "idx": 0,
+    "name": "0",
+    "path": "",
+    "type": "sentence_transformers.models.Transformer"
+  },
+  {
+    "idx": 1,
+    "name": "1",
+    "path": "1_Pooling",
+    "type": "sentence_transformers.models.Pooling"
+  },
+  {
+    "idx": 2,
+    "name": "2",
+    "path": "2_Normalize",
+    "type": "sentence_transformers.models.Normalize"
+  }
+]
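
`modules.json` wires up the three-stage Sentence Transformers pipeline: the Transformer backbone, mean pooling from `1_Pooling`, and L2 normalization from `2_Normalize`. A minimal sketch of what the last two stages compute, on made-up tensors for illustration:

```python
import torch

# Token embeddings from the Transformer module: (batch, tokens, 1024)
token_embeddings = torch.randn(2, 5, 1024)
attention_mask = torch.ones(2, 5)

# 1_Pooling: masked mean over the token axis
mask = attention_mask.unsqueeze(-1)
pooled = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)

# 2_Normalize: L2-normalize, so a dot product between embeddings is a cosine similarity
sentence_embeddings = torch.nn.functional.normalize(pooled, p=2, dim=1)
print(sentence_embeddings.shape)  # torch.Size([2, 1024])
```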
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+{
+  "max_seq_length": 8192,
+  "do_lower_case": false
+}
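
This sets the maximum sequence length that the Transformer module truncates inputs to. One way to confirm it after loading the model (illustrative):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("facebook/drama-large", trust_remote_code=True)
print(model.max_seq_length)  # expected: 8192
```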
special_tokens_map.json CHANGED
@@ -12,5 +12,12 @@
     "normalized": false,
     "rstrip": false,
     "single_word": false
+  },
+  "pad_token": {
+    "content": "<|end_of_text|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
   }
 }
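
Registering `<|end_of_text|>` as the pad token is what lets the simplified `_tokenize` call pass `padding=True` directly instead of padding in a second step. A short check (illustrative, assuming the updated files):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/drama-large")
print(tokenizer.pad_token)  # expected: <|end_of_text|>

# Batch padding now works in a single tokenizer call.
batch = tokenizer(
    ["a short query", "a somewhat longer query about Rome"],
    padding=True,
    return_tensors="pt",
)
print(batch["attention_mask"].tolist())
```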
tokenizer.json CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:6b9e4e7fb171f92fd137b777cc2714bf87d11576700a1dcd7a399e7bbe39537b
-size 17209920
+oid sha256:6c18e1797510535655f962df0669fcb7d10b325b5d0eb4b51be36789dcf5fcaf
+size 17210533