Update README.md
datasets:
- allenai/scirepeval
---

## SPECTER2

<!-- Provide a quick summary of what the model is/does. -->

SPECTER2 is a family of models that succeeds [SPECTER](https://huggingface.co/allenai/specter) and is capable of generating task-specific embeddings for scientific tasks when paired with [adapters](https://huggingface.co/models?search=allenai/specter-2_).
Given the combination of the title and abstract of a scientific paper, or a short textual query, the model can be used to generate effective embeddings to be used in downstream applications.

**Note: For general embedding purposes, please use [allenai/specter2](https://huggingface.co/allenai/specter2).**

**To get the best performance on a downstream task type, please load the associated adapter with the base model as in the example below.**

**Dec 2023 Update:**

Model usage has been updated to be compatible with the latest versions of the transformers and adapters (the newly released update to adapter-transformers) libraries.

**Update:**

This update introduces a new set of SPECTER 2.0 models, with the base transformer encoder pre-trained on an extended citation dataset containing more recent papers.
For benchmarking purposes, please use the existing SPECTER 2.0 models without the **aug2023refresh** suffix, viz. [allenai/specter2_base](https://huggingface.co/allenai/specter2_base).

# Adapter `allenai/specter2_aug2023refresh_adhoc_query` for `allenai/specter2_aug2023refresh_base`

An [adapter](https://adapterhub.ml) for the `allenai/specter2_aug2023refresh_base` model, trained on the [allenai/scirepeval](https://huggingface.co/datasets/allenai/scirepeval/) dataset.

This adapter was created for usage with the **[adapters](https://github.com/adapter-hub/adapters)** library.
## Usage

First, install `adapters`:
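```
pip install adapters
```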

Then load the base model and the adapter:

```python
from transformers import AutoTokenizer
from adapters import AutoAdapterModel

# load the base model and its tokenizer
tokenizer = AutoTokenizer.from_pretrained("allenai/specter2_aug2023refresh_base")
model = AutoAdapterModel.from_pretrained("allenai/specter2_aug2023refresh_base")

# load the adhoc query adapter and activate it
adapter_name = model.load_adapter("allenai/specter2_aug2023refresh_adhoc_query", source="hf", set_active=True)
```

# Model Details

## Model Description

Task Formats trained on:

- Classification
- Regression
- Proximity
- Adhoc Search

**This is the adhoc search query specific adapter. For tasks where papers have to be retrieved for a short textual query, use this adapter to encode the query and [allenai/specter2_proximity](https://huggingface.co/allenai/specter2_proximity) to encode the candidates, as in the example below.**

It builds on the work done in [SciRepEval: A Multi-Format Benchmark for Scientific Document Representations](https://api.semanticscholar.org/CorpusID:254018137) and we evaluate the trained model on this benchmark as well.

- **Developed by:** Amanpreet Singh, Mike D'Arcy, Arman Cohan, Doug Downey, Sergey Feldman
- **Shared by:** Allen AI
- **Model type:** bert-base-uncased + adapters

|Task|Adapter|Description|
|--|--|--|
|Classification|[allenai/specter2_aug2023refresh_classification](https://huggingface.co/allenai/specter2_aug2023refresh_classification)|Encode papers to feed into linear classifiers as features|
|Regression|[allenai/specter2_aug2023refresh_regression](https://huggingface.co/allenai/specter2_aug2023refresh_regression)|Encode papers to feed into linear regressors as features|

*Proximity model should suffice for downstream task types not mentioned above.
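Any of the adapters above can be loaded onto the base model in the same way. For instance, a minimal sketch for the classification adapter, assuming the `model` object created in the example below (the `load_as` label is an arbitrary identifier):

```python
# load the classification adapter and activate it for subsequent forward passes
model.load_adapter("allenai/specter2_aug2023refresh_classification", source="hf",
                   load_as="specter2_classification", set_active=True)
```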

```python
from typing import List

from transformers import AutoTokenizer
from adapters import AutoAdapterModel
from sklearn.metrics.pairwise import euclidean_distances


def embed_input(text_batch: List[str]):
    # preprocess the input
    inputs = tokenizer(text_batch, padding=True, truncation=True,
                       return_tensors="pt", return_token_type_ids=False, max_length=512)
    output = model(**inputs)
    # take the embedding of the first token as the text embedding
    embeddings = output.last_hidden_state[:, 0, :]
    return embeddings


# load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('allenai/specter2_aug2023refresh_base')

# load base model
model = AutoAdapterModel.from_pretrained('allenai/specter2_aug2023refresh_base')

# load the query adapter, provide an identifier for the adapter in the load_as argument and activate it
model.load_adapter("allenai/specter2_aug2023refresh_adhoc_query", source="hf", load_as="specter2_adhoc_query", set_active=True)
query = ["Bidirectional transformers"]
query_embedding = embed_input(query)

# load the proximity adapter, provide an identifier for the adapter in the load_as argument and activate it
model.load_adapter("allenai/specter2_aug2023refresh", source="hf", load_as="specter2_proximity", set_active=True)
papers = [{'title': 'BERT', 'abstract': 'We introduce a new language representation model called BERT'},
          {'title': 'Attention is all you need', 'abstract': 'The dominant sequence transduction models are based on complex recurrent or convolutional neural networks'}]

# concatenate title and abstract with the tokenizer's separator token
text_papers_batch = [d['title'] + tokenizer.sep_token + (d.get('abstract') or '') for d in papers]
paper_embeddings = embed_input(text_papers_batch)

# calculate the L2 distance between the query and the papers
l2_distance = euclidean_distances(paper_embeddings.detach().numpy(), query_embedding.detach().numpy()).flatten()
```
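The resulting distances can be used to rank the candidate papers for the query. A minimal sketch of such a ranking (the `ranked` variable is illustrative, not part of the model card):

```python
# sort papers by ascending L2 distance to the query (closest first)
ranked = sorted(zip(papers, l2_distance), key=lambda pair: pair[1])
print([paper['title'] for paper, _ in ranked])
```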

## Downstream Use