aps6992 committed on
Commit
3c308cd
·
1 Parent(s): b255d24

Update README.md

  1. README.md +45 -32
README.md CHANGED
@@ -6,17 +6,33 @@ datasets:
6
  - allenai/scirepeval
7
  ---
8
 
9
- # Adapter `allenai/specter2_aug2023refresh_adhoc_query` for `allenai/specter2_aug2023refresh_base`
10
 
11
- An [adapter](https://adapterhub.ml) for the `allenai/specter2_aug2023refresh_base` model that was trained on the [allenai/scirepeval](https://huggingface.co/datasets/allenai/scirepeval/) dataset.
12
 
13
- This adapter was created for usage with the **[adapters](https://github.com/adapter-hub/adapters)** library.
14
 
15
  **Dec 2023 Update:**
16
 
17
  Model usage updated to be compatible with latest versions of transformers and adapters (newly released update to adapter-transformers) libraries.
18
 
19
 
20
  ## Usage
21
 
22
  First, install `adapters`:
@@ -35,21 +51,6 @@ model = AutoAdapterModel.from_pretrained("allenai/specter2_aug2023refresh_base")
35
  adapter_name = model.load_adapter("allenai/specter2_aug2023refresh_adhoc_query", source="hf", set_active=True)
36
  ```
37
 
38
-
39
- **\*\*\*\*\*\*Update\*\*\*\*\*\***
40
-
41
- This update introduces a new set of SPECTER 2.0 models with the base transformer encoder pre-trained on an extended citation dataset containing more recent papers.
42
- For benchmarking purposes please use the existing SPECTER 2.0 models w/o the **aug2023refresh** suffix viz. [allenai/specter2_base](https://huggingface.co/allenai/specter2_base).
43
-
44
- # SPECTER 2.0 (Base)
45
- SPECTER 2.0 is the successor to [SPECTER](https://huggingface.co/allenai/specter) and is capable of generating task-specific embeddings for scientific tasks when paired with [adapters](https://huggingface.co/models?search=allenai/specter-2_).
46
- This is the base model to be used along with the adapters.
47
- Given the combination of title and abstract of a scientific paper or a short textual query, the model can be used to generate effective embeddings to be used in downstream applications.
48
-
49
- **Note: For general embedding purposes, please use [allenai/specter2](https://huggingface.co/allenai/specter2).**
50
-
51
- **To get the best performance on a downstream task type please load the associated adapter with the base model as in the example below.**
52
-
53
  # Model Details
54
 
55
  ## Model Description
@@ -63,11 +64,11 @@ Task Formats trained on:
63
  - Proximity
64
  - Adhoc Search
65
 
 
66
 
67
  It builds on the work done in [SciRepEval: A Multi-Format Benchmark for Scientific Document Representations](https://api.semanticscholar.org/CorpusID:254018137) and we evaluate the trained model on this benchmark as well.
68
 
69
 
70
-
71
  - **Developed by:** Amanpreet Singh, Mike D'Arcy, Arman Cohan, Doug Downey, Sergey Feldman
72
  - **Shared by :** Allen AI
73
  - **Model type:** bert-base-uncased + adapters
@@ -95,32 +96,44 @@ It builds on the work done in [SciRepEval: A Multi-Format Benchmark for Scientif
95
  |Classification|[allenai/specter2_aug2023refresh_classification](https://huggingface.co/allenai/specter2_aug2023refresh_classification)|Encode papers to feed into linear classifiers as features|
96
  |Regression|[allenai/specter2_aug2023refresh_regression](https://huggingface.co/allenai/specter2_aug2023refresh_regression)|Encode papers to feed into linear regressors as features|
97
 
98
- *Retrieval model should suffice for downstream task types not mentioned above
99
 
100
  ```python
101
  from transformers import AutoTokenizer
102
  from adapters import AutoAdapterModel
103
 
104
  # load model and tokenizer
105
- tokenizer = AutoTokenizer.from_pretrained('allenai/specter2_aug2023refresh_base')
106
 
107
  #load base model
108
- model = AutoAdapterModel.from_pretrained('allenai/specter2_aug2023refresh_base')
109
 
110
- #load the adapter(s) as per the required task, provide an identifier for the adapter in load_as argument and activate it
111
- model.load_adapter("allenai/specter2_aug2023refresh_adhoc_query", source="hf", load_as="specter2_adhoc_query", set_active=True)
 
 
112
 
 
 
113
  papers = [{'title': 'BERT', 'abstract': 'We introduce a new language representation model called BERT'},
114
  {'title': 'Attention is all you need', 'abstract': ' The dominant sequence transduction models are based on complex recurrent or convolutional neural networks'}]
115
-
116
  # concatenate title and abstract
117
- text_batch = [d['title'] + tokenizer.sep_token + (d.get('abstract') or '') for d in papers]
118
- # preprocess the input
119
- inputs = tokenizer(text_batch, padding=True, truncation=True,
120
- return_tensors="pt", return_token_type_ids=False, max_length=512)
121
- output = model(**inputs)
122
- # take the first token in the batch as the embedding
123
- embeddings = output.last_hidden_state[:, 0, :]
124
  ```
125
 
126
  ## Downstream Use
 
6
  - allenai/scirepeval
7
  ---
8
 
9
+ ## SPECTER2
10
 
11
+ <!-- Provide a quick summary of what the model is/does. -->
12
 
13
+ SPECTER2 is a family of models that succeeds [SPECTER](https://huggingface.co/allenai/specter) and is capable of generating task-specific embeddings for scientific tasks when paired with [adapters](https://huggingface.co/models?search=allenai/specter-2_).
14
+ Given the combination of title and abstract of a scientific paper or a short textual query, the model can be used to generate effective embeddings to be used in downstream applications.
15
+
16
+ **Note: For general embedding purposes, please use [allenai/specter2](https://huggingface.co/allenai/specter2).**
17
+
18
+ **To get the best performance on a downstream task type, please load the associated adapter with the base model as in the example below.**
19
 
20
  **Dec 2023 Update:**
21
 
22
  Model usage updated to be compatible with latest versions of transformers and adapters (newly released update to adapter-transformers) libraries.
23
 
24
 
25
+ **\*\*\*\*\*\*Update\*\*\*\*\*\***
26
+
27
+ This update introduces a new set of SPECTER 2.0 models with the base transformer encoder pre-trained on an extended citation dataset containing more recent papers.
28
+ For benchmarking purposes, please use the existing SPECTER 2.0 models without the **aug2023refresh** suffix, namely [allenai/specter2_base](https://huggingface.co/allenai/specter2_base).
29
+
30
+ # Adapter `allenai/specter2_aug2023refresh_adhoc_query` for `allenai/specter2_aug2023refresh_base`
31
+
32
+ An [adapter](https://adapterhub.ml) for the `allenai/specter2_aug2023refresh_base` model that was trained on the [allenai/scirepeval](https://huggingface.co/datasets/allenai/scirepeval/) dataset.
33
+
34
+ This adapter was created for usage with the **[adapters](https://github.com/adapter-hub/adapters)** library.
35
+
36
  ## Usage
37
 
38
  First, install `adapters`:
 
51
  adapter_name = model.load_adapter("allenai/specter2_aug2023refresh_adhoc_query", source="hf", set_active=True)
52
  ```
53
 
54
  # Model Details
55
 
56
  ## Model Description
 
64
  - Proximity
65
  - Adhoc Search
66
 
67
+ **This adapter is specific to adhoc search queries. For tasks where papers have to be retrieved for a short textual query, use this adapter to encode the query and [allenai/specter2_proximity](https://huggingface.co/allenai/specter2_proximity) to encode the candidates.**
68
 
69
  It builds on the work done in [SciRepEval: A Multi-Format Benchmark for Scientific Document Representations](https://api.semanticscholar.org/CorpusID:254018137) and we evaluate the trained model on this benchmark as well.
70
 
71
 
 
72
  - **Developed by:** Amanpreet Singh, Mike D'Arcy, Arman Cohan, Doug Downey, Sergey Feldman
73
  - **Shared by :** Allen AI
74
  - **Model type:** bert-base-uncased + adapters
 
96
  |Classification|[allenai/specter2_aug2023refresh_classification](https://huggingface.co/allenai/specter2_aug2023refresh_classification)|Encode papers to feed into linear classifiers as features|
97
  |Regression|[allenai/specter2_aug2023refresh_regression](https://huggingface.co/allenai/specter2_aug2023refresh_regression)|Encode papers to feed into linear regressors as features|
98
 
99
+ *The proximity model should suffice for downstream task types not mentioned above.
100
 
101
  ```python
102
  from transformers import AutoTokenizer
103
  from adapters import AutoAdapterModel
104
+ from sklearn.metrics.pairwise import euclidean_distances
105
+
106
+ def embed_input(text_batch: list[str]):
107
+     # preprocess the input (uses the module-level tokenizer and model loaded below)
108
+     inputs = tokenizer(text_batch, padding=True, truncation=True,
109
+                        return_tensors="pt", return_token_type_ids=False, max_length=512)
110
+     output = model(**inputs)
111
+     # take the first token in the batch as the embedding
112
+     embeddings = output.last_hidden_state[:, 0, :]
113
+     return embeddings
114
 
115
  # load model and tokenizer
116
+ tokenizer = AutoTokenizer.from_pretrained('allenai/specter2_aug2023refresh_base')
117
 
118
  #load base model
119
+ model = AutoAdapterModel.from_pretrained('allenai/specter2_aug2023refresh_base')
120
 
121
+ #load the query adapter, provide an identifier for the adapter in load_as argument and activate it
122
+ model.load_adapter("allenai/specter2_aug2023refresh_adhoc_query", source="hf", load_as="specter2_adhoc_query", set_active=True)
123
+ query = ["Bidirectional transformers"]
124
+ query_embedding = embed_input(query)
125
 
126
+ #load the proximity adapter, provide an identifier for the adapter in load_as argument and activate it
127
+ model.load_adapter("allenai/specter2_aug2023refresh", source="hf", load_as="specter2_proximity", set_active=True)
128
  papers = [{'title': 'BERT', 'abstract': 'We introduce a new language representation model called BERT'},
129
  {'title': 'Attention is all you need', 'abstract': ' The dominant sequence transduction models are based on complex recurrent or convolutional neural networks'}]
 
130
  # concatenate title and abstract
131
+ text_papers_batch = [d['title'] + tokenizer.sep_token + (d.get('abstract') or '') for d in papers]
132
+ paper_embeddings = embed_input(text_papers_batch)
133
+
134
+ #Calculate L2 distance between query and papers
135
+ l2_distance = euclidean_distances(paper_embeddings.detach().numpy(), query_embedding.detach().numpy()).flatten()
136
+
 
137
  ```
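
The final step of the example above ranks candidate papers by their L2 distance to the query embedding. A minimal pure-Python sketch of that ranking step follows, assuming the embeddings have already been computed and converted to plain lists of floats. The helper names `l2_distance` and `rank_by_l2` and the 3-dimensional vectors are hypothetical stand-ins for illustration, not part of the model card or of any library.

```python
import math
from typing import List, Sequence, Tuple


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean (L2) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimensionality")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_by_l2(
    query: Sequence[float], candidates: Sequence[Sequence[float]]
) -> List[Tuple[int, float]]:
    """Return (candidate index, distance) pairs, closest candidate first."""
    distances = [(i, l2_distance(query, c)) for i, c in enumerate(candidates)]
    return sorted(distances, key=lambda pair: pair[1])


# Hypothetical 3-dimensional stand-ins for real query and paper embeddings.
query_embedding = [0.0, 1.0, 0.0]
paper_embeddings = [
    [3.0, 1.0, 4.0],  # sqrt(9 + 0 + 16) = 5.0 away from the query
    [0.0, 1.0, 1.0],  # sqrt(0 + 0 + 1) = 1.0 away from the query
]

ranking = rank_by_l2(query_embedding, paper_embeddings)
print(ranking)  # [(1, 1.0), (0, 5.0)]
```

A smaller distance means a closer match, so the first entry of `ranking` is the best candidate for the query.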
138
 
139
  ## Downstream Use