Update README.md
datasets:
- allenai/scirepeval
---

## SPECTER2

<!-- Provide a quick summary of what the model is/does. -->

SPECTER2 is a family of models that succeeds [SPECTER](https://huggingface.co/allenai/specter) and is capable of generating task-specific embeddings for scientific tasks when paired with [adapters](https://huggingface.co/models?search=allenai/specter-2_).
Given the combination of the title and abstract of a scientific paper, or a short textual query, the model can be used to generate effective embeddings to be used in downstream applications.

**Note: For general embedding purposes, please use [allenai/specter2](https://huggingface.co/allenai/specter2).**

**To get the best performance on a downstream task type, please load the associated adapter with the base model as in the example below.**

**Dec 2023 Update:**

Model usage has been updated to be compatible with the latest versions of the transformers and adapters (the newly released update to adapter-transformers) libraries.

**Update:**

This update introduces a new set of SPECTER 2.0 models, with the base transformer encoder pre-trained on an extended citation dataset containing more recent papers.
For benchmarking purposes, please use the existing SPECTER 2.0 models without the **aug2023refresh** suffix, viz. [allenai/specter2_base](https://huggingface.co/allenai/specter2_base).

# Adapter `allenai/specter2_aug2023refresh_adhoc_query` for `allenai/specter2_aug2023refresh_base`

An [adapter](https://adapterhub.ml) for the `allenai/specter2_aug2023refresh_base` model, trained on the [allenai/scirepeval](https://huggingface.co/datasets/allenai/scirepeval/) dataset.

This adapter was created for usage with the **[adapters](https://github.com/adapter-hub/adapters)** library.
## Usage

First, install `adapters`:
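```
pip install adapters
```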

Then load the base model and the adapter:

```python
from transformers import AutoTokenizer
from adapters import AutoAdapterModel

# load the base model and its tokenizer
tokenizer = AutoTokenizer.from_pretrained("allenai/specter2_aug2023refresh_base")
model = AutoAdapterModel.from_pretrained("allenai/specter2_aug2023refresh_base")

# load the adhoc query adapter and activate it
adapter_name = model.load_adapter("allenai/specter2_aug2023refresh_adhoc_query", source="hf", set_active=True)
```

# Model Details

## Model Description

Task Formats trained on:

- Classification
- Regression
- Proximity
- Adhoc Search

**This is the adhoc search query specific adapter. For tasks where papers have to be retrieved for a short textual query, use this adapter to encode the query and [allenai/specter2_proximity](https://huggingface.co/allenai/specter2_proximity) to encode the candidates, as in the example below.**

It builds on the work done in [SciRepEval: A Multi-Format Benchmark for Scientific Document Representations](https://api.semanticscholar.org/CorpusID:254018137) and we evaluate the trained model on this benchmark as well.

- **Developed by:** Amanpreet Singh, Mike D'Arcy, Arman Cohan, Doug Downey, Sergey Feldman
- **Shared by:** Allen AI
- **Model type:** bert-base-uncased + adapters

|Task|Adapter|Description|
|--|--|--|
|Classification|[allenai/specter2_aug2023refresh_classification](https://huggingface.co/allenai/specter2_aug2023refresh_classification)|Encode papers to feed into linear classifiers as features|
|Regression|[allenai/specter2_aug2023refresh_regression](https://huggingface.co/allenai/specter2_aug2023refresh_regression)|Encode papers to feed into linear regressors as features|

*Proximity model should suffice for downstream task types not mentioned above.
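Any of the adapters above can be loaded onto the base model in the same way. For instance, a minimal sketch for the classification adapter, assuming the `model` object created in the example below (the `load_as` label is an arbitrary identifier):

```python
# load the classification adapter and activate it for subsequent forward passes
model.load_adapter("allenai/specter2_aug2023refresh_classification", source="hf",
                   load_as="specter2_classification", set_active=True)
```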

```python
from typing import List

from transformers import AutoTokenizer
from adapters import AutoAdapterModel
from sklearn.metrics.pairwise import euclidean_distances


def embed_input(text_batch: List[str]):
    # preprocess the input
    inputs = tokenizer(text_batch, padding=True, truncation=True,
                       return_tensors="pt", return_token_type_ids=False, max_length=512)
    output = model(**inputs)
    # take the embedding of the first token as the text embedding
    embeddings = output.last_hidden_state[:, 0, :]
    return embeddings


# load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('allenai/specter2_aug2023refresh_base')

# load base model
model = AutoAdapterModel.from_pretrained('allenai/specter2_aug2023refresh_base')

# load the query adapter, provide an identifier for the adapter in the load_as argument and activate it
model.load_adapter("allenai/specter2_aug2023refresh_adhoc_query", source="hf", load_as="specter2_adhoc_query", set_active=True)
query = ["Bidirectional transformers"]
query_embedding = embed_input(query)

# load the proximity adapter, provide an identifier for the adapter in the load_as argument and activate it
model.load_adapter("allenai/specter2_aug2023refresh", source="hf", load_as="specter2_proximity", set_active=True)
papers = [{'title': 'BERT', 'abstract': 'We introduce a new language representation model called BERT'},
          {'title': 'Attention is all you need', 'abstract': 'The dominant sequence transduction models are based on complex recurrent or convolutional neural networks'}]

# concatenate title and abstract with the tokenizer's separator token
text_papers_batch = [d['title'] + tokenizer.sep_token + (d.get('abstract') or '') for d in papers]
paper_embeddings = embed_input(text_papers_batch)

# calculate the L2 distance between the query and the papers
l2_distance = euclidean_distances(paper_embeddings.detach().numpy(), query_embedding.detach().numpy()).flatten()
```
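The resulting distances can be used to rank the candidate papers for the query. A minimal sketch of such a ranking (the `ranked` variable is illustrative, not part of the model card):

```python
# sort papers by ascending L2 distance to the query (closest first)
ranked = sorted(zip(papers, l2_distance), key=lambda pair: pair[1])
print([paper['title'] for paper, _ in ranked])
```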

## Downstream Use