--- tags: - bert - adapter-transformers datasets: - allenai/scirepeval --- ## SPECTER2 SPECTER2 is a family of models that succeeds [SPECTER](https://huggingface.co/allenai/specter) and is capable of generating task specific embeddings for scientific tasks when paired with [adapters](https://huggingface.co/models?search=allenai/specter-2_). Given the combination of title and abstract of a scientific paper or a short texual query, the model can be used to generate effective embeddings to be used in downstream applications. **Note:For general embedding purposes, please use [allenai/specter2](https://huggingface.co/allenai/specter2).** **To get the best performance on a downstream task type please load the associated adapter () with the base model as in the example below.** **Dec 2023 Update:** Model usage updated to be compatible with latest versions of transformers and adapters (newly released update to adapter-transformers) libraries. **\*\*\*\*\*\*Update\*\*\*\*\*\*** This update introduces a new set of SPECTER 2.0 models with the base transformer encoder pre-trained on an extended citation dataset containing more recent papers. For benchmarking purposes please use the existing SPECTER 2.0 models w/o the **aug2023refresh** suffix viz. [allenai/specter2_base](https://huggingface.co/allenai/specter2_base). # Adapter `allenai/specter2_aug2023refresh_adhoc_query` for `allenai/specter2_aug2023refresh_base` An [adapter](https://adapterhub.ml) for the `None` model that was trained on the [allenai/scirepeval](https://huggingface.co/datasets/allenai/scirepeval/) dataset. This adapter was created for usage with the **[adapters](https://github.com/adapter-hub/adapters)** library. ## Usage First, install `adapters`: ``` pip install -U adapters ``` _Note: adapters is built as an add-on to transformers that acts as a drop-in replacement with adapter support. [More](https://docs.adapterhub.ml)_ Now, the adapter can be loaded and activated like this: ```python from adapters import AutoAdapterModel model = AutoAdapterModel.from_pretrained("allenai/specter2_aug2023refresh_base") adapter_name = model.load_adapter("allenai/specter2_aug2023refresh_adhoc_query", source="hf", set_active=True) ``` # Model Details ## Model Description SPECTER 2.0 has been trained on over 6M triplets of scientific paper citations, which are available [here](https://huggingface.co/datasets/allenai/scirepeval/viewer/cite_prediction_new/evaluation). Post that it is trained with additionally attached task format specific adapter modules on all the [SciRepEval](https://huggingface.co/datasets/allenai/scirepeval) training tasks. Task Formats trained on: - Classification - Regression - Proximity - Adhoc Search **This is the adhoc search query specific adapter. For tasks where papers have to retrieved for a short textual query, use this adapter to encode the query and [allenai/specter2_proximity](https://huggingface.co/allenai/specter2_proximity) to encode the candidates.** It builds on the work done in [SciRepEval: A Multi-Format Benchmark for Scientific Document Representations](https://api.semanticscholar.org/CorpusID:254018137) and we evaluate the trained model on this benchmark as well. - **Developed by:** Amanpreet Singh, Mike D'Arcy, Arman Cohan, Doug Downey, Sergey Feldman - **Shared by :** Allen AI - **Model type:** bert-base-uncased + adapters - **License:** Apache 2.0 - **Finetuned from model:** [allenai/scibert](https://huggingface.co/allenai/scibert_scivocab_uncased). ## Model Sources - **Repository:** [https://github.com/allenai/SPECTER2_0](https://github.com/allenai/SPECTER2_0) - **Paper:** [https://api.semanticscholar.org/CorpusID:254018137](https://api.semanticscholar.org/CorpusID:254018137) - **Demo:** [Usage](https://github.com/allenai/SPECTER2_0/blob/main/README.md) # Uses ## Direct Use |Model|Name and HF link|Description| |--|--|--| |Proximity*|[allenai/specter2_aug2023refresh](https://huggingface.co/allenai/specter2_aug2023refresh)|Encode papers as queries and candidates eg. Link Prediction, Nearest Neighbor Search| |Adhoc Query|[allenai/specter2_aug2023refresh_adhoc_query](https://huggingface.co/allenai/specter2_aug2023refresh_adhoc_query)|Encode short raw text queries for search tasks. (Candidate papers can be encoded with the proximity adapter)| |Classification|[allenai/specter2_aug2023refresh_classification](https://huggingface.co/allenai/specter2_aug2023refresh_classification)|Encode papers to feed into linear classifiers as features| |Regression|[allenai/specter2_aug2023refresh_regression](https://huggingface.co/allenai/specter2_aug2023refresh_regression)|Encode papers to feed into linear regressors as features| *Proximity model should suffice for downstream task types not mentioned above ```python from transformers import AutoTokenizer from adapters import AutoAdapterModel from sklearn.metrics.pairwise import euclidean_distances def embed_input(text_batch: List[str]): # preprocess the input inputs = self.tokenizer(text_batch, padding=True, truncation=True, return_tensors="pt", return_token_type_ids=False, max_length=512) output = model(**inputs) # take the first token in the batch as the embedding embeddings = output.last_hidden_state[:, 0, :] return embeddings # load model and tokenizer tokenizer = AutoTokenizer.from_pretrained('allenai/specter2_aug2023refresh_base_base') #load base model model = AutoAdapterModel.from_pretrained('allenai/specter2_aug2023refresh_base_aug2023refresh_base_base') #load the query adapter, provide an identifier for the adapter in load_as argument and activate it model.load_adapter("allenai/specter2_aug2023refresh_base_adhoc_query", source="hf", load_as="specter2_adhoc_query", set_active=True) query = ["Bidirectional transformers"] query_embedding = embed_input(query) #load the proximity adapter, provide an identifier for the adapter in load_as argument and activate it model.load_adapter("allenai/specter2_aug2023refresh_base", source="hf", load_as="specter2_proximity", set_active=True) papers = [{'title': 'BERT', 'abstract': 'We introduce a new language representation model called BERT'}, {'title': 'Attention is all you need', 'abstract': ' The dominant sequence transduction models are based on complex recurrent or convolutional neural networks'}] # concatenate title and abstract text_papers_batch = [d['title'] + tokenizer.sep_token + (d.get('abstract') or '') for d in papers] paper_embeddings = embed_input(text_papers_batch) #Calculate L2 distance between query and papers l2_distance = euclidean_distances(papers, query).flatten() ``` ## Downstream Use For evaluation and downstream usage, please refer to [https://github.com/allenai/scirepeval/blob/main/evaluation/INFERENCE.md](https://github.com/allenai/scirepeval/blob/main/evaluation/INFERENCE.md). # Training Details ## Training Data The base model is trained on citation links between papers and the adapters are trained on 8 large scale tasks across the four formats. All the data is a part of SciRepEval benchmark and is available [here](https://huggingface.co/datasets/allenai/scirepeval). The citation link are triplets in the form ```json {"query": {"title": ..., "abstract": ...}, "pos": {"title": ..., "abstract": ...}, "neg": {"title": ..., "abstract": ...}} ``` consisting of a query paper, a positive citation and a negative which can be from the same/different field of study as the query or citation of a citation. ## Training Procedure Please refer to the [SPECTER paper](https://api.semanticscholar.org/CorpusID:215768677). ### Training Hyperparameters The model is trained in two stages using [SciRepEval](https://github.com/allenai/scirepeval/blob/main/training/TRAINING.md): - Base Model: First a base model is trained on the above citation triplets. ``` batch size = 1024, max input length = 512, learning rate = 2e-5, epochs = 2 warmup steps = 10% fp16``` - Adapters: Thereafter, task format specific adapters are trained on the SciRepEval training tasks, where 600K triplets are sampled from above and added to the training data as well. ``` batch size = 256, max input length = 512, learning rate = 1e-4, epochs = 6 warmup = 1000 steps fp16``` # Evaluation We evaluate the model on [SciRepEval](https://github.com/allenai/scirepeval), a large scale eval benchmark for scientific embedding tasks which which has [SciDocs] as a subset. We also evaluate and establish a new SoTA on [MDCR](https://github.com/zoranmedic/mdcr), a large scale citation recommendation benchmark. |Model|SciRepEval In-Train|SciRepEval Out-of-Train|SciRepEval Avg|MDCR(MAP, Recall@5)| |--|--|--|--|--| |[BM-25](https://api.semanticscholar.org/CorpusID:252199740)|n/a|n/a|n/a|(33.7, 28.5)| |[SPECTER](https://huggingface.co/allenai/specter)|54.7|57.4|68.0|(30.6, 25.5)| |[SciNCL](https://huggingface.co/malteos/scincl)|55.6|57.8|69.0|(32.6, 27.3)| |[SciRepEval-Adapters](https://huggingface.co/models?search=scirepeval)|61.9|59.0|70.9|(35.3, 29.6)| |[SPECTER 2.0-Adapters](https://huggingface.co/models?search=allenai/specter-2)|**62.3**|**59.2**|**71.2**|**(38.4, 33.0)**| Please cite the following works if you end up using SPECTER 2.0: [SPECTER paper](https://api.semanticscholar.org/CorpusID:215768677): ```bibtex @inproceedings{specter2020cohan, title={{SPECTER: Document-level Representation Learning using Citation-informed Transformers}}, author={Arman Cohan and Sergey Feldman and Iz Beltagy and Doug Downey and Daniel S. Weld}, booktitle={ACL}, year={2020} } ``` [SciRepEval paper](https://api.semanticscholar.org/CorpusID:254018137) ```bibtex @article{Singh2022SciRepEvalAM, title={SciRepEval: A Multi-Format Benchmark for Scientific Document Representations}, author={Amanpreet Singh and Mike D'Arcy and Arman Cohan and Doug Downey and Sergey Feldman}, journal={ArXiv}, year={2022}, volume={abs/2211.13308} } ```