metadata
datasets:
- midas/krapivin
- midas/inspec
- midas/kptimes
- midas/duc2001
language:
- en
widget:
- text: >-
Relevance has traditionally been linked with feature subset selection, but
formalization of this link has not been attempted. In this paper, we
propose two axioms for feature subset selection, the sufficiency axiom and
the necessity axiom, based on which this link is formalized: The expected
feature subset is the one which maximizes relevance. Finding the expected
feature subset turns out to be NP-hard. We then devise a heuristic
algorithm to find the expected subset which has a polynomial time
complexity. The experimental results show that the algorithm finds good
enough subset of features which, when presented to C4.5, results in better
prediction accuracy.
- text: >-
In this paper, we investigate cross-domain limitations of keyphrase
generation using the models for abstractive text summarization. We present
an evaluation of BART fine-tuned for keyphrase generation across three
types of texts, namely scientific texts from computer science and
biomedical domains and news texts. We explore the role of transfer
learning between different domains to improve the model performance on
small text corpora.
BART fine-tuned for keyphrase generation
This is the bart-base (Lewis et al., 2019) model fine-tuned for the keyphrase generation task (Glazkova & Morozov, 2023) on fragments of the following corpora:
- Krapivin (Krapivin et al., 2009)
- Inspec (Hulth, 2003)
- KPTimes (Gallina, 2019)
- DUC-2001 (Wan, 2008)
- PubMed (Schutz, 2008)
- NamedKeys (Gero & Ho, 2019).
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the fine-tuned checkpoint and its tokenizer from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("aglazkova/bart_finetuned_keyphrase_extraction")
model = AutoModelForSeq2SeqLM.from_pretrained("aglazkova/bart_finetuned_keyphrase_extraction")

# Adjacent string literals are concatenated; each piece ends with a space so sentences do not run together.
text = (
    "In this paper, we investigate cross-domain limitations of keyphrase generation using the models for abstractive text summarization. "
    "We present an evaluation of BART fine-tuned for keyphrase generation across three types of texts, "
    "namely scientific texts from computer science and biomedical domains and news texts. "
    "We explore the role of transfer learning between different domains to improve the model performance on small text corpora."
)

# Calling the tokenizer directly replaces the deprecated prepare_seq2seq_batch.
inputs = tokenizer([text], return_tensors="pt", truncation=True)
generated_ids = model.generate(**inputs)
keyphrases = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(keyphrases)
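The snippet above prints the decoded output as a single string. A minimal sketch for turning that string into a list of keyphrases follows. It assumes the model separates keyphrases with commas, which this card does not state, so check the separator against real output first. The helper name `split_keyphrases` is hypothetical and not part of any library.

```python
def split_keyphrases(generated, sep=","):
    """Split a decoded generation into a list of unique keyphrases.

    Assumption: keyphrases are delimited by `sep` (a comma by default).
    Adjust `sep` if the model's output uses a different delimiter.
    """
    seen = set()
    result = []
    for piece in generated.split(sep):
        phrase = piece.strip()
        key = phrase.casefold()  # compare case-insensitively when deduplicating
        if phrase and key not in seen:
            seen.add(key)
            result.append(phrase)
    return result


if __name__ == "__main__":
    # Hand-written sample string, not actual model output.
    sample = "keyphrase generation, transfer learning, Keyphrase Generation, "
    print(split_keyphrases(sample))
```

Deduplication is included because sequence-to-sequence models can repeat a phrase within one generation; the first occurrence and its casing are kept.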
Training Hyperparameters
The following hyperparameters were used during training:
- learning_rate: 4e-5
- train_batch_size: 8
- optimizer: AdamW with betas=(0.9,0.999) and epsilon=1e-08
- num_epochs: 6
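For readers who want to reproduce a similar setup, the listed values can be expressed as keyword arguments for `Seq2SeqTrainingArguments` from `transformers`. This is a sketch under assumptions, not the authors' training script: the card does not say whether the batch size is per device, and the mapping of `train_batch_size` to `per_device_train_batch_size` is a guess. The names `TRAINING_KWARGS` and `build_training_args` are hypothetical.

```python
# Values copied from the hyperparameter list above. The keyword names follow
# transformers.TrainingArguments; how training was actually configured is an assumption.
TRAINING_KWARGS = {
    "learning_rate": 4e-5,
    "per_device_train_batch_size": 8,  # assumes the listed batch size is per device
    "num_train_epochs": 6,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
}


def build_training_args(output_dir):
    """Create Seq2SeqTrainingArguments from the values above.

    The import is local so the dictionary can be inspected without
    transformers installed.
    """
    from transformers import Seq2SeqTrainingArguments

    return Seq2SeqTrainingArguments(output_dir=output_dir, **TRAINING_KWARGS)
```

The resulting object can be passed to a `Seq2SeqTrainer` together with a model, tokenizer, and dataset of your own.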
BibTeX:
@InProceedings{10.1007/978-3-031-67826-4_19,
author="Glazkova, Anna
and Morozov, Dmitry",
title="Cross-Domain Robustness of Transformer-Based Keyphrase Generation",
booktitle="Data Analytics and Management in Data Intensive Domains",
year="2024",
publisher="Springer Nature Switzerland",
address="Cham",
pages="249--265"
}