|
--- |
|
language: en |
|
tags: |
|
- Summarization |
|
license: apache-2.0 |
|
datasets: |
|
- scientific_papers |
|
- big_patent |
|
- cnn_corpus |
|
- cnn_dailymail |
|
- xsum |
|
- MCTI_data |
|
thumbnail: https://github.com/Marcosdib/S2Query/Classification_Architecture_model.png |
|
--- |
|
|
|
 |
|
|
|
|
|
# MCTI Automatic Text Summarization Task (uncased) DRAFT
|
|
|
Disclaimer: |
|
|
|
## From the abstract of the literature review
|
|
|
- We provide a literature review about Automatic Text Summarization (ATS) systems. We consider a citation-based approach. We start with some popular and well-known papers that we have in hand about each topic we want to cover, and we track the "backward citations" (papers that are cited by the set of papers we knew beforehand) and the "forward citations" (newer papers that cite the set of papers we knew beforehand). In order to organize the different methods, we present the diverse approaches to ATS guided by the mechanisms they use to generate a summary. Besides presenting the methods, we also present an extensive review of the datasets available for summarization tasks and of the methods used to evaluate the quality of the summaries. Finally, we present an empirical exploration of these methods using the CNN Corpus dataset, which provides golden summaries for extractive and abstractive methods.
|
|
|
This model is an end result of the literature review mentioned above, from which the best solution was drawn to be applied to the problem of summarizing texts extracted from the Research Financing Products Portfolio (FPP) of the Brazilian Ministry of Science, Technology, and Innovation (MCTI). It was first released in [this repository](https://huggingface.co/unb-lamfo-nlp-mcti), along with the other models used to address the given problem.
|
|
|
## Model description |
|
|
|
This Automatic Text Summarization (ATS) model was developed in Python to be applied to the Research Financing Products Portfolio (FPP) of the Brazilian Ministry of Science, Technology, and Innovation. It was produced in parallel with the writing of a Systematic Literature Review paper, which discusses many summarization methods, datasets, and evaluators, and gives a brief overview of the nature of the task itself and the state of the art of its implementation.
|
|
|
The input of the model can be a single text, a dataframe, or a CSV file containing multiple texts (in English), and its output is the summarized texts along with their evaluation metrics. As an optional (although recommended) input, the model accepts gold-standard summaries for the texts, i.e., human-written (or extracted) summaries of the texts which are considered good representations of their contents. Evaluators like ROUGE, which in its many variations is the most widely used for this task, require gold-standard summaries as inputs. There are, however, evaluation methods which do not depend on the existence of a gold-standard summary (e.g., the cosine similarity method and the Kullback-Leibler divergence method), which is why an evaluation can be made even when only the text is given as input to the model.
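To illustrate the reference-free case, the sketch below scores a summary by the cosine similarity between TF-IDF vectors of the source text and the summary. This is a minimal sketch assuming scikit-learn, not the exact evaluation code used by the model:

```python
# Minimal sketch of reference-free evaluation: cosine similarity between
# TF-IDF vectors of the source text and its summary (no gold summary needed)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def reference_free_score(text: str, summary: str) -> float:
    # Fit the vectorizer on both documents so they share one vocabulary
    vectors = TfidfVectorizer().fit_transform([text, summary])
    return float(cosine_similarity(vectors[0], vectors[1])[0, 0])
```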
|
|
|
The text output is produced by a chosen ATS method, which can be extractive (the summary is built from the most relevant sentences of the source document) or abstractive (the summary is written from scratch). The latter is achieved by means of transformers, and the ones present in the model are the already existing and widely applied BART-Large CNN, Pegasus-XSUM, and mT5 Multilingual XLSUM. The extractive methods are taken from the Sumy Python library and include SumyRandom, SumyLuhn, SumyLsa, SumyLexRank, SumyTextRank, SumySumBasic, SumyKL, and SumyReduction. Each of the methods used for text summarization is described individually in the following section.
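For reference, the abstractive checkpoints can be run through the standard `transformers` summarization pipeline. The snippet below is a minimal sketch using `facebook/bart-large-cnn`; the input text and the generation parameters (`max_length`, `min_length`) are illustrative values only:

```python
from transformers import pipeline

# Load one of the abstractive checkpoints listed above
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

text = (
    "The Research Financing Products Portfolio of the Brazilian Ministry of "
    "Science, Technology, and Innovation gathers funding opportunities for "
    "research projects. Its documents are long and benefit from automatic "
    "summarization before being presented to researchers."
)
result = summarizer(text, max_length=60, min_length=15, do_sample=False)
print(result[0]["summary_text"])
```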
|
|
|
## Methods |
|
|
|
Since there are many methods to choose from when performing the ATS task with this model, the following table presents useful information about each of them, such as the kind of ATS the method produces (extractive or abstractive), where to find the documentation necessary for its implementation, and the article from which it originated. A short usage sketch of the extractive interface follows the table.
|
|
|
| Method | Kind of ATS | Documentation | Source Article | |
|
|:----------------------:|:-----------:|:---------------:|:--------------:| |
|
| SumyRandom | Extractive | [Sumy GitHub](https://github.com/miso-belica/sumy/) | None (picks out random sentences from source text) | |
|
| SumyLuhn | Extractive | Ibid. | [(Luhn, 1958)](http://www.di.ubi.pt/~jpaulo/competence/general/%281958%29Luhn.pdf) | |
|
| SumyLsa | Extractive | Ibid. | [(Steinberger et al., 2004)](http://www.kiv.zcu.cz/~jstein/publikace/isim2004.pdf) | |
|
| SumyLexRank | Extractive | Ibid. | [(Erkan and Radev, 2004)](http://tangra.si.umich.edu/~radev/lexrank/lexrank.pdf) | |
|
| SumyTextRank | Extractive | Ibid. | [(Mihalcea and Tarau, 2004)](https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf) | |
|
| SumySumBasic | Extractive | Ibid. | [(Vanderwende et al., 2007)](http://www.cis.upenn.edu/~nenkova/papers/ipm.pdf) |
|
| SumyKL | Extractive | Ibid. | [(Haghighi and Vanderwende, 2009)](http://www.aclweb.org/anthology/N09-1041) | |
|
| SumyReduction | Extractive | Ibid. | None |
|
| BART-Large CNN | Abstractive | [facebook/bart-large-cnn](https://huggingface.co/facebook/bart-large-cnn) | [(Lewis et al., 2019)](https://arxiv.org/pdf/1910.13461) | |
|
| Pegasus-XSUM | Abstractive | [google/pegasus-xsum](https://huggingface.co/google/pegasus-xsum) | [(Zhang et al., 2020)](http://proceedings.mlr.press/v119/zhang20ae/zhang20ae.pdf) | |
|
| mT5 Multilingual XLSUM | Abstractive | [csebuetnlp/mT5_multilingual_XLSum](https://huggingface.co/csebuetnlp/mT5_multilingual_XLSum)| [(Raffel et al., 2019)](https://www.jmlr.org/papers/volume21/20-074/20-074.pdf?ref=https://githubhelp.com) | |
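As a usage illustration of the extractive side, the sketch below runs one of the Sumy methods (LexRank; the other Sumy summarizers expose the same call interface). It is a minimal sketch of the Sumy API itself, not of this model's wrapper code:

```python
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer  # requires NLTK's "punkt" data
from sumy.summarizers.lex_rank import LexRankSummarizer

text = "Put the source document (in English) here. It should contain several sentences."

# Parse the raw text and extract its 3 most relevant sentences
parser = PlaintextParser.from_string(text, Tokenizer("english"))
summarizer = LexRankSummarizer()
for sentence in summarizer(parser.document, sentences_count=3):
    print(sentence)
```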
|
|
|
|
|
## Limitations |
|
|
|
[TO DO: ask Arthur]
|
|
|
### How to use |
|
|
|
First, some libraries need to be imported for the program to work:
|
|
|
```python
# Standard library
import itertools as it
import os
import re
import shutil
import threading

# Third-party libraries
import more_itertools as mit
import numpy as np
import pandas as pd
import regex
from alive_progress import alive_bar
from bs4 import BeautifulSoup
from datasets import load_dataset
```
|
If any of these libraries is not installed on the user's machine, it can be installed from the command line with:
|
|
|
```bash
pip install [LIBRARY]
```
|
|
|
To run the code on the given corpora, the following lines of code need to be inserted. If one or more corpora, summarizers, or evaluators are not to be applied, the user should comment out the unwanted options.
|
|
|
```python
if __name__ == "__main__":

    # Corpora to be summarized; comment out any unwanted option
    corpora = [
        "mcti_data",
        "cnn_dailymail",
        "big_patent",
        "cnn_corpus_abstractive",
        "cnn_corpus_extractive",
        "xsum",
        "arxiv_pubmed",
    ]

    # Summarization methods to be applied
    summarizers = [
        "SumyRandom",
        "SumyLuhn",
        "SumyLsa",
        "SumyLexRank",
        "SumyTextRank",
        "SumySumBasic",
        "SumyKL",
        "SumyReduction",
        "Transformers-facebook/bart-large-cnn",
        "Transformers-google/pegasus-xsum",
        "Transformers-csebuetnlp/mT5_multilingual_XLSum",
    ]

    # Evaluation metrics to be computed
    metrics = [
        "rouge",
        "gensim",
        "nltk",
        "sklearn",
    ]

    # Running methods and evaluation locally; Data, Method, and Evaluator
    # are the classes provided by the accompanying repository code
    reader = Data()
    reader.show_available_databases()
    for corpus in corpora:
        data = reader.read_data(corpus, 50)  # read 50 documents from the corpus
        method = Method(data, corpus)
        method.show_methods()
        for summarizer in summarizers:
            df = method.run(summarizer)
            method.examples_to_csv()
            evaluator = Evaluator(df, summarizer, corpus)
            for metric in metrics:
                evaluator.run(metric)
            evaluator.metrics_to_csv()
        evaluator.join_all_results()
```
|
|
|
### Preprocessing |
|
[TO DO: ask Arthur]
|
|
|
|
## Datasets |
|
|
|
In order to evaluate the model, summaries were generated by each of its summarization methods, using as source texts documents obtained from existing datasets. The datasets chosen for evaluation were the following (a short loading sketch follows the list):
|
|
|
- **Scientific Papers (arXiv + PubMed)**: Cohan et al. (2018) found that the existing datasets either contained short texts (with an average of 600 words) or contained longer texts with extractive human summaries. In order to fill this gap and provide a dataset with long documents for abstractive summarization, the authors compiled two new datasets with scientific papers from the arXiv and PubMed databases. Scientific papers are especially convenient for the kind of ATS the authors mean to achieve, due to their large length and the fact that each one contains an abstractive summary made by its author, i.e., the paper's abstract.
|
- **BIGPATENT**: Sharma et al. (2019) introduced the BIGPATENT dataset, which provides good examples for the task of abstractive summarization. The dataset is built using Google Patents Public Datasets, where for each document there is one gold-standard summary, namely the patent's original abstract. One advantage of this dataset is that it does not present the difficulties inherent to news summarization datasets, where summaries have a flattened discourse structure and the summary content appears at the beginning of the document.
|
- **CNN Corpus**: Lins et al. (2019b) introduced this corpus to fill the gap left by most single-document news summarization datasets, which have fewer than 1,000 documents. The CNN-Corpus dataset thus contains 3,000 single documents with two gold-standard summaries each: one extractive and one abstractive. The inclusion of extractive gold-standard summaries is also an advantage of this particular dataset over others with similar goals, which usually contain only abstractive ones.
|
- **CNN/Daily Mail**: Hermann et al. (2015) intended to develop a consistent method for what they called "teaching machines how to read", i.e., making a machine able to comprehend a text via Natural Language Processing techniques. To perform that task, they collected around 400k news articles from CNN and the Daily Mail and evaluated what they considered to be the key aspect of understanding a text, namely the answering of somewhat complex questions about it. Even though ATS was not the authors' main focus, they took inspiration from it to develop their model and included in the dataset the human-made summaries for each news article.
|
- **XSum**: Narayan et al. (2018b) introduced this single-document dataset, which focuses on a kind of summarization described by the authors as extreme summarization: an abstractive kind of ATS aimed at answering the question "What is the document about?". The data was obtained from BBC articles, each of which is accompanied by a short gold-standard summary, often written by its own author.
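Several of these datasets are available on the Hugging Face Hub and can be loaded directly with the `datasets` library. The snippet below is a minimal sketch; the dataset names and configurations are the ones published on the Hub, not necessarily the exact copies used in this work:

```python
from datasets import load_dataset

# CNN/Daily Mail: news articles paired with human-written highlights
cnn_dm = load_dataset("cnn_dailymail", "3.0.0", split="test")

# XSum: BBC articles with one-sentence abstractive summaries
xsum = load_dataset("xsum", split="test")

print(cnn_dm[0]["article"][:300])
print("Gold summary:", cnn_dm[0]["highlights"])
```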
|
|
|
## Evaluation results |
|
|
|
Each document in the datasets was summarized by every summarization method applied in the code and evaluated against the corresponding gold-standard summaries.
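As a reference for the gold-standard comparison, ROUGE scores can be computed as in the sketch below, which assumes the `rouge-score` package rather than this model's own evaluation wrapper:

```python
from rouge_score import rouge_scorer

# Compare a candidate summary against a gold-standard (reference) summary
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
reference = "The ministry released a new funding portfolio for research projects."
candidate = "A new research funding portfolio was released by the ministry."
for name, score in scorer.score(reference, candidate).items():
    print(name, f"F1 = {score.fmeasure:.3f}")
```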
|
|
|
Table 2: Results from Pre-trained Longformer + ML models. |
|
|
|
| ML Model | Accuracy | F1 Score | Precision | Recall | |
|
|:--------:|:--------:|:--------:|:---------:|:------:|
|
| NN | 0.8269 | 0.8754 |0.7950 | 0.9773 | |
|
| DNN | 0.8462 | 0.8776 |0.8474 | 0.9123 | |
|
| CNN | 0.8462 | 0.8776 |0.8474 | 0.9123 | |
|
| LSTM | 0.8269 | 0.8801 |0.8571 | 0.9091 | |
|
|
|
|
|
|
|
|
|
|
### BibTeX entry and citation info |
|
|
|
```bibtex |
|
@conference{webist22,
  author    = {Daniel O. Cajueiro and Maísa {Kely de Melo} and Arthur G. Nery and Silvia A. dos Reis and Igor Tavares and Li Weigang and Victor R. R. Celestino},
  title     = {A comprehensive review of automatic text summarization techniques: method, data, evaluation and coding},
  booktitle = {Proceedings of the 18th International Conference on Web Information Systems and Technologies - WEBIST},
  year      = {2022},
}
|
``` |
|
|
|
|
|