Update README.md
README.md CHANGED
@@ -76,7 +76,7 @@ its implementation and the article from which it originated.
 
 ## Limitations
 
-
+[ASK ARTHUR]
 
 ### How to use
 
@@ -163,12 +163,40 @@ if __name__ == "__main__":
 
 ## Training data
 
-
-
-
-
-
-
+The model's transformers were trained on five datasets:
+- Scientific Papers (arXiv + PubMed): Cohan et al. (2018) observed that existing datasets offered either short texts (600 words on average) or longer texts with extractive human summaries. To fill that gap and provide a dataset of long documents for abstractive summarization, the authors compiled two new datasets of scientific papers from the arXiv and PubMed databases. Scientific papers are especially convenient for the kind of ATS the authors aim to achieve, both because of their length and because each paper already contains an abstractive summary written by its author, namely the abstract.
+- BIGPATENT: Sharma et al. (2019) introduced the BIGPATENT dataset, which provides good examples for the task of abstractive summarization. The dataset is built from Google Patents Public Datasets, where each document has one gold-standard summary: the patent's original abstract. One advantage of this dataset is that it avoids difficulties inherent to news summarization datasets, whose summaries have a flattened discourse structure and whose content appears at the beginning of the document.
+- CNN Corpus: Lins et al. (2019b) introduced this corpus to address the fact that most single-document news summarization datasets contain fewer than 1,000 documents. The CNN Corpus therefore comprises 3,000 documents with two gold-standard summaries each: one extractive and one abstractive. The inclusion of extractive gold-standard summaries is an advantage over datasets with similar goals, which usually contain only abstractive ones.
+- CNN/Daily Mail: Hermann et al. (2015) set out to develop a consistent method for what they called "teaching machines how to read", i.e., making a machine able to comprehend text via Natural Language Processing techniques. To that end, they collected around 400k news articles from CNN and Daily Mail and evaluated what they considered the key aspect of understanding a text: answering somewhat complex questions about it. Although ATS was not the authors' main focus, they drew inspiration from it for their model and included in the dataset the human-written summaries for each news article.
+- XSum: Narayan et al. (2018b) introduced this single-document dataset, which focuses on what the authors describe as extreme summarization: an abstractive kind of ATS aimed at answering the question "What is the document about?". The data was obtained from BBC articles, each accompanied by a short gold-standard summary often written by the article's own author.
 
 ## Training procedure
 
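The diff above only names the five training corpora, so here is a minimal loading sketch for anyone reproducing the setup. It is an assumption layered on top of the README, not part of the PR: the Hugging Face Hub identifiers (`scientific_papers`, `big_patent`, `cnn_dailymail`, `xsum`), their configurations, and the column names come from the public Hub copies of these corpora, and the CNN Corpus (Lins et al., 2019b) is left out because no Hub copy is assumed to exist.

```python
# Hedged sketch: load four of the five corpora from the Hugging Face Hub and
# mix them under a common schema. Hub IDs, configs, and column names are
# assumptions about the public copies, not something stated in this README.
from datasets import interleave_datasets, load_dataset

arxiv  = load_dataset("scientific_papers", "arxiv",  split="train")   # Cohan et al. (2018)
pubmed = load_dataset("scientific_papers", "pubmed", split="train")
patent = load_dataset("big_patent", "all", split="train")             # Sharma et al. (2019)
cnn_dm = load_dataset("cnn_dailymail", "3.0.0", split="train")        # Hermann et al. (2015)
xsum   = load_dataset("xsum", split="train")                          # Narayan et al. (2018b)

def unify(ds, doc_col, sum_col):
    """Map each corpus's document/summary pair onto one shared schema."""
    mapping = {c: n for c, n in ((doc_col, "document"), (sum_col, "summary")) if c != n}
    if mapping:
        ds = ds.rename_columns(mapping)
    # Drop any extra columns (e.g. ids, section names) so features match.
    return ds.remove_columns([c for c in ds.column_names if c not in ("document", "summary")])

# One plausible way to train on all corpora at once: interleave them so each
# training batch draws examples from every source.
mixed = interleave_datasets([
    unify(arxiv,  "article",     "abstract"),
    unify(pubmed, "article",     "abstract"),
    unify(patent, "description", "abstract"),
    unify(cnn_dm, "article",     "highlights"),
    unify(xsum,   "document",    "summary"),
])
print(mixed[0]["document"][:200], "->", mixed[0]["summary"][:200])
```

Older script-based datasets like these may additionally need `trust_remote_code=True` on recent versions of `datasets`, and interleaving is only one possible mixing strategy; the PR does not say how the corpora were actually combined.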