Update README.md
README.md CHANGED
@@ -76,7 +76,7 @@ its implementation and the article from which it originated.
 
 ## Limitations
 
-
+[ASK ARTHUR]
 
 ### How to use
 
@@ -163,12 +163,40 @@ if __name__ == "__main__":
 
 ## Training data
 
-
-
-
-
-
-
+The model's transformers were trained on five datasets:
+- Scientific Papers (arXiv + PubMed): Cohan et al. (2018) observed that existing datasets offered either short texts (600 words on average) or longer texts with extractive human summaries. To fill that gap and provide a dataset of long documents for abstractive summarization, the authors compiled two new datasets of scientific papers from the arXiv and PubMed databases. Scientific papers are especially convenient for the kind of ATS the authors aim to achieve, both because of their length and because each paper already contains an abstractive summary written by its author, namely the abstract.
+- BIGPATENT: Sharma et al. (2019) introduced the BIGPATENT dataset, which provides good examples for the task of abstractive summarization. The dataset is built from Google Patents Public Datasets, where each document has one gold-standard summary: the patent's original abstract. One advantage of this dataset is that it avoids difficulties inherent to news summarization datasets, whose summaries have a flattened discourse structure and whose content appears at the beginning of the document.
+- CNN Corpus: Lins et al. (2019b) introduced this corpus to address the fact that most single-document news summarization datasets contain fewer than 1,000 documents. The CNN Corpus therefore comprises 3,000 documents with two gold-standard summaries each: one extractive and one abstractive. The inclusion of extractive gold-standard summaries is an advantage over datasets with similar goals, which usually contain only abstractive ones.
+- CNN/Daily Mail: Hermann et al. (2015) set out to develop a consistent method for what they called "teaching machines how to read", i.e., making a machine able to comprehend text via Natural Language Processing techniques. To that end, they collected around 400k news articles from CNN and Daily Mail and evaluated what they considered the key aspect of understanding a text: answering somewhat complex questions about it. Although ATS was not the authors' main focus, they drew inspiration from it for their model and included in the dataset the human-written summaries for each news article.
+- XSum: Narayan et al. (2018b) introduced this single-document dataset, which focuses on what the authors describe as extreme summarization: an abstractive kind of ATS aimed at answering the question "What is the document about?". The data was obtained from BBC articles, each accompanied by a short gold-standard summary often written by the article's own author.
 
 ## Training procedure
 
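The diff above only names the five training corpora, so here is a minimal loading sketch for anyone reproducing the setup. It is an assumption layered on top of the README, not part of the PR: the Hugging Face Hub identifiers (`scientific_papers`, `big_patent`, `cnn_dailymail`, `xsum`), their configurations, and the column names come from the public Hub copies of these corpora, and the CNN Corpus (Lins et al., 2019b) is left out because no Hub copy is assumed to exist.

```python
# Hedged sketch: load four of the five corpora from the Hugging Face Hub and
# mix them under a common schema. Hub IDs, configs, and column names are
# assumptions about the public copies, not something stated in this README.
from datasets import interleave_datasets, load_dataset

arxiv  = load_dataset("scientific_papers", "arxiv",  split="train")   # Cohan et al. (2018)
pubmed = load_dataset("scientific_papers", "pubmed", split="train")
patent = load_dataset("big_patent", "all", split="train")             # Sharma et al. (2019)
cnn_dm = load_dataset("cnn_dailymail", "3.0.0", split="train")        # Hermann et al. (2015)
xsum   = load_dataset("xsum", split="train")                          # Narayan et al. (2018b)

def unify(ds, doc_col, sum_col):
    """Map each corpus's document/summary pair onto one shared schema."""
    mapping = {c: n for c, n in ((doc_col, "document"), (sum_col, "summary")) if c != n}
    if mapping:
        ds = ds.rename_columns(mapping)
    # Drop any extra columns (e.g. ids, section names) so features match.
    return ds.remove_columns([c for c in ds.column_names if c not in ("document", "summary")])

# One plausible way to train on all corpora at once: interleave them so each
# training batch draws examples from every source.
mixed = interleave_datasets([
    unify(arxiv,  "article",     "abstract"),
    unify(pubmed, "article",     "abstract"),
    unify(patent, "description", "abstract"),
    unify(cnn_dm, "article",     "highlights"),
    unify(xsum,   "document",    "summary"),
])
print(mixed[0]["document"][:200], "->", mixed[0]["summary"][:200])
```

Older script-based datasets like these may additionally need `trust_remote_code=True` on recent versions of `datasets`, and interleaving is only one possible mixing strategy; the PR does not say how the corpora were actually combined.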