Update README.md
## Training data

The model's transformers were trained on the following five datasets:

- **Scientific Papers (arXiv + PubMed)**: Cohan et al. (2018) found that existing datasets offered either short texts (600 words on average) or longer texts with extractive human-made summaries. To fill this gap and provide a dataset of long documents for abstractive summarization, the authors compiled two new datasets of scientific papers from the arXiv and PubMed databases. Scientific papers are especially convenient for the kind of ATS the authors mean to achieve, due to their large length and the fact that each one contains an abstractive summary made by its own author, i.e., the paper's abstract.

- **BIGPATENT**: Sharma et al. (2019) introduced the BIGPATENT dataset, which provides good examples for the task of abstractive summarization. The dataset is built from the Google Patents Public Datasets, where each document has one gold-standard summary, namely the patent's original abstract. One advantage of this dataset is that it avoids the difficulties inherent to news summarization datasets, where summaries have a flattened discourse structure and the summary content appears at the beginning of the document.

- **CNN Corpus**: Lins et al. (2019b) introduced this corpus to fill the gap left by the fact that most single-document news summarization datasets have fewer than 1,000 documents. The CNN-Corpus contains 3,000 single documents, each with two gold-standard summaries: one extractive and one abstractive. The inclusion of extractive gold-standard summaries is an advantage of this dataset over others with similar goals, which usually contain only abstractive ones.

- **CNN/Daily Mail**: Hermann et al. (2015) intended to develop a consistent method for what they called "teaching machines how to read", i.e., making a machine comprehend text via Natural Language Processing techniques. To perform that task, they collected around 400k news articles from CNN and the Daily Mail and evaluated what they considered to be the key aspect of understanding a text, namely answering somewhat complex questions about it. Even though ATS was not the authors' main focus, they took inspiration from it when developing their model and included in their dataset the human-made summaries for each news article.

- **XSum**: Narayan et al. (2018b) introduced this single-document dataset, which focuses on a kind of summarization the authors describe as extreme summarization: an abstractive kind of ATS aimed at answering the question "What is the document about?". The data was obtained from BBC articles, each accompanied by a short gold-standard summary often written by its very author.
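The extractive vs. abstractive distinction drawn above (e.g., the two gold-standard summaries per document in the CNN Corpus) can be made concrete with a small check: an extractive summary reuses sentences verbatim from the document, while an abstractive one paraphrases. A minimal sketch, with a naive period-based sentence splitter of our own invention (real corpora need a proper sentence tokenizer):

```python
def is_extractive(document: str, summary: str) -> bool:
    """Return True if every summary sentence appears verbatim in the document."""
    # Naive split on '.'; illustrative only, not the corpus's actual tooling.
    sentences = [s.strip() for s in summary.split(".") if s.strip()]
    return all(s in document for s in sentences)

doc = "The cat sat on the mat. It purred loudly. Then it slept."
print(is_extractive(doc, "The cat sat on the mat. Then it slept."))      # True: copied verbatim
print(is_extractive(doc, "A cat rested on a mat and fell asleep."))      # False: paraphrased
```
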

## Training procedure

### Preprocessing

[ASK ARTHUR]

## Evaluation results