# Multi-doc News Headline Generation Model: NHNet
This repository contains TensorFlow 2.x implementation for NHNet [[1]](#1) as
well as instructions for producing the data we described in the paper.
## Introduction
NHNet is a multi-doc news headline generation model. It extends a standard
Transformer-based encoder-decoder model to the multi-doc setting and relies on an
article-level attention layer to capture information common to most (if not all)
input news articles in a news cluster or story, and to provide robustness against
potential outliers introduced by imperfect clustering.
Our academic paper [[1]](#1), which describes NHNet in detail, can be found here:
https://arxiv.org/abs/2001.09386.
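For intuition, here is a minimal, hedged sketch of the aggregation idea in
TensorFlow 2: each article in a cluster is encoded separately, and a learned
attention over the per-article representations yields a single cluster-level
vector that down-weights outliers. This is an illustrative simplification, not
the exact NHNet layer; all names and shapes below are assumptions.
```python
import tensorflow as tf


class ArticleLevelAttention(tf.keras.layers.Layer):
  """Simplified stand-in for NHNet's article-level attention (illustrative only)."""

  def __init__(self, **kwargs):
    super().__init__(**kwargs)
    # A single learned query scores each article representation.
    self.score = tf.keras.layers.Dense(1)

  def call(self, article_encodings, article_mask):
    # article_encodings: [batch, num_articles, hidden_size], one vector per article.
    # article_mask:      [batch, num_articles], 1 for real articles, 0 for padding.
    logits = tf.squeeze(self.score(article_encodings), axis=-1)
    logits += (1.0 - tf.cast(article_mask, logits.dtype)) * -1e9
    weights = tf.nn.softmax(logits, axis=-1)  # down-weights outlier articles
    # Weighted sum gives a single cluster-level representation.
    return tf.einsum('ba,bah->bh', weights, article_encodings)
```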
## Dataset
**Raw Data:** One can [download](https://github.com/google-research-datasets/NewSHead)
our multi-doc headline dataset, which
contains 369,940 news stories and 932,571 unique URLs. We split these stories
by timestamp into a train set (359,940 stories), a validation set (5,000 stories),
and a test set (5,000 stories).
For more information, please check out:
https://github.com/google-research-datasets/NewSHead
### Crawling
Unfortunately, we are not able to release the pre-processed dataset that is
used in the paper. Users need to crawl the URLs themselves; the recommended
pre-processing is to use an open-source library to download and parse the news
content, including the title and leading paragraphs. To ease this process, we
provide a config for [news-please](https://github.com/fhamborg/news-please) that
will crawl and extract news articles on a local machine.
First, install the `news-please` CLI (requires Python 3.x):
```shell
$ pip3 install news-please
```
Next, run the crawler with our provided [config and URL list](https://github.com/google-research-datasets/NewSHead/releases):
```shell
# Set to the path of the downloaded data folder.
$ DATA_FOLDER=/path/to/downloaded_dataset
# Use the CLI to crawl. We assume the news_please subfolder contains the
# decompressed config.cfg and sitelist.hjson.
$ news-please -c $DATA_FOLDER/news_please
```
By default, crawled articles are stored under `/tmp/nhnet/`. To terminate the
process, press `CTRL+C`.
Crawling may take several days (48 hours in our test), depending on the network
environment and the number of threads set in the config. Since the crawling tool
does not stop automatically, progress is not straightforward to track. We suggest
terminating the job once no new articles have been crawled for a short period
(e.g., 10 minutes), which can be checked by running
```shell
$ find /tmp/nhnet -type f | wc -l
```
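If preferred, a small helper like the hedged sketch below can automate that
check, polling the output directory every 10 minutes and reporting when the
article count stops growing (the path and interval simply mirror the suggestion
above):
```python
import pathlib
import time

CRAWL_DIR = pathlib.Path('/tmp/nhnet')  # default news-please output directory used above
CHECK_EVERY_SECONDS = 600               # 10 minutes, as suggested above


def count_articles():
  return sum(1 for p in CRAWL_DIR.rglob('*') if p.is_file())


previous = count_articles()
while True:
  time.sleep(CHECK_EVERY_SECONDS)
  current = count_articles()
  print(f'{current} files crawled (+{current - previous} since last check)')
  if current == previous:
    print('No new articles in the last 10 minutes; consider stopping the crawler.')
    break
  previous = current
```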
Please note that, as time goes by, some URLs are expected to no longer be
available on the web.
### Data Processing
Given the crawled articles under `/tmp/nhnet/`, we next transform these
textual articles into a set of `TFRecord` files containing serialized
`tensorflow.Example` protocol buffers, with feature keys following the BERT
[[2]](#2) convention but extended for multiple text segments. We will later
use these processed TFRecords for training and evaluation.
To do this, please first download a [BERT pretrained checkpoint](https://github.com/tensorflow/models/tree/master/official/nlp/bert#access-to-pretrained-checkpoints)
(`BERT-Base, Uncased` is preferred for efficiency) and decompress the `tar.gz` file.
We need the vocabulary file now and will later use the checkpoint for NHNet
initialization.
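The preprocessing script consumes the vocabulary internally, but as a reference,
a minimal sketch of WordPiece tokenization with this vocabulary (via the Model
Garden's BERT `tokenization` module; the exact module path may differ across
Model Garden versions) looks like:
```python
# Hedged sketch: how the BERT vocabulary is used for WordPiece tokenization.
# The module path below may differ across Model Garden versions.
from official.nlp.bert import tokenization

tokenizer = tokenization.FullTokenizer(
    vocab_file='/path/to/bert_checkpoint/vocab.txt',
    do_lower_case=True)  # matches -do_lower_case=True used below

tokens = tokenizer.tokenize('Generating representative headlines for news stories')
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(tokens)
print(token_ids)
```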
Next, run the following data preprocessing script, which may take a few hours
to read the files and tokenize article content:
```shell
# Recall that we use DATA_FOLDER=/path/to/downloaded_dataset.
$ python3 raw_data_preprocess.py \
-crawled_articles=/tmp/nhnet \
-vocab=/path/to/bert_checkpoint/vocab.txt \
-do_lower_case=True \
-len_title=15 \
-len_passage=200 \
-max_num_articles=5 \
-data_folder=$DATA_FOLDER
```
This Python script will export the processed train/valid/eval files under
`$DATA_FOLDER/processed/`.
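To sanity-check the output, a short hedged sketch like the one below lists the
feature keys of the first serialized example in the training split (no
particular key names are assumed; it simply prints whatever the script
produced):
```python
import tensorflow as tf

# Inspect one serialized tf.train.Example from the processed training data.
files = tf.io.gfile.glob('/path/to/downloaded_dataset/processed/train.tfrecord*')
dataset = tf.data.TFRecordDataset(files)

for record in dataset.take(1):
  example = tf.train.Example.FromString(record.numpy())
  for key, feature in example.features.feature.items():
    kind = feature.WhichOneof('kind')  # bytes_list, float_list, or int64_list
    print(key, kind, len(getattr(feature, kind).value))
```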
## Training
Please first install TensorFlow 2 and the TensorFlow Model Garden following the
[requirements section](https://github.com/tensorflow/models/tree/master/official#requirements).
### CPU/GPU
```shell
$ python3 trainer.py \
--mode=train_and_eval \
--vocab=/path/to/bert_checkpoint/vocab.txt \
--init_checkpoint=/path/to/bert_checkpoint/bert_model.ckpt \
--params_override='init_from_bert2bert=false' \
--train_file_pattern=$DATA_FOLDER/processed/train.tfrecord* \
--model_dir=/path/to/output/model \
--len_title=15 \
--len_passage=200 \
--max_num_articles=5 \
--model_type=nhnet \
--train_batch_size=16 \
--train_steps=10000 \
--steps_per_loop=1 \
--checkpoint_interval=100
```
### TPU
```shell
$ python3 trainer.py \
--mode=train_and_eval \
--vocab=/path/to/bert_checkpoint/vocab.txt \
--init_checkpoint=/path/to/bert_checkpoint/bert_model.ckpt \
--params_override='init_from_bert2bert=false' \
--train_file_pattern=$DATA_FOLDER/processed/train.tfrecord* \
--model_dir=/path/to/output/model \
--len_title=15 \
--len_passage=200 \
--max_num_articles=5 \
--model_type=nhnet \
--train_batch_size=1024 \
--train_steps=10000 \
--steps_per_loop=1000 \
--checkpoint_interval=1000 \
--distribution_strategy=tpu \
--tpu=grpc://${TPU_IP_ADDRESS}:8470
```
In the paper, we train for more than 10k steps with a batch size of 1024 on
TPU v3-64.
Note that `trainer.py` also supports a `train` mode and a continuous `eval` mode.
For large-scale TPU training, we recommend having one process run the
`train` mode and another process run the continuous `eval` mode, which can
run on GPUs.
This is the setting we commonly use for large-scale experiments, because `eval`
is non-blocking to the expensive training load.
### Metrics
**Note: the metrics reported by `evaluation.py` are approximated on
word-piece level rather than the real string tokens. Some metrics like BLEU
scores can be off.**
We will release a colab to evaluate results on string-level soon.
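In the meantime, string-level scores can be approximated with an external
package such as `rouge-score` (not part of this repository); below is a hedged
sketch, assuming predictions and references have already been detokenized into
plain strings:
```python
# Requires: pip3 install rouge-score (external package, not part of this repo).
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

# Hypothetical detokenized reference headline and model output.
reference = 'wildfires force thousands to evacuate in northern california'
prediction = 'thousands evacuate as northern california wildfires spread'

for name, score in scorer.score(reference, prediction).items():
  print(f'{name}: precision={score.precision:.3f} '
        f'recall={score.recall:.3f} f1={score.fmeasure:.3f}')
```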
## References
<a id="1">[1]</a> Xiaotao Gu, Yuning Mao, Jiawei Han, Jialu Liu, You Wu, Cong
Yu, Daniel Finnie, Hongkun Yu, Jiaqi Zhai and Nicholas Zukoski "Generating
Representative Headlines for News Stories": https://arxiv.org/abs/2001.09386.
World Wide Web Conf. (WWW’2020).
<a id="2">[2]</a> Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina
Toutanova "BERT: Pre-training of Deep Bidirectional Transformers for Language
Understanding": https://arxiv.org/abs/1810.04805.