![No Maintenance Intended](https://img.shields.io/badge/No%20Maintenance%20Intended-%E2%9C%95-red.svg)
![TensorFlow Requirement: 1.x](https://img.shields.io/badge/TensorFlow%20Requirement-1.x-brightgreen)
![TensorFlow 2 Not Supported](https://img.shields.io/badge/TensorFlow%202%20Not%20Supported-%E2%9C%95-red.svg)
# A Simple Method for Commonsense Reasoning
This repository contains code to reproduce results from [*A Simple Method for Commonsense Reasoning*](https://arxiv.org/abs/1806.02847).
Authors and contact:
* Trieu H. Trinh (thtrieu@google.com, github: thtrieu)
* Quoc V. Le (qvl@google.com)
## TL;DR
Commonsense reasoning is a long-standing challenge for deep learning. For example,
it is difficult to use neural networks to tackle the Winograd Schema dataset, a difficult subset of pronoun disambiguation problems. In this work, we use language models to score substituted sentences and thereby decide the correct referent of the ambiguous pronoun (see the Figure below for an example).
![Figure 1. Overview of our method.](method.jpg)
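For concreteness, here is a minimal sketch of the substitution-and-scoring idea, not the repository's actual implementation: substitute each candidate referent into the sentence, score each substituted sentence with a language model, and pick the candidate whose sentence scores highest. The `[PRONOUN]` placeholder and the `lm_log_prob` scoring function are illustrative assumptions.
```python
# A minimal sketch of the substitution-and-scoring idea, not the
# repository's actual code. `lm_log_prob` is a hypothetical function
# returning the log-probability a trained LM assigns to a token list.

def resolve_pronoun(sentence_template, candidates, lm_log_prob):
    """Pick the candidate whose substituted sentence the LM scores highest.

    sentence_template: the sentence with the ambiguous pronoun replaced
        by the placeholder "[PRONOUN]".
    candidates: possible referents, e.g. ["the trophy", "the suitcase"].
    """
    scores = [lm_log_prob(sentence_template.replace("[PRONOUN]", c).split())
              for c in candidates]
    return candidates[scores.index(max(scores))]
```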
This simple unsupervised method achieves new state-of-the-art results (*as of June 1st, 2018*) on both the PDP-60 and WSC-273 benchmarks (see the Table below), without rule-based reasoning or expensive annotated knowledge bases.
| Commonsense-reasoning test | Previous best result | Ours |
| ----------------------------|:----------------------:|:-----:|
| Pronoun Disambiguation      | 66.7%                  | 70.0% |
| Winograd Schema Challenge | 52.8% | 63.7% |
## Citation
If you use our released models below in your publication, please cite the original paper:
@article{TBD}
## Requirements
* Python >= 2.6
* TensorFlow >= 1.4
* NumPy >= 1.12.1
## Details of this release
The open-sourced components include:
* Test sets from the Pronoun Disambiguation Problem (PDP-60) and the Winograd Schema Challenge (WSC-273).
* TensorFlow metagraphs and checkpoints for 14 language models (see Appendix A in the paper).
* A vocabulary file.
* Code to reproduce results from the original paper.
## How to run
### 1. Download data files
Download all files from the [Google Cloud Storage of this project](https://console.cloud.google.com/storage/browser/commonsense-reasoning/). The easiest way is to install the `gsutil` command-line tool (see [install gsutil](https://cloud.google.com/storage/docs/gsutil_install)) and use its `cp` command.
```shell
# Download everything from the project gs://commonsense-reasoning
$ gsutil cp -R gs://commonsense-reasoning/* .
Copying gs://commonsense-reasoning/reproduce/vocab.txt...
Copying gs://commonsense-reasoning/reproduce/commonsense_test/pdp60.json...
Copying gs://commonsense-reasoning/reproduce/commonsense_test/wsc273.json...
...(omitted)
```
All downloaded content should be under `./reproduce/`. This includes the two test sets `pdp60.json` and `wsc273.json`, a vocabulary file `vocab.txt`, and checkpoints for all 14 language models, each consisting of three files (`.data`, `.index`, and `.meta`). All checkpoint names start with `ckpt-best` since each was saved at its best perplexity on a held-out text corpus.
```shell
# Check for the content
$ ls reproduce/*
reproduce/vocab.txt
reproduce/commonsense_test:
pdp60.json wsc273.json
reproduce/lm01:
ckpt-best.data-00000-of-00001 ckpt-best.index ckpt-best.meta
reproduce/lm02:
ckpt-best.data-00000-of-00001 ckpt-best.index ckpt-best.meta
reproduce/lm03:
ckpt-best.data-00000-of-00001 ckpt-best.index ckpt-best.meta
reproduce/lm04:
ckpt-best.data-00000-of-00001 ckpt-best.index ckpt-best.meta
reproduce/lm05:
ckpt-best.data-00000-of-00001 ckpt-best.index ckpt-best.meta
reproduce/lm06:
ckpt-best.data-00000-of-00001 ckpt-best.index ckpt-best.meta
reproduce/lm07:
ckpt-best.data-00000-of-00001 ckpt-best.index ckpt-best.meta
reproduce/lm08:
ckpt-best.data-00000-of-00001 ckpt-best.index ckpt-best.meta
reproduce/lm09:
ckpt-best.data-00000-of-00001 ckpt-best.index ckpt-best.meta
reproduce/lm10:
ckpt-best.data-00000-of-00001 ckpt-best.index ckpt-best.meta
reproduce/lm11:
ckpt-best.data-00000-of-00001 ckpt-best.index ckpt-best.meta
reproduce/lm12:
ckpt-best.data-00000-of-00001 ckpt-best.index ckpt-best.meta
reproduce/lm13:
ckpt-best.data-00000-of-00001 ckpt-best.index ckpt-best.meta
reproduce/lm14:
ckpt-best.data-00000-of-00001 ckpt-best.index ckpt-best.meta
```
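As a side note, each checkpoint can be restored on its own with TF 1.x. Below is a minimal sketch, assuming the `reproduce/lm01/ckpt-best` prefix from the listing above; it only rebuilds the graph from the `.meta` file and loads the weights, since the op and tensor names needed to actually query the model depend on the saved graph and are not shown here.
```python
# A minimal TF 1.x sketch of restoring one released checkpoint.
# The .meta file defines the graph; the checkpoint holds the weights.
import tensorflow as tf

ckpt_prefix = "reproduce/lm01/ckpt-best"  # prefix from the listing above

with tf.Session() as sess:
    saver = tf.train.import_meta_graph(ckpt_prefix + ".meta")  # graph structure
    saver.restore(sess, ckpt_prefix)                           # weights
    print("Restored from", ckpt_prefix)
```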
### 2. Run evaluation code
To reproduce results from the paper, simply run the `eval.py` script.
```shell
$ python eval.py --data_dir=reproduce
Restored from ./reproduce/lm01
Reset RNN states.
Processing patch (1, 1) / (2, 4)
Probs for
[['Then' 'Dad' 'figured' ..., 'man' "'s" 'board-bill']
['Then' 'Dad' 'figured' ..., 'man' "'s" 'board-bill']
['Always' 'before' ',' ..., 'now' ',' 'for']
...,
['Mark' 'was' 'close' ..., 'promising' 'him' ',']
['Mark' 'was' 'close' ..., 'promising' 'him' ',']
['Mark' 'was' 'close' ..., 'promising' 'him' ',']]
=
[[ 1.64250596e-05 1.77780055e-06 4.14267970e-06 ..., 1.87315454e-03
1.57723188e-01 6.31845817e-02]
[ 1.64250596e-05 1.77780055e-06 4.14267970e-06 ..., 1.87315454e-03
1.57723188e-01 6.31845817e-02]
[ 1.28243030e-07 3.80435935e-03 1.12383246e-01 ..., 9.67682712e-03
2.17407525e-01 1.08243264e-01]
...,
[ 1.15557734e-04 2.92792241e-03 3.46455898e-04 ..., 2.72328052e-05
3.37066874e-02 7.89367408e-02]
[ 1.15557734e-04 2.92792241e-03 3.46455898e-04 ..., 2.72328052e-05
3.37066874e-02 7.89367408e-02]
[ 1.15557734e-04 2.92792241e-03 3.46455898e-04 ..., 2.72328052e-05
3.37066874e-02 7.89367408e-02]]
Processing patch (1, 2) / (2, 4)
...(omitted)
Accuracy of 1 LM(s) on pdp60 = 0.6
...(omitted)
Accuracy of 5 LM(s) on pdp60 = 0.7
...(omitted)
Accuracy of 10 LM(s) on wsc273 = 0.615
...(omitted)
Accuracy of 14 LM(s) on wsc273 = 0.637
```
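The accuracy lines above reflect ensembling: as more language models are added, predictions improve. Here is a hedged sketch of the combination step with illustrative names only, using one plausible combination rule (summing per-model log-probabilities per candidate, then taking the argmax); see `eval.py` for the actual computation.
```python
# A hedged sketch of the ensembling suggested by the output above,
# with illustrative function and array names.
import numpy as np

def ensemble_accuracy(per_model_log_probs, labels):
    """per_model_log_probs: [num_models, num_examples, num_candidates]
    labels: [num_examples], index of the correct candidate."""
    combined = per_model_log_probs.sum(axis=0)   # [num_examples, num_candidates]
    predictions = combined.argmax(axis=1)        # [num_examples]
    return float((predictions == labels).mean())
```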