# Cross-View Training
This repository contains code for *Semi-Supervised Sequence Modeling with Cross-View Training*. Currently, sequence tagging and dependency parsing tasks are supported.
## Requirements
* [TensorFlow](https://www.tensorflow.org/)
* [NumPy](http://www.numpy.org/)

This code has been run with TensorFlow 1.10.1 and NumPy 1.14.5; other versions may work but have not been tested.
## Fetching and Preprocessing Data
Run `fetch_data.sh` to download and extract pretrained [GloVe](https://nlp.stanford.edu/projects/glove/) vectors, the [1 Billion Word Language Model Benchmark](http://www.statmt.org/lm-benchmark/) corpus of unlabeled data, and the CoNLL-2000 [text chunking](https://www.clips.uantwerpen.be/conll2000/chunking/) dataset. Unfortunately, the other datasets from our paper are not freely available, so they can't be included in this repository.
To apply CVT to other datasets, place the data in `data/raw_data/<task_name>/(train|dev|test).txt`. For sequence tagging data, each line should contain a word, followed by a space, followed by that word's tag. Sentences should be separated by empty lines. For dependency parsing, each tag should be of the form `<index_of_head>-<relation>` (e.g., `0-root`).
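
As a way to sanity-check a new dataset before preprocessing, the format described above can be read with a few lines of Python. This is a minimal sketch and not part of the repository: the function names `read_tagged_sentences` and `split_dependency_tag` are hypothetical, and it assumes that tags contain no spaces and that the head index is a plain integer.

```python
from typing import List, Tuple


def read_tagged_sentences(text: str) -> List[List[Tuple[str, str]]]:
    """Parse 'word tag' lines into sentences; empty lines separate sentences."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # An empty line closes the current sentence, if there is one.
            if current:
                sentences.append(current)
                current = []
            continue
        # Split on the last space (assumption: the tag itself has no spaces).
        word, tag = line.rsplit(" ", 1)
        current.append((word, tag))
    if current:
        sentences.append(current)
    return sentences


def split_dependency_tag(tag: str) -> Tuple[int, str]:
    """Split '<index_of_head>-<relation>' into (head index, relation)."""
    # Split on the first hyphen only, so relations may contain hyphens.
    head, relation = tag.split("-", 1)
    return int(head), relation


if __name__ == "__main__":
    sample = "He B-NP\nreckons B-VP\n\nStop B-VP\n"
    print(read_tagged_sentences(sample))
    print(split_dependency_tag("0-root"))
```

Running a check like this over `train.txt`, `dev.txt`, and `test.txt` should surface malformed lines (for example, a missing tag raises a `ValueError`) before they reach preprocessing.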
After all of the data has been downloaded, run `preprocessing.py`.
## Training a Model
Run `python cvt.py --mode=train --model_name=chunking_model`. By default, this trains a model on the chunking data downloaded with `fetch_data.sh`. To change the model hyperparameters or which task(s) are trained on, modify [base/configure.py](base/configure.py). Models are automatically checkpointed every 1000 steps; if training is interrupted and restarted, it resumes from the latest checkpoint. Model checkpoints and other data, such as dev set accuracy over time, are stored in `data/models/<model_name>`.
## Evaluating a Model
Run `python cvt.py --mode=eval --model_name=chunking_model`. A CVT model trained on the chunking data for 200k steps should get at least 97.1 F1 on the dev set and 96.6 F1 on the test set.
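
For readers unfamiliar with the metric, chunking F1 is conventionally computed over whole chunks rather than individual tokens. The sketch below illustrates that convention under stated assumptions: tags follow a BIO scheme (`B-NP`, `I-NP`, `O`), and a predicted chunk counts as correct only if its boundaries and label both match. The helpers `bio_spans` and `span_f1` are hypothetical names, and this is not the scorer used by `cvt.py`.

```python
from typing import List, Set, Tuple


def bio_spans(tags: List[str]) -> Set[Tuple[int, int, str]]:
    """Collect (start, end_exclusive, label) chunks from a BIO tag sequence."""
    spans, start, label = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing chunk
        continues = tag.startswith("I-") and start is not None and tag[2:] == label
        if not continues:
            if start is not None:
                spans.add((start, i, label))
                start, label = None, None
            # Lenient choice: an "I-" tag with no open chunk starts a new one.
            if tag.startswith("B-") or tag.startswith("I-"):
                start, label = i, tag[2:]
    return spans


def span_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Chunk-level F1 (0 to 100) over parallel lists of tag sequences."""
    n_correct = n_gold = n_pred = 0
    for g, p in zip(gold, pred):
        g_spans, p_spans = bio_spans(g), bio_spans(p)
        n_correct += len(g_spans & p_spans)
        n_gold += len(g_spans)
        n_pred += len(p_spans)
    if n_correct == 0:
        return 0.0
    precision, recall = n_correct / n_pred, n_correct / n_gold
    return 100.0 * 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gold = [["B-NP", "I-NP", "O", "B-VP"]]
    pred = [["B-NP", "I-NP", "O", "B-NP"]]
    print(span_f1(gold, pred))
```

In this example one of the two predicted chunks matches a gold chunk, so precision and recall are both 0.5.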
## Citation
If you use this code for your publication, please cite the original paper:
``` | |
@inproceedings{clark2018semi, | |
title = {Semi-Supervised Sequence Modeling with Cross-View Training}, | |
author = {Kevin Clark and Minh-Thang Luong and Christopher D. Manning and Quoc V. Le}, | |
booktitle = {EMNLP}, | |
year = {2018} | |
} | |
``` | |
## Contact
* [Kevin Clark](https://cs.stanford.edu/~kevclark/) (@clarkkev). | |
* [Thang Luong](https://nlp.stanford.edu/~lmthang/) (@lmthang). | |