	# Cross-View Training

	This repository contains code for Semi-Supervised Sequence Modeling with Cross-View Training. Currently sequence tagging and dependency parsing tasks are supported.

	## Requirements
	* [Tensorflow](https://www.tensorflow.org/)
	* [Numpy](http://www.numpy.org/)

	This code has been run with TensorFlow 1.10.1 and Numpy 1.14.5; other versions may work, but have not been tested.

	## Fetching and Preprocessing Data
	Run `fetch_data.sh` to download and extract pretrained [GloVe](https://nlp.stanford.edu/projects/glove/) vectors, the [1 Billion Word Language Model Benchmark](http://www.statmt.org/lm-benchmark/) corpus of unlabeled data, and the CoNLL-2000 [text chunking](https://www.clips.uantwerpen.be/conll2000/chunking/) dataset. Unfortunately, the other datasets from our paper are not freely available, so they can't be included in this repository.

	To apply CVT to other datasets, the data should be placed in `data/raw_data/<task_name>/(train|dev|test).txt`. For sequence tagging data, each line should contain a word followed by a space followed by that word's tag. Sentences should be separated by empty lines. For dependency parsing, each tag should be of the form `<index_of_head>-<relation>` (e.g., `0-root`).
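The file format described above can be checked with a small reader before running preprocessing. The following is a minimal sketch, not part of this repository: the helper names `read_tagged_sentences` and `split_dependency_tag` are hypothetical, and it assumes the word and tag are separated by a single space and that the head index precedes the first hyphen in a dependency tag.

```python
from typing import Iterable, Iterator, List, Tuple


def read_tagged_sentences(lines: Iterable[str]) -> Iterator[List[Tuple[str, str]]]:
    """Yield sentences as lists of (word, tag) pairs.

    Hypothetical helper: assumes one "word tag" pair per line and an
    empty line between sentences, as described in this README.
    """
    sentence: List[Tuple[str, str]] = []
    for line in lines:
        line = line.strip()
        if not line:
            # An empty line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        # Split on the last space so the tag is always the final field.
        word, tag = line.rsplit(" ", 1)
        sentence.append((word, tag))
    if sentence:
        # The final sentence may not be followed by an empty line.
        yield sentence


def split_dependency_tag(tag: str) -> Tuple[int, str]:
    """Split a "<index_of_head>-<relation>" tag into (head_index, relation)."""
    head, relation = tag.split("-", 1)
    return int(head), relation


# Illustrative data only; not taken from any dataset.
sample = "He B-NP\nreckons B-VP\n\nThe B-NP\ndeficit I-NP\n"
sentences = list(read_tagged_sentences(sample.splitlines()))
```

To validate a real file, pass an open file handle instead of `sample.splitlines()`, for example `read_tagged_sentences(open(path, encoding="utf-8"))`, where `path` points at one of the `train`, `dev`, or `test` files.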

	After all of the data has been downloaded, run `preprocessing.py`.

	## Training a Model
	Run `python cvt.py --mode=train --model_name=chunking_model`. By default, this trains a model on the chunking data downloaded with `fetch_data.sh`. To change the task(s) the model is trained on, or the model hyperparameters, modify [base/configure.py](base/configure.py). Models are automatically checkpointed every 1000 steps; if training is interrupted and restarted, it continues from the latest checkpoint. Model checkpoints and other data, such as dev set accuracy over time, are stored in `data/models/<model_name>`.

	## Evaluating a Model
	Run `python cvt.py --mode=eval --model_name=chunking_model`. A CVT model trained on the chunking data for 200k steps should get at least 97.1 F1 on the dev set and 96.6 F1 on the test set.

	## Citation
	If you use this code for your publication, please cite the original paper:
	```
	@inproceedings{clark2018semi,
	title = {Semi-Supervised Sequence Modeling with Cross-View Training},
	author = {Kevin Clark and Minh-Thang Luong and Christopher D. Manning and Quoc V. Le},
	booktitle = {EMNLP},
	year = {2018}
	}
	```

	## Contact
	* [Kevin Clark](https://cs.stanford.edu/~kevclark/) (@clarkkev).
	* [Thang Luong](https://nlp.stanford.edu/~lmthang/) (@lmthang).