![No Maintenance Intended](https://img.shields.io/badge/No%20Maintenance%20Intended-%E2%9C%95-red.svg)
![TensorFlow Requirement: 1.x](https://img.shields.io/badge/TensorFlow%20Requirement-1.x-brightgreen)
![TensorFlow 2 Not Supported](https://img.shields.io/badge/TensorFlow%202%20Not%20Supported-%E2%9C%95-red.svg)

# LexNET for Noun Compound Relation Classification

This is a [Tensorflow](http://www.tensorflow.org/) implementation of the LexNET
algorithm for classifying relationships, specifically applied to classifying
the relationships that hold between noun compounds:

* *olive oil* is oil that is *made from* olives
* *cooking oil* is oil that is *used for* cooking
* *motor oil* is oil that is *contained in* a motor

The model is a supervised classifier that predicts the relationship that holds
between the constituents of a two-word noun compound using:

1. A neural "paraphrase" of each syntactic dependency path that connects the
   constituents in a large corpus. For example, given a sentence like *This
   fine oil is made from first-press olives*, the dependency path is something
   like `oil <NSUBJPASS made PREP> from POBJ> olive`.
2. The distributional information provided by the individual words, i.e., the
   word embeddings of the two constituents.
3. The distributional signal provided by the compound itself, i.e., the
   embedding of the noun compound in context.

The model includes several variants: the *path-based model* uses (1) alone, the
*distributional model* uses (2) alone, and the *integrated model* uses (1) and
(2). The *distributional-nc model* and the *integrated-nc model* each add (3).

Training a model requires the following:

1. A collection of noun compounds that have been labeled using a *relation
   inventory*. The inventory describes the specific relationships that you'd
   like the model to differentiate (e.g., *part of* versus *composed of* versus
   *purpose*), and generally may consist of tens of classes. You can download
   the dataset used in the paper from
   [here](https://vered1986.github.io/papers/Tratz2011_Dataset.tar.gz).
2. A collection of word embeddings: the path-based model uses the word
   embeddings as part of the path representation, and the distributional models
   use the word embeddings directly as prediction features.
3. The path-based model requires a collection of syntactic dependency parses
   that connect the constituents of each noun compound. To generate these,
   you'll need a corpus from which to extract this data; we used Wikipedia and
   the [LDC GigaWord5](https://catalog.ldc.upenn.edu/LDC2011T07) corpora.

# Contents

The following source code is included here:

* `learn_path_embeddings.py` is a script that trains and evaluates a path-based
  model to predict a noun-compound relationship given labeled noun compounds
  and dependency parse paths.
* `learn_classifier.py` is a script that trains and evaluates a classifier
  based on any combination of paths, word embeddings, and noun-compound
  embeddings.
* `get_indicative_paths.py` is a script that generates the most indicative
  syntactic dependency paths for a particular relationship.

Also included are utilities for preparing data for training:

* `text_embeddings_to_binary.py` converts a text file containing word
  embeddings into a binary file that is quicker to load (a rough sketch of this
  conversion appears after this list).
* `extract_paths.py` finds all the dependency paths that connect words in a
  corpus.
* `sorted_paths_to_examples.py` processes the output of `extract_paths.py` to
  produce summarized training data.
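To make the expected formats concrete, here is a minimal sketch of such a
conversion, assuming text-format embeddings with one word per line followed by
its vector components (as in the GloVe files used below). This is only an
illustration; the actual `text_embeddings_to_binary.py` utility is the
implementation to use, and its exact flags and output layout are defined by the
script itself.

```python
# Minimal sketch of converting text-format embeddings into a vocabulary file
# plus a NumPy matrix. Illustration only; use text_embeddings_to_binary.py for
# real runs.
import numpy as np

words, vectors = [], []
with open('glove.6B.300d.txt', encoding='utf-8') as f:
    for line in f:
        parts = line.rstrip().split(' ')
        words.append(parts[0])
        vectors.append([float(x) for x in parts[1:]])

with open('vocab.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(words) + '\n')

# Rows of the matrix line up with the words in vocab.txt.
np.save('glove.6B.300d.npy', np.asarray(vectors, dtype=np.float32))
```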
This code (in particular, the utilities used to prepare the data) differs from
the code that was used to prepare data for the paper. Notably, we used a
proprietary dependency parser instead of spaCy, which is used here.

# Dependencies

* [TensorFlow](http://www.tensorflow.org/): see detailed installation
  instructions at that site.
* [SciKit Learn](http://scikit-learn.org/): you can probably just install this
  with `pip install sklearn`.
* [SpaCy](https://spacy.io/): `pip install spacy` ought to do the trick, along
  with the English model.

# Creating the Model

This section describes the steps necessary to create and evaluate the model
described in the paper.

## Generate Path Data

To begin, you need three text files:

1. **Corpus**. This file should contain natural language sentences, written
   with one sentence per line. For purposes of exposition, we'll assume that
   you have English Wikipedia serialized this way in `${HOME}/data/wiki.txt`.
2. **Labeled Noun Compound Pairs**. This file contains (modifier, head, label)
   tuples, tab-separated, with one per line. The *label* represents the
   relationship between the head and the modifier; e.g., if `purpose` is one of
   your labels, you might include the tab-separated line `tooth paste purpose`.
3. **Word Embeddings**. We used the
   [GloVe](https://nlp.stanford.edu/projects/glove/) word embeddings; in
   particular, the 6B-token, 300d variant. We'll assume you have this file as
   `${HOME}/data/glove.6B.300d.txt`.

We first process the embeddings from their text format into something that we
can load a little more quickly:

    ./text_embeddings_to_binary.py \
      --input ${HOME}/data/glove.6B.300d.txt \
      --output_vocab ${HOME}/data/vocab.txt \
      --output_npy ${HOME}/data/glove.6B.300d.npy

Next, we'll extract all the dependency parse paths connecting our labeled pairs
from the corpus. This process takes a *looooong* time, but is trivially
parallelized using map-reduce if you have access to that technology.

    ./extract_paths.py \
      --corpus ${HOME}/data/wiki.txt \
      --labeled_pairs ${HOME}/data/labeled-pairs.tsv \
      --output ${HOME}/data/paths.tsv

The file it produces (`paths.tsv`) is a tab-separated file that contains the
modifier, the head, the label, the encoded path, and the sentence from which
the path was drawn. (This last is mostly for sanity checking.) A sample row
might look something like this (where the newlines would actually be tab
characters):

    navy
    captain
    owner_emp_use
    /PROPN/dobj/>::enter/VERB/ROOT/^::follow/VERB/advcl/<::in/ADP/prep/<::footstep/NOUN/pobj/<::of/ADP/prep/<::father/NOUN/pobj/<::bover/PROPN/appos/<::/PROPN/compound/<
    He entered the Royal Navy following in the footsteps of his father Captain John Bover and two of his elder brothers as volunteer aboard HMS Perseus

This file must be sorted as follows:

    sort -k1,3 -t$'\t' paths.tsv > sorted.paths.tsv

In particular, rows with the same modifier, head, and label must appear
contiguously.

We next create a file that contains all the relation labels from our original
labeled pairs:

    awk 'BEGIN {FS="\t"} {print $3}' < ${HOME}/data/labeled-pairs.tsv \
      | sort -u > ${HOME}/data/relations.txt

With these in hand, we're ready to produce the train, validation, and test
data:

    ./sorted_paths_to_examples.py \
      --input ${HOME}/data/sorted.paths.tsv \
      --vocab ${HOME}/data/vocab.txt \
      --relations ${HOME}/data/relations.txt \
      --splits ${HOME}/data/splits.txt \
      --output_dir ${HOME}/data

Here, `splits.txt` is a file that indicates which "split" (train, test, or
validation) you want each pair to appear in. It should be a tab-separated file
which contains the modifier, the head, and the dataset (`train`, `test`, or
`val`) into which the pair should be placed; e.g.:

    tooth	paste	train
    banana	seat	test

The program will produce a separate file for each dataset split in the
directory specified by `--output_dir`. Each file contains `tf.train.Example`
protocol buffers encoded using the `TFRecord` file format.
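If you'd like to sanity-check the generated split files, a TensorFlow 1.x
snippet along these lines can be used to peek at a few records. It assumes the
files are GZIP-compressed TFRecords (as the `.tfrecs.gz` names used below
suggest); the feature names it prints are whatever `sorted_paths_to_examples.py`
writes.

```python
# Peek at a few records in a generated split file (TensorFlow 1.x API).
# Assumes GZIP-compressed TFRecords, as the .tfrecs.gz file names suggest.
import tensorflow as tf

options = tf.python_io.TFRecordOptions(tf.python_io.TFRecordCompressionType.GZIP)
path = 'train.tfrecs.gz'  # adjust to wherever --output_dir placed the files

for i, record in enumerate(tf.python_io.tf_record_iterator(path, options=options)):
    example = tf.train.Example()
    example.ParseFromString(record)
    # Print the feature names present in each example.
    print(sorted(example.features.feature.keys()))
    if i >= 2:
        break
```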
## Create Path Embeddings

Now we're ready to train the path embeddings using `learn_path_embeddings.py`:

    ./learn_path_embeddings.py \
      --train ${HOME}/data/train.tfrecs.gz \
      --val ${HOME}/data/val.tfrecs.gz \
      --test ${HOME}/data/test.tfrecs.gz \
      --embeddings ${HOME}/data/glove.6B.300d.npy \
      --relations ${HOME}/data/relations.txt \
      --output ${HOME}/data/path-embeddings \
      --logdir /tmp/learn_path_embeddings

The path embeddings will be placed at the location specified by `--output`.

## Train classifiers

Train classifiers and evaluate them on the validation and test data using the
`learn_classifier.py` script. This shell script fragment will iterate through
each dataset, split, corpus, and model type to train and evaluate classifiers.

    LOGDIR=/tmp/learn_classifier
    for DATASET in tratz/fine_grained tratz/coarse_grained ; do
      for SPLIT in random lexical_head lexical_mod lexical_full ; do
        for CORPUS in wiki_gigiawords ; do
          for MODEL in dist dist-nc path integrated integrated-nc ; do
            # Filename for the log that will contain the classifier results.
            LOGFILE=$(echo "${DATASET}.${SPLIT}.${CORPUS}.${MODEL}.log" | sed -e "s,/,.,g")
            python learn_classifier.py \
              --dataset_dir ~/lexnet/datasets \
              --dataset "${DATASET}" \
              --corpus "${SPLIT}/${CORPUS}" \
              --embeddings_base_path ~/lexnet/embeddings \
              --logdir ${LOGDIR} \
              --input "${MODEL}" > "${LOGDIR}/${LOGFILE}"
          done
        done
      done
    done

Each log file will contain the final performance (precision, recall, F1) on the
train, dev, and test sets, and will include a confusion matrix for each.

# Contact

If you have any questions, issues, or suggestions, feel free to contact either
@vered1986 or @waterson.

If you use this code for any published research, please include the following
citation:

Olive Oil Is Made of Olives, Baby Oil Is Made for Babies: Interpreting Noun
Compounds Using Paraphrases in a Neural Model. Vered Shwartz and Chris
Waterson. NAACL 2018. [link](https://arxiv.org/pdf/1803.08073.pdf).