PWESuite-metric_learner
This is a phonetic word embedding model from PWESuite, as described in PWESuite: Phonetic Word Embeddings and Tasks They Facilitate. The metric learner is trained to mimic distances in the embedding space that correspond to Panphon's phonetic distances. The input representation is either orthography (token_ort), IPA (token_ipa), or Panphon pronunciation feature vectors (panphon), which yields three models. All models have been trained on all languages jointly.
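As a rough sketch of the metric-learning objective (illustrative only; the particular distance and loss shown here are assumptions, not necessarily what the repository implements): the model embeds pairs of words and is penalized when the distance between the two embeddings deviates from their Panphon phonetic distance.
import torch.nn.functional as F

def metric_learning_loss(embd_a, embd_b, panphon_dist):
    # embd_a, embd_b: (batch, 300) embeddings of two batches of words
    # panphon_dist:   (batch,)     precomputed Panphon phonetic distances
    pred_dist = 1 - F.cosine_similarity(embd_a, embd_b, dim=-1)
    return F.mse_loss(pred_dist, panphon_dist)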
Instructions
To use any of the three metric learner models, first set up the repository and download the checkpoints:
git clone https://github.com/zouharvi/pwesuite.git
cd pwesuite
mkdir -p computed/models
pip3 install -e .
# download the three models
wget https://huggingface.co/zouharvi/PWESuite-metric_learner/resolve/main/rnn_metric_learning_token_ort_all.pt -O computed/models/rnn_metric_learning_token_ort_all.pt
wget https://huggingface.co/zouharvi/PWESuite-metric_learner/resolve/main/rnn_metric_learning_token_ipa_all.pt -O computed/models/rnn_metric_learning_token_ipa_all.pt
wget https://huggingface.co/zouharvi/PWESuite-metric_learner/resolve/main/rnn_metric_learning_panphon_all.pt -O computed/models/rnn_metric_learning_panphon_all.pt
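Alternatively (optional; this assumes the huggingface_hub Python package is installed), the same checkpoints can be downloaded programmatically:
from huggingface_hub import hf_hub_download

# fetch all three checkpoints into computed/models/
for name in [
    "rnn_metric_learning_token_ort_all.pt",
    "rnn_metric_learning_token_ipa_all.pt",
    "rnn_metric_learning_panphon_all.pt",
]:
    hf_hub_download(
        repo_id="zouharvi/PWESuite-metric_learner",
        filename=name,
        local_dir="computed/models",
    )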
Then, in Python, you can run this example script:
from models.metric_learning.model import RNNMetricLearner
from models.metric_learning.preprocessor import preprocess_dataset_foreign
from main.utils import load_multi_data
import torch
import tqdm
import math
# load the multilingual word list used by PWESuite
data = load_multi_data(purpose_key="all")
# turn the first 10 entries into IPA token features
data = preprocess_dataset_foreign(data[:10], features="token_ipa")

model = RNNMetricLearner(
    dimension=300,
    feature_size=data[0][0].shape[1],
)
model.load_state_dict(torch.load("computed/models/rnn_metric_learning_token_ipa_all.pt"))

# batch the data for faster inference
BATCH_SIZE = 32
data_out = []
for i in tqdm.tqdm(range(math.ceil(len(data) / BATCH_SIZE))):
    batch = [f for f, _ in data[i * BATCH_SIZE:(i + 1) * BATCH_SIZE]]
    data_out += list(
        model.forward(batch).detach().cpu().numpy()
    )

# every word is embedded into a 300-dimensional vector
assert len(data) == len(data_out)
assert all(len(x) == 300 for x in data_out)
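As a quick sanity check (an illustrative addition, not part of the original example), the resulting vectors can be compared directly, e.g. with cosine similarity:
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# similarity between the first two embedded words from the example above
print(cosine_similarity(data_out[0], data_out[1]))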
You can also run inference on all the data and evaluate it:
mkdir -p computed/embd/
python3 ./models/metric_learning/apply.py -l all -mp computed/models/rnn_metric_learning_token_ipa_all.pt -o computed/embd/rnn_metric_learning_token_ipa_all.pkl --features token_ipa
python3 ./suite_evaluation/eval_all.py --embd computed/embd/rnn_metric_learning_token_ipa_all.pkl
This gives an output like:
human_similarity: 0.6054
correlation: 0.8995
retrieval: 0.9158
analogy: 0.1128
rhyme: 0.6375
cognate: 0.6513
JSON1!{"human_similarity": 0.6053864496119294, "correlation": 0.8995336394813026, "retrieval": 0.9157905555555556, "analogy": 0.1127777777777778, "rhyme": 0.6374601910828025, "cognate": 0.6512651265126512, "overall": 0.6370356233370033}
Score (overall): 0.6370
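To inspect the stored embeddings yourself (a minimal sketch; the exact structure of the pickle is defined by apply.py):
import pickle

with open("computed/embd/rnn_metric_learning_token_ipa_all.pkl", "rb") as f:
    embd = pickle.load(f)
# report what was stored (e.g., the number of entries)
print(type(embd), len(embd))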
Training
Training this model takes about an hour on a mid-tier GPU. See scripts/03-train_metric_learning.sh for the specific training command. Further description TODO.
Other
Cite as:
@inproceedings{zouhar-etal-2024-pwesuite,
title = "{PWES}uite: Phonetic Word Embeddings and Tasks They Facilitate",
author = "Zouhar, Vil{\'e}m and
Chang, Kalvin and
Cui, Chenxuan and
Carlson, Nate B. and
Robinson, Nathaniel Romney and
Sachan, Mrinmaya and
Mortensen, David R.",
booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
month = may,
year = "2024",
address = "Torino, Italia",
publisher = "ELRA and ICCL",
url = "https://aclanthology.org/2024.lrec-main.1168/",
pages = "13344--13355",
}
Available also on arXiv: https://arxiv.org/abs/2304.02541