PWESuite-metric_learner
This is a phonetic word embedding model from PWESuite, as described in PWESuite: Phonetic Word Embeddings and Tasks They Facilitate. The metric learner is trained to mimic distances in the embedding space that correspond to Panphon's phonetic distances. The input representation is either orthography (token_ort), IPA (token_ipa), or Panphon pronunciation feature vectors (panphon), which yields three models. All models have been trained on all languages jointly.
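As a rough sketch of the metric-learning objective (illustrative only; the particular distance and loss shown here are assumptions, not necessarily what the repository implements): the model embeds pairs of words and is penalized when the distance between the two embeddings deviates from their Panphon phonetic distance.
import torch.nn.functional as F

def metric_learning_loss(embd_a, embd_b, panphon_dist):
    # embd_a, embd_b: (batch, 300) embeddings of two batches of words
    # panphon_dist:   (batch,)     precomputed Panphon phonetic distances
    pred_dist = 1 - F.cosine_similarity(embd_a, embd_b, dim=-1)
    return F.mse_loss(pred_dist, panphon_dist)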
Instructions
To use any of the three metric learner models, first set up the repository and download the checkpoints:
git clone https://github.com/zouharvi/pwesuite.git
cd pwesuite
mkdir -p computed/models
pip3 install -e .
# download the three models
wget https://huggingface.co/zouharvi/PWESuite-metric_learner/resolve/main/rnn_metric_learning_token_ort_all.pt -O computed/models/rnn_metric_learning_token_ort_all.pt
wget https://huggingface.co/zouharvi/PWESuite-metric_learner/resolve/main/rnn_metric_learning_token_ipa_all.pt -O computed/models/rnn_metric_learning_token_ipa_all.pt
wget https://huggingface.co/zouharvi/PWESuite-metric_learner/resolve/main/rnn_metric_learning_panphon_all.pt -O computed/models/rnn_metric_learning_panphon_all.pt
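Alternatively (optional; this assumes the huggingface_hub Python package is installed), the same checkpoints can be downloaded programmatically:
from huggingface_hub import hf_hub_download

# fetch all three checkpoints into computed/models/
for name in [
    "rnn_metric_learning_token_ort_all.pt",
    "rnn_metric_learning_token_ipa_all.pt",
    "rnn_metric_learning_panphon_all.pt",
]:
    hf_hub_download(
        repo_id="zouharvi/PWESuite-metric_learner",
        filename=name,
        local_dir="computed/models",
    )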
Then, in Python, you can run this example script:
from models.metric_learning.model import RNNMetricLearner
from models.metric_learning.preprocessor import preprocess_dataset_foreign
from main.utils import load_multi_data
import torch
import tqdm
import math
# load the multilingual word list used by PWESuite
data = load_multi_data(purpose_key="all")
# turn the first 10 entries into IPA token features
data = preprocess_dataset_foreign(data[:10], features="token_ipa")

model = RNNMetricLearner(
    dimension=300,
    feature_size=data[0][0].shape[1],
)
model.load_state_dict(torch.load("computed/models/rnn_metric_learning_token_ipa_all.pt"))

# batch the data for faster inference
BATCH_SIZE = 32
data_out = []
for i in tqdm.tqdm(range(math.ceil(len(data) / BATCH_SIZE))):
    batch = [f for f, _ in data[i * BATCH_SIZE:(i + 1) * BATCH_SIZE]]
    data_out += list(
        model.forward(batch).detach().cpu().numpy()
    )

# every word is embedded into a 300-dimensional vector
assert len(data) == len(data_out)
assert all(len(x) == 300 for x in data_out)
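As a quick sanity check (an illustrative addition, not part of the original example), the resulting vectors can be compared directly, e.g. with cosine similarity:
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# similarity between the first two embedded words from the example above
print(cosine_similarity(data_out[0], data_out[1]))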
You can also run inference on all the data and evaluate it:
mkdir -p computed/embd/
python3 ./models/metric_learning/apply.py -l all -mp computed/models/rnn_metric_learning_token_ipa_all.pt -o computed/embd/rnn_metric_learning_token_ipa_all.pkl --features token_ipa
python3 ./suite_evaluation/eval_all.py --embd computed/embd/rnn_metric_learning_token_ipa_all.pkl
This gives an output like:
human_similarity: 0.6054
correlation: 0.8995
retrieval: 0.9158
analogy: 0.1128
rhyme: 0.6375
cognate: 0.6513
JSON1!{"human_similarity": 0.6053864496119294, "correlation": 0.8995336394813026, "retrieval": 0.9157905555555556, "analogy": 0.1127777777777778, "rhyme": 0.6374601910828025, "cognate": 0.6512651265126512, "overall": 0.6370356233370033}
Score (overall): 0.6370
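To inspect the stored embeddings yourself (a minimal sketch; the exact structure of the pickle is defined by apply.py):
import pickle

with open("computed/embd/rnn_metric_learning_token_ipa_all.pkl", "rb") as f:
    embd = pickle.load(f)
# report what was stored (e.g., the number of entries)
print(type(embd), len(embd))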
Training
Training this model takes about an hour on a mid-tier GPU. See scripts/03-train_metric_learning.sh for the specific training command. Further description TODO.
Other
Cite as:
@inproceedings{zouhar-etal-2024-pwesuite,
title = "{PWES}uite: Phonetic Word Embeddings and Tasks They Facilitate",
author = "Zouhar, Vil{\'e}m and
Chang, Kalvin and
Cui, Chenxuan and
Carlson, Nate B. and
Robinson, Nathaniel Romney and
Sachan, Mrinmaya and
Mortensen, David R.",
booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
month = may,
year = "2024",
address = "Torino, Italia",
publisher = "ELRA and ICCL",
url = "https://aclanthology.org/2024.lrec-main.1168/",
pages = "13344--13355",
}
Available also on arXiv: https://arxiv.org/abs/2304.02541