|
--- |
|
license: apache-2.0 |
|
datasets: |
|
- mozilla-foundation/common_voice_10_0 |
|
base_model: |
|
- facebook/wav2vec2-xls-r-300m |
|
tags: |
|
- pytorch |
|
- phoneme-recognition |
|
pipeline_tag: automatic-speech-recognition |
|
metrics: |
|
- per |
|
- aer |
|
library_name: allophant |
|
language: |
|
- bn |
|
- ca |
|
- cs |
|
- cv |
|
- da |
|
- de |
|
- el |
|
- en |
|
- es |
|
- et |
|
- eu |
|
- fi |
|
- fr |
|
- ga |
|
- hi |
|
- hu |
|
- id |
|
- it |
|
- ka |
|
- ky |
|
- lt |
|
- mt |
|
- nl |
|
- pl |
|
- pt |
|
- ro |
|
- ru |
|
- sk |
|
- sl |
|
- sv |
|
- sw |
|
- ta |
|
- tr |
|
- uk |
|
--- |
|
|
|
Model Information |
|
================= |
|
|
|
Allophant is a multilingual phoneme recognizer trained on spoken sentences in 34 languages, capable of generalizing zero-shot to unseen phoneme inventories. |
|
|
|
The model is based on [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) and was fine-tuned on a subset of the [Common Voice Corpus 10.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_10_0) that was phonemically transcribed with [eSpeak NG](https://github.com/espeak-ng/espeak-ng).
|
|
|
| Model Name | UCLA Phonetic Corpus (PER) | UCLA Phonetic Corpus (AER) | Common Voice (PER) | Common Voice (AER) | |
|
| ---------------- | ---------: | ---------: | -------: | -------: | |
|
| [Multitask](https://huggingface.co/kgnlp/allophant) | **45.62%** | 19.44% | **34.34%** | **8.36%** | |
|
| [Hierarchical](https://huggingface.co/kgnlp/allophant-hierarchical) | 46.09% | **19.18%** | 34.35% | 8.56% | |
|
| [Multitask Shared](https://huggingface.co/kgnlp/allophant-shared) | 46.05% | 19.52% | 41.20% | 8.88% | |
|
| [Baseline Shared](https://huggingface.co/kgnlp/allophant-baseline-shared) | 48.25% | - | 45.35% | - | |
|
| **Baseline** | 57.01% | - | 46.95% | - | |
|
|
|
Note that our baseline models were trained without phonetic feature classifiers and therefore only support phoneme recognition. |
|
|
|
Usage |
|
===== |
|
|
|
Install the [`allophant`](https://github.com/kgnlp/allophant) package: |
|
|
|
```bash |
|
pip install allophant |
|
``` |
|
|
|
A pre-trained model can be loaded from a Hugging Face checkpoint or a local file:
|
|
|
```python |
|
from allophant.estimator import Estimator |
|
|
|
device = "cpu" |
|
model, attribute_indexer = Estimator.restore("kgnlp/allophant-baseline", device=device) |
|
supported_features = attribute_indexer.feature_names |
|
# The phonetic feature categories supported by the model, including "phonemes" |
|
print(supported_features) |
|
``` |
|
Allophant supports decoding custom phoneme inventories, which can be constructed in multiple ways: |
|
|
|
```python |
|
# 1. For a single language: |
|
inventory = attribute_indexer.phoneme_inventory("es") |
|
# 2. For multiple languages, e.g. in code-switching scenarios |
|
inventory = attribute_indexer.phoneme_inventory(["es", "it"]) |
|
# 3. Any custom selection of phones for which features are available in the Allophoible database |
|
inventory = ['a', 'ai̯', 'au̯', 'b', 'e', 'eu̯', 'f', 'ɡ', 'l', 'ʎ', 'm', 'ɲ', 'o', 'p', 'ɾ', 's', 't̠ʃ'] |
|
```
|
|
|
Audio files can then be loaded, resampled and transcribed using the given |
|
inventory by first computing the log probabilities for each classifier: |
|
|
|
```python |
|
import torch |
|
import torchaudio |
|
from allophant.dataset_processing import Batch |
|
|
|
# Load an audio file and resample the first channel to the sample rate used by the model |
|
audio, sample_rate = torchaudio.load("utterance.wav") |
|
audio = torchaudio.functional.resample(audio[:1], sample_rate, model.sample_rate) |
|
|
|
# Construct a batch of 0-padded single channel audio, lengths and language IDs |
|
# Language ID can be 0 for inference |
|
batch = Batch(audio, torch.tensor([audio.shape[1]]), torch.zeros(1)) |
|
model_outputs = model.predict( |
|
batch.to(device), |
|
attribute_indexer.composition_feature_matrix(inventory).to(device) |
|
) |
|
``` |
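
Multiple recordings can be transcribed in a single batch by 0-padding them to the length of the longest utterance. The following is a minimal sketch under the assumption that `Batch` accepts a 0-padded `(batch, samples)` tensor with per-utterance lengths, as in the single-file example above; the file names are placeholders:

```python
from torch.nn.utils.rnn import pad_sequence

# Placeholder file names used only for illustration
paths = ["utterance_1.wav", "utterance_2.wav"]

waveforms = []
for path in paths:
    audio, sample_rate = torchaudio.load(path)
    # Resample the first channel to the model's sample rate and drop the channel dimension
    waveforms.append(torchaudio.functional.resample(audio[0], sample_rate, model.sample_rate))

# 0-pad to the longest utterance and keep the original lengths for the model
lengths = torch.tensor([waveform.shape[0] for waveform in waveforms])
padded_audio = pad_sequence(waveforms, batch_first=True)

batch = Batch(padded_audio, lengths, torch.zeros(len(paths)))
model_outputs = model.predict(
    batch.to(device),
    attribute_indexer.composition_feature_matrix(inventory).to(device),
)
```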
|
|
|
Finally, the log probabilities can be decoded into the recognized phonemes or phonetic features: |
|
|
|
```python |
|
from allophant import predictions |
|
|
|
# Create a feature mapping for your inventory and CTC decoders for the desired feature set |
|
inventory_indexer = attribute_indexer.attributes.subset(inventory) |
|
ctc_decoders = predictions.feature_decoders(inventory_indexer, feature_names=supported_features) |
|
|
|
for feature_name, decoder in ctc_decoders.items(): |
|
decoded = decoder(model_outputs.outputs[feature_name].transpose(1, 0), model_outputs.lengths) |
|
# Print the feature name and values for each utterance in the batch |
|
for [hypothesis] in decoded: |
|
# NOTE: token indices are offset by one due to the <BLANK> token used during decoding |
|
recognized = inventory_indexer.feature_values(feature_name, hypothesis.tokens - 1) |
|
print(feature_name, recognized) |
|
``` |
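
If only a phoneme transcription is needed, the decoder for the `phonemes` category (listed in `supported_features` above) can be used on its own. A minimal sketch, assuming `feature_values` returns a sequence of phoneme strings as in the loop above:

```python
# Decode only the phoneme classifier and join each hypothesis into a single transcription
phoneme_decoder = ctc_decoders["phonemes"]
decoded = phoneme_decoder(model_outputs.outputs["phonemes"].transpose(1, 0), model_outputs.lengths)

for [hypothesis] in decoded:
    # Token indices are offset by one due to the <BLANK> token used during decoding
    phonemes = inventory_indexer.feature_values("phonemes", hypothesis.tokens - 1)
    print(" ".join(phonemes))
```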
|
|
|
Citation |
|
======== |
|
|
|
```bibtex |
|
@inproceedings{glocker2023allophant,
  title={Allophant: Cross-lingual Phoneme Recognition with Articulatory Attributes},
  author={Glocker, Kevin and Herygers, Aaricia and Georges, Munir},
  year={2023},
  booktitle={{Proc. Interspeech 2023}},
  month={8}
}
|
``` |
|
|
|
[arXiv:2306.04306](https://arxiv.org/abs/2306.04306)