---
language:
- en
license: apache-2.0
tags:
- automatic-speech-recognition
---

This repository contains a number of experiments for the [PSST Challenge](https://psst.study/).

As the test set is unavailable, all numbers are based on the validation set.

The models in the tables below were finetuned from [Wav2vec 2.0 Base, No finetuning](https://github.com/pytorch/fairseq/tree/main/examples/wav2vec).

Our best-performing model overall (**FER** 9\.2%, **PER** 21\.0%) was based on [Wav2vec 2.0 Large, No finetuning](https://github.com/pytorch/fairseq/tree/main/examples/wav2vec) (git tag: `larger-rir`). It was trained with the TIMIT subset augmented with Room Impulse Response (RIR), the augmentation that gave the best results in the base-model experiments below.
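For readers unfamiliar with the metric, the sketch below shows one common way to compute a phoneme error rate (PER): the total edit distance between reference and predicted phoneme sequences, divided by the total number of reference phonemes. This is only an illustration under that assumption; the function names are hypothetical, and the numbers reported here come from the PSST Challenge's own scoring scripts, which may differ in detail. FER is not covered by this sketch.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of phoneme symbols."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_error_rate(references, hypotheses):
    """Corpus-level PER over space-separated phoneme strings (illustrative)."""
    errors = sum(edit_distance(r.split(), h.split())
                 for r, h in zip(references, hypotheses))
    total = sum(len(r.split()) for r in references)
    return errors / total


# One deletion plus one substitution over seven reference phonemes.
per = phoneme_error_rate(["HH AH L OW", "K AE T"], ["HH AH L", "K AH T"])
print(f"PER: {per:.1%}")
```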

## Augmented TIMIT subset

Using a subset of TIMIT that could be mapped easily to the phoneset used by the PSST Challenge data (a list of IDs is in the repository), we experimented with augmenting the data to better match the PSST data.

The best results were obtained using Room Impulse Response (git tag: `rir`).

| **Augmentation**                                 | **FER**   | **PER**    | **Git tag**                         |
| :----------------------------------------------- | :-------- | :--------- | :---------------------------------- |
| unaugmented                                      | 10\.2%    | 22\.5%     | huggingface-unaugmented |
| Gaussian noise                                   | 10\.0%    | 22\.1%     | gaussian                            |
| Pitchshift                                       | 9\.6%     | 22\.9%     | pitchshift                          |
| RIR                                              | **9\.6%** | **21\.8%** | rir                                 |
| Time stretch                                     | 10\.1%    | 22\.8%     | timestretch                         |
| Gaussian noise + RIR                             | 10\.0%    | 23\.4%     | gaussian-rir                        |
| Pitchshift + Gaussian noise                      | 9\.9%     | 22\.9%     | pitchshift-gaussian                 |
| Pitchshift + RIR                                 | 9\.9%     | 22\.8%     | pitchshift-rir                      |
| Time stretch + Gaussian noise                    | 10\.2%    | 22\.8%     | timestretch-gaussian                |
| Time stretch + Pitchshift                        | 9\.8%     | 22\.0%     | timestretch-pitchshift              |
| Time stretch + RIR                               | 9\.7%     | 22\.2%     | timestretch-rir                     |
| Pitchshift + Gaussian noise + RIR                | 10\.1%    | 23\.5%     | pitchshift-gaussian-rir             |
| Time stretch + Gaussian noise + RIR              | 9\.7%     | 22\.3%     | timestretch-gaussian-rir            |
| Time stretch + Pitchshift + Gaussian noise       | 10\.2%    | 22\.9%     | timestretch-pitchshift-gaussian     |
| Time stretch + Pitchshift + RIR                  | 10\.2%    | 22\.5%     | timestretch-pitchshift-rir          |
| Time stretch + Pitchshift + Gaussian noise + RIR | 10\.9%    | 24\.1%     | timestretch-pitchshift-gaussian-rir |
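As a rough illustration of what the RIR augmentation in the table does, the sketch below convolves a mono waveform with a room impulse response using NumPy. It is not the code used for these experiments: the function name `apply_rir`, the trimming to the original length, and the peak rescaling are all assumptions made for the example.

```python
import numpy as np


def apply_rir(signal, rir):
    """Simulate room reverberation by convolving a mono signal with an RIR.

    Both inputs are 1-D sequences of samples at the same sampling rate.
    """
    signal = np.asarray(signal, dtype=np.float64)
    rir = np.asarray(rir, dtype=np.float64)
    # Linear convolution, trimmed so the output keeps the input's length.
    wet = np.convolve(signal, rir, mode="full")[: len(signal)]
    # Rescale to the original peak level (an assumption, to avoid clipping).
    peak = np.max(np.abs(wet))
    if peak > 0:
        wet = wet * (np.max(np.abs(signal)) / peak)
    return wet


# Toy example: a unit impulse followed by silence, and a two-tap "room".
dry = [1.0, 0.0, 0.0, 0.0]
room = [1.0, 0.5]
print(apply_rir(dry, room))
```

In practice the impulse response would be loaded from a recorded or simulated RIR file rather than written by hand.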


## LM experiments

We experimented with a number of language model configurations, combining the data from the PSST challenge, the subset of TIMIT we used, and CMUdict.

We tried combining CMUdict data in a number of ways: unmodified, with a silence token added at the start of the pronunciation, at the end, and at both the start and the end.
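The four CMUdict variants can be sketched as a small helper that pads a pronunciation with a silence token. The mode names mirror the suffixes of the git tags in the table below (`nosil`, `sils`, `sile`, `silb`); the token spelling `<sil>` and the function name are assumptions made for illustration, not taken from the repository.

```python
SIL = "<sil>"  # hypothetical silence token; the real symbol may differ


def add_silence(phones, mode):
    """Return a CMUdict pronunciation padded with silence according to mode.

    mode: "nosil" (unmodified), "sils" (start), "sile" (end), "silb" (both).
    """
    phones = list(phones)
    if mode == "nosil":
        return phones
    if mode == "sils":
        return [SIL] + phones
    if mode == "sile":
        return phones + [SIL]
    if mode == "silb":
        return [SIL] + phones + [SIL]
    raise ValueError(f"unknown mode: {mode}")


# Each padded pronunciation becomes one line of LM training text.
for mode in ("nosil", "sils", "sile", "silb"):
    print(mode, " ".join(add_silence(["HH", "AH", "L", "OW"], mode)))
```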

The best result came from a 5-gram model trained on PSST and TIMIT without silences, with a silence token added at the end of each CMUdict pronunciation (git tag: `lm-nosil-cmudict-sile.5`).

Evaluation was performed using scripts provided by the PSST Challenge's organisers, so there are no scripts in place to automatically use the LM with the transformers library.

| **Configuration**              | **n-gram** | **FER**    | **PER**    | **Git tag** |
| :----------------------------- | :--------- | :--------- | :--------- | :--------- |
| Baseline + TIMIT               | ---          | **10\.2%** | 22\.5%     | huggingface-unaugmented |
| All silences                   | 4          | 10\.5%     | 23\.0%     | lm-allsil.4 |
|                                | 5          | 10\.5%     | 22\.6%     | lm-allsil.5 |
|                                | 6          | 10\.3%     | 22\.3%     | lm-allsil.6 |
| No silences                    | 4          | 10\.3%     | 22\.6%     | lm-nosil.4  |
|                                | 5          | **10\.2%** | 22\.2%     | lm-nosil.5  |
|                                | 6          | **10\.2%** | 22\.4%     | lm-nosil.6  |
| PSST and TIMIT without silence |            |            |            |       |
| Unmodified CMUdict             | 4          | 10\.3%     | 22\.6%     | lm-nosil-cmudict-nosil.4 |
|                                | 5          | **10\.2%** | 22\.2%     | lm-nosil-cmudict-nosil.5 |
|                                | 6          | **10\.2%** | 22\.4%     | lm-nosil-cmudict-nosil.6 |
| CMUdict-end                    | 4          | 10\.3%     | 22\.6%     | lm-nosil-cmudict-sile.4 |
|                                | 5          | **10\.2%** | **22\.1%** | lm-nosil-cmudict-sile.5  |
|                                | 6          | **10\.2%** | 22\.3%     | lm-nosil-cmudict-sile.6 |
| CMUdict-start                  | 4          | 10\.4%     | 22\.6%     | lm-nosil-cmudict-sils.4 |
|                                | 5          | 10\.3%     | 22\.4%     | lm-nosil-cmudict-sils.5 |
|                                | 6          | 10\.3%     | 22\.3%     | lm-nosil-cmudict-sils.6 |
| CMUdict-both                   | 4          | 10\.4%     | 22\.7%     | lm-nosil-cmudict-silb.4 |
|                                | 5          | 10\.4%     | 22\.3%     | lm-nosil-cmudict-silb.5 |
|                                | 6          | 10\.3%     | 22\.3%     | lm-nosil-cmudict-silb.6 |
| Unmodified PSST and TIMIT      |            |            |            | |
| Unmodified CMUdict             | 4          | 10\.3%     | 22\.8%     | lm-orig-cmudict-nosil.4 |
|                                | 5          | 10\.3%     | 22\.4%     | lm-orig-cmudict-nosil.5 |
|                                | 6          | **10\.2%** | 22\.4%     | lm-orig-cmudict-nosil.6 |
| CMUdict-end                    | 4          | 10\.3%     | 22\.7%     | lm-orig-cmudict-sile.4 |
|                                | 5          | **10\.2%** | 22\.2%     | lm-orig-cmudict-sile.5 |
|                                | 6          | **10\.2%** | 22\.3%     | lm-orig-cmudict-sile.6 |
| CMUdict-start                  | 4          | 10\.5%     | 22\.8%     | lm-orig-cmudict-sils.4 |
|                                | 5          | 10\.4%     | 22\.5%     | lm-orig-cmudict-sils.5 |
|                                | 6          | 10\.3%     | 22\.4%     | lm-orig-cmudict-sils.6 |
| CMUdict-both                   | 4          | 10\.5%     | 22\.8%     | lm-orig-cmudict-silb.4 |
|                                | 5          | 10\.4%     | 22\.4%     | lm-orig-cmudict-silb.5 |
|                                | 6          | 10\.4%     | 22\.4%     | lm-orig-cmudict-silb.6 |