jonatasgrosman committed on
Commit
14c7323
1 Parent(s): 29f1892

update evaluation

README.md CHANGED
@@ -1,19 +1,65 @@
  ---
  language: it
  datasets:
  - common_voice
  metrics:
  - wer
  - cer
  tags:
  - audio
  - automatic-speech-recognition
  - speech
  - xlsr-fine-tuning-week
- license: apache-2.0
  model-index:
  - name: XLSR Wav2Vec2 Italian by Jonatas Grosman
    results:
    - task:
        name: Speech Recognition
        type: automatic-speech-recognition
@@ -109,76 +155,14 @@ for i, predicted_sentence in enumerate(predicted_sentences):
  
  ## Evaluation
  
- The model can be evaluated as follows on the Italian test data of Common Voice.
-
- ```python
- import torch
- import re
- import warnings
- import librosa
- from datasets import load_dataset, load_metric
- from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
-
- LANG_ID = "it"
- MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-italian"
- DEVICE = "cuda"
-
- CHARS_TO_IGNORE = [",", "?", "¿", ".", "!", "¡", ";", ";", ":", '""', "%", '"', "�", "ʿ", "·", "჻", "~", "՞",
-                    "؟", "،", "।", "॥", "«", "»", "„", "“", "”", "「", "」", "‘", "’", "《", "》", "(", ")", "[", "]",
-                    "{", "}", "=", "`", "_", "+", "<", ">", "…", "–", "°", "´", "ʾ", "‹", "›", "©", "®", "—", "→", "。",
-                    "、", "﹂", "﹁", "‧", "~", "﹏", ",", "{", "}", "(", ")", "[", "]", "【", "】", "‥", "〽",
-                    "『", "』", "〝", "〟", "⟨", "⟩", "〜", ":", "!", "?", "♪", "؛", "/", "\\", "º", "−", "^", "ʻ", "ˆ"]
-
- test_dataset = load_dataset("common_voice", LANG_ID, split="test")
-
- wer = load_metric("wer.py") # https://github.com/jonatasgrosman/wav2vec2-sprint/blob/main/wer.py
- cer = load_metric("cer.py") # https://github.com/jonatasgrosman/wav2vec2-sprint/blob/main/cer.py
  
- chars_to_ignore_regex = f"[{re.escape(''.join(CHARS_TO_IGNORE))}]"
-
- processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
- model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
- model.to(DEVICE)
-
- # Preprocessing the datasets.
- # We need to read the audio files as arrays
- def speech_file_to_array_fn(batch):
-     with warnings.catch_warnings():
-         warnings.simplefilter("ignore")
-         speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
-     batch["speech"] = speech_array
-     batch["sentence"] = re.sub(chars_to_ignore_regex, "", batch["sentence"]).upper()
-     return batch
-
- test_dataset = test_dataset.map(speech_file_to_array_fn)
-
- # Running inference.
- # We decode the greedy (argmax) predictions into strings
- def evaluate(batch):
-     inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
-
-     with torch.no_grad():
-         logits = model(inputs.input_values.to(DEVICE), attention_mask=inputs.attention_mask.to(DEVICE)).logits
-
-     pred_ids = torch.argmax(logits, dim=-1)
-     batch["pred_strings"] = processor.batch_decode(pred_ids)
-     return batch
-
- result = test_dataset.map(evaluate, batched=True, batch_size=8)
-
- predictions = [x.upper() for x in result["pred_strings"]]
- references = [x.upper() for x in result["sentence"]]
-
- print(f"WER: {wer.compute(predictions=predictions, references=references, chunk_size=1000) * 100}")
- print(f"CER: {cer.compute(predictions=predictions, references=references, chunk_size=1000) * 100}")
  ```
  
- **Test Result**:
-
- In the table below I report the Word Error Rate (WER) and the Character Error Rate (CER) of the model. I ran the evaluation script described above on other models as well (on 2021-04-21). Note that the table below may show different results from those already reported; this may be due to specifics of the other evaluation scripts used.
  
- | Model | WER | CER |
- | ------------- | ------------- | ------------- |
- | jonatasgrosman/wav2vec2-large-xlsr-53-italian | **9.36%** | **2.33%** |
- | joorock12/wav2vec2-large-xlsr-italian | 12.60% | 3.18% |
- | gchhablani/wav2vec2-large-xlsr-it | 12.99% | 3.11% |
- | facebook/wav2vec2-large-xlsr-53-italian | 22.08% | 6.36% |
 
  ---
  language: it
+ license: apache-2.0
  datasets:
  - common_voice
+ - mozilla-foundation/common_voice_6_0
  metrics:
  - wer
  - cer
  tags:
+ - it
  - audio
  - automatic-speech-recognition
  - speech
  - xlsr-fine-tuning-week
+ - robust-speech-event
+ - mozilla-foundation/common_voice_6_0
  model-index:
  - name: XLSR Wav2Vec2 Italian by Jonatas Grosman
    results:
+   - task:
+       name: Automatic Speech Recognition
+       type: automatic-speech-recognition
+     dataset:
+       name: Common Voice it
+       type: common_voice
+       args: it
+     metrics:
+     - name: Test WER
+       type: wer
+       value: 9.41
+     - name: Test CER
+       type: cer
+       value: 2.29
+     - name: Test WER (+LM)
+       type: wer
+       value: 6.91
+     - name: Test CER (+LM)
+       type: cer
+       value: 1.83
+   - task:
+       name: Automatic Speech Recognition
+       type: automatic-speech-recognition
+     dataset:
+       name: Robust Speech Event - Dev Data
+       type: speech-recognition-community-v2/dev_data
+       args: it
+     metrics:
+     - name: Test WER
+       type: wer
+       value: 21.78
+     - name: Test CER
+       type: cer
+       value: 7.94
+     - name: Test WER (+LM)
+       type: wer
+       value: 15.82
+     - name: Test CER (+LM)
+       type: cer
+       value: 6.83
+
+
    - task:
        name: Speech Recognition
        type: automatic-speech-recognition
 
  
  ## Evaluation
  
+ 1. To evaluate on `mozilla-foundation/common_voice_6_0` with split `test`
  
+ ```bash
+ python eval.py --model_id jonatasgrosman/wav2vec2-large-xlsr-53-italian --dataset mozilla-foundation/common_voice_6_0 --config it --split test
  ```
  
+ 2. To evaluate on `speech-recognition-community-v2/dev_data`
  
+ ```bash
+ python eval.py --model_id jonatasgrosman/wav2vec2-large-xlsr-53-italian --dataset speech-recognition-community-v2/dev_data --config it --split validation --chunk_length_s 5.0 --stride_length_s 1.0
+ ```
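
The second command passes `--chunk_length_s 5.0 --stride_length_s 1.0`, which `eval.py` forwards to the ASR pipeline so that long recordings are transcribed in overlapping windows. The sketch below is a simplified illustration of that windowing, not the library's code: `chunk_windows` is a hypothetical helper, the 16 kHz sampling rate is an assumption, and the pipeline's handling of the final chunk may differ between library versions.

```python
from typing import List, Tuple


def chunk_windows(n_samples: int, chunk_length_s: float, stride_length_s: float,
                  sampling_rate: int = 16_000) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of overlapping chunks (simplified)."""
    chunk_len = int(round(chunk_length_s * sampling_rate))
    stride = int(round(stride_length_s * sampling_rate))
    # Each chunk keeps `stride` samples of context on both sides, so
    # consecutive chunks advance by the chunk length minus two strides.
    step = chunk_len - 2 * stride
    if step <= 0:
        raise ValueError("chunk length must exceed twice the stride")
    return [(start, min(start + chunk_len, n_samples))
            for start in range(0, n_samples, step)]


# A hypothetical 10-second recording at 16 kHz.
windows = chunk_windows(n_samples=10 * 16_000, chunk_length_s=5.0, stride_length_s=1.0)
for start, end in windows:
    print(f"{start / 16_000:.1f}s -> {end / 16_000:.1f}s")
```

With these settings each window is 5 seconds long and starts 3 seconds after the previous one, so neighbouring windows share 2 seconds of audio.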
 
 
 
eval.py ADDED
@@ -0,0 +1,164 @@
+ #!/usr/bin/env python3
+ from datasets import load_dataset, load_metric, Audio, Dataset
+ from transformers import pipeline, AutoFeatureExtractor, AutoTokenizer, AutoConfig, AutoModelForCTC, Wav2Vec2Processor, Wav2Vec2ProcessorWithLM
+ import re
+ import torch
+ import argparse
+ from typing import Dict
+
+ def log_results(result: Dataset, args: argparse.Namespace):
+     """ DO NOT CHANGE. This function computes and logs the result metrics. """
+
+     log_outputs = args.log_outputs
+     dataset_id = "_".join(args.dataset.split("/") + [args.config, args.split])
+
+     # load metric
+     wer = load_metric("wer")
+     cer = load_metric("cer")
+
+     # compute metrics
+     wer_result = wer.compute(references=result["target"], predictions=result["prediction"])
+     cer_result = cer.compute(references=result["target"], predictions=result["prediction"])
+
+     # print & log results
+     result_str = (
+         f"WER: {wer_result}\n"
+         f"CER: {cer_result}"
+     )
+     print(result_str)
+
+     with open(f"{dataset_id}_eval_results.txt", "w") as f:
+         f.write(result_str)
+
+     # log all results in text file. Possibly interesting for analysis
+     if log_outputs:  # store_true flag: False unless --log_outputs is passed, never None
+         pred_file = f"log_{dataset_id}_predictions.txt"
+         target_file = f"log_{dataset_id}_targets.txt"
+
+         with open(pred_file, "w") as p, open(target_file, "w") as t:
+
+             # mapping function to write output
+             def write_to_file(batch, i):
+                 p.write(f"{i}" + "\n")
+                 p.write(batch["prediction"] + "\n")
+                 t.write(f"{i}" + "\n")
+                 t.write(batch["target"] + "\n")
+
+             result.map(write_to_file, with_indices=True)
+
+
+ def normalize_text(text: str, invalid_chars_regex: str, to_lower: bool) -> str:
+     """ DO ADAPT FOR YOUR USE CASE. this function normalizes the target text. """
+
+     text = text.lower() if to_lower else text.upper()
+
+     text = re.sub(invalid_chars_regex, " ", text)
+
+     text = re.sub(r"\s+", " ", text).strip()
+
+     return text
+
+
+ def main(args):
+     # load dataset
+     dataset = load_dataset(args.dataset, args.config, split=args.split, use_auth_token=True)
+
+     # for testing: only process the first ten examples as a test
+     # dataset = dataset.select(range(10))
+
+     # load processor
+     if args.greedy:
+         processor = Wav2Vec2Processor.from_pretrained(args.model_id)
+         decoder = None
+     else:
+         processor = Wav2Vec2ProcessorWithLM.from_pretrained(args.model_id)
+         decoder = processor.decoder
+
+     feature_extractor = processor.feature_extractor
+     tokenizer = processor.tokenizer
+
+     # resample audio
+     dataset = dataset.cast_column("audio", Audio(sampling_rate=feature_extractor.sampling_rate))
+
+     # load eval pipeline
+     if args.device is None:
+         args.device = 0 if torch.cuda.is_available() else -1
+
+     config = AutoConfig.from_pretrained(args.model_id)
+     model = AutoModelForCTC.from_pretrained(args.model_id)
+
+     #asr = pipeline("automatic-speech-recognition", model=args.model_id, device=args.device)
+     asr = pipeline("automatic-speech-recognition", config=config, model=model, tokenizer=tokenizer,
+                    feature_extractor=feature_extractor, decoder=decoder, device=args.device)
+
+     # build normalizer config
+     tokenizer = AutoTokenizer.from_pretrained(args.model_id)
+     tokens = [x for x in tokenizer.convert_ids_to_tokens(range(0, tokenizer.vocab_size))]
+     special_tokens = [
+         tokenizer.pad_token, tokenizer.word_delimiter_token,
+         tokenizer.unk_token, tokenizer.bos_token,
+         tokenizer.eos_token,
+     ]
+     non_special_tokens = [x for x in tokens if x not in special_tokens]
+     invalid_chars_regex = rf"[^\s{re.escape(''.join(set(non_special_tokens)))}]"
+     normalize_to_lower = False
+     for token in non_special_tokens:
+         if token.isalpha() and token.islower():
+             normalize_to_lower = True
+             break
+
+     # map function to decode audio
+     def map_to_pred(batch, args=args, asr=asr, invalid_chars_regex=invalid_chars_regex, normalize_to_lower=normalize_to_lower):
+         prediction = asr(batch["audio"]["array"], chunk_length_s=args.chunk_length_s, stride_length_s=args.stride_length_s)
+
+         batch["prediction"] = prediction["text"]
+         batch["target"] = normalize_text(batch["sentence"], invalid_chars_regex, normalize_to_lower)
+         return batch
+
+     # run inference on all examples
+     result = dataset.map(map_to_pred, remove_columns=dataset.column_names)
+
+     # filtering out empty targets
+     result = result.filter(lambda example: example["target"] != "")
+
+     # compute and log_results
+     # do not change function below
+     log_results(result, args)
+
+
+ if __name__ == "__main__":
+     parser = argparse.ArgumentParser()
+
+     parser.add_argument(
+         "--model_id", type=str, required=True, help="Model identifier. Should be loadable with 🤗 Transformers"
+     )
+     parser.add_argument(
+         "--dataset", type=str, required=True, help="Dataset name to evaluate the `model_id`. Should be loadable with 🤗 Datasets"
+     )
+     parser.add_argument(
+         "--config", type=str, required=True, help="Config of the dataset. *E.g.* `'en'` for Common Voice"
+     )
+     parser.add_argument(
+         "--split", type=str, required=True, help="Split of the dataset. *E.g.* `'test'`"
+     )
+     parser.add_argument(
+         "--chunk_length_s", type=float, default=None, help="Chunk length in seconds. Defaults to None. For long audio files a good value would be 5.0 seconds."
+     )
+     parser.add_argument(
+         "--stride_length_s", type=float, default=None, help="Stride of the audio chunks. Defaults to None. For long audio files a good value would be 1.0 seconds."
+     )
+     parser.add_argument(
+         "--log_outputs", action='store_true', help="If defined, write outputs to log file for analysis."
+     )
+     parser.add_argument(
+         "--greedy", action='store_true', help="If defined, the LM will be ignored during inference."
+     )
+     parser.add_argument(
+         "--device",
+         type=int,
+         default=None,
+         help="The device to run the pipeline on. -1 for CPU, 0 for the first GPU and so on. Defaults to the first GPU if available, otherwise CPU.",
+     )
+     args = parser.parse_args()
+
+     main(args)
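
To make the target normalization in `eval.py` easier to follow, here is a minimal standalone sketch of the same idea: build a regex matching every character outside the model vocabulary, then replace those characters with spaces and collapse whitespace. The vocabulary below is a made-up assumption for illustration; in `eval.py` it is read from the model's tokenizer.

```python
import re


def normalize_text(text: str, invalid_chars_regex: str, to_lower: bool) -> str:
    """Case-fold, blank out out-of-vocabulary characters, collapse whitespace."""
    text = text.lower() if to_lower else text.upper()
    text = re.sub(invalid_chars_regex, " ", text)
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical character vocabulary (assumption); the real one comes from the tokenizer.
vocab_chars = list("abcdefghijklmnopqrstuvwxyzàèéìòù'")

# Anything that is neither whitespace nor a vocabulary character is invalid.
invalid_chars_regex = rf"[^\s{re.escape(''.join(sorted(set(vocab_chars))))}]"

# Lowercase the targets if the vocabulary contains lowercase letters.
normalize_to_lower = any(c.isalpha() and c.islower() for c in vocab_chars)

normalized = normalize_text("Perché no? L'ho visto... ieri!", invalid_chars_regex, normalize_to_lower)
print(normalized)
```

Deriving the allowed characters from the vocabulary, rather than listing punctuation to strip, means the targets can only contain symbols the model is able to emit.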
full_eval.sh ADDED
@@ -0,0 +1,15 @@
+ # CV - TEST
+
+ python eval.py --model_id jonatasgrosman/wav2vec2-large-xlsr-53-italian --dataset mozilla-foundation/common_voice_6_0 --config it --split test --log_outputs --greedy
+ mv log_mozilla-foundation_common_voice_6_0_it_test_predictions.txt log_mozilla-foundation_common_voice_6_0_it_test_predictions_greedy.txt
+ mv mozilla-foundation_common_voice_6_0_it_test_eval_results.txt mozilla-foundation_common_voice_6_0_it_test_eval_results_greedy.txt
+
+ python eval.py --model_id jonatasgrosman/wav2vec2-large-xlsr-53-italian --dataset mozilla-foundation/common_voice_6_0 --config it --split test --log_outputs
+
+ # HF EVENT - DEV
+
+ python eval.py --model_id jonatasgrosman/wav2vec2-large-xlsr-53-italian --dataset speech-recognition-community-v2/dev_data --config it --split validation --chunk_length_s 5.0 --stride_length_s 1.0 --log_outputs --greedy
+ mv log_speech-recognition-community-v2_dev_data_it_validation_predictions.txt log_speech-recognition-community-v2_dev_data_it_validation_predictions_greedy.txt
+ mv speech-recognition-community-v2_dev_data_it_validation_eval_results.txt speech-recognition-community-v2_dev_data_it_validation_eval_results_greedy.txt
+
+ python eval.py --model_id jonatasgrosman/wav2vec2-large-xlsr-53-italian --dataset speech-recognition-community-v2/dev_data --config it --split validation --chunk_length_s 5.0 --stride_length_s 1.0 --log_outputs
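
The `mv` steps in `full_eval.sh` follow from how `eval.py` names its outputs: the file names depend only on the dataset, config and split, not on `--greedy`, so the greedy results are renamed before the LM run writes to the same paths. A small sketch of that naming rule, using a hypothetical helper `output_files` that mirrors the `dataset_id` expression in `log_results`:

```python
from typing import Dict


def output_files(dataset: str, config: str, split: str) -> Dict[str, str]:
    """Mirror the file names eval.py derives from its arguments."""
    # Same expression as in log_results: "/" in the dataset name becomes "_".
    dataset_id = "_".join(dataset.split("/") + [config, split])
    return {
        "results": f"{dataset_id}_eval_results.txt",
        "predictions": f"log_{dataset_id}_predictions.txt",
        "targets": f"log_{dataset_id}_targets.txt",
    }


files = output_files("mozilla-foundation/common_voice_6_0", "it", "test")
for kind, name in files.items():
    print(f"{kind}: {name}")
```

The targets file is not renamed in the script, presumably because the reference transcripts are the same for both decoding modes.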
log_mozilla-foundation_common_voice_6_0_it_test_predictions.txt ADDED
The diff for this file is too large to render. See raw diff
 
log_mozilla-foundation_common_voice_6_0_it_test_predictions_greedy.txt ADDED
The diff for this file is too large to render. See raw diff
 
log_mozilla-foundation_common_voice_6_0_it_test_targets.txt ADDED
The diff for this file is too large to render. See raw diff
 
log_speech-recognition-community-v2_dev_data_it_validation_predictions.txt ADDED
The diff for this file is too large to render. See raw diff
 
log_speech-recognition-community-v2_dev_data_it_validation_predictions_greedy.txt ADDED
The diff for this file is too large to render. See raw diff
 
log_speech-recognition-community-v2_dev_data_it_validation_targets.txt ADDED
The diff for this file is too large to render. See raw diff
 
mozilla-foundation_common_voice_6_0_it_test_eval_results.txt ADDED
@@ -0,0 +1,2 @@
+ WER: 0.06910569105691057
+ CER: 0.018375194551676556
mozilla-foundation_common_voice_6_0_it_test_eval_results_greedy.txt ADDED
@@ -0,0 +1,2 @@
+ WER: 0.09417082587814295
+ CER: 0.02294050694250536
speech-recognition-community-v2_dev_data_it_validation_eval_results.txt ADDED
@@ -0,0 +1,2 @@
+ WER: 0.15828365929356125
+ CER: 0.06834318057693263
speech-recognition-community-v2_dev_data_it_validation_eval_results_greedy.txt ADDED
@@ -0,0 +1,2 @@
+ WER: 0.21789250702005025
+ CER: 0.07944822335865481
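
The result files store WER and CER as fractions, while the `model-index` metadata in README.md lists percentages. Comparing the two, the metadata values appear to be the fractions multiplied by 100 and truncated (not rounded) to two decimals; this is an inference from the numbers, not something the commit states. A short sketch that checks it, with `to_percent_truncated` as a hypothetical helper:

```python
import math

# Raw (WER, CER) fractions copied from the *_eval_results*.txt files in this commit.
raw = {
    "cv_test_greedy": (0.09417082587814295, 0.02294050694250536),
    "cv_test_lm": (0.06910569105691057, 0.018375194551676556),
    "dev_greedy": (0.21789250702005025, 0.07944822335865481),
    "dev_lm": (0.15828365929356125, 0.06834318057693263),
}


def to_percent_truncated(fraction: float) -> float:
    """Convert a fraction to a percentage, truncated to two decimals."""
    return math.floor(fraction * 10_000) / 100


reported = {name: (to_percent_truncated(wer), to_percent_truncated(cer))
            for name, (wer, cer) in raw.items()}
for name, (wer, cer) in reported.items():
    print(f"{name}: WER {wer:.2f}%  CER {cer:.2f}%")
```

Rounding instead would give, for example, 1.84 rather than 1.83 for the LM-decoded Common Voice CER.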