DewiBrynJones committed
Commit bc79fa0
1 Parent(s): dc71ee6

Approx. 4000 hours YT data

Files changed (3):
  1. README.md +2 -11
  2. config.json +2 -2
  3. model.safetensors +3 -0
README.md CHANGED
@@ -14,16 +14,7 @@ This model is experimental in investigating pretraining better models with more
 
 https://github.com/huggingface/transformers/tree/main/examples/pytorch/speech-pretraining
 
-This initial base model has been pre-trained with scripts at
-
-https://github.com/techiaith/docker-wav2vec2-cy/tree/main/train/pre-train
-
-using English speech from LibriSpeech's minimal subsets (`validation` and `test`), and 184 hours and 47 minutes of Welsh speech from various playlists on YouTube. The script [`build_youtube_playlists_corpus.sh`](https://github.com/techiaith/docker-wav2vec2-cy/blob/main/inference/python/build_youtube_playlists_corpus.sh) lists the playlists used.
-
-Until we have collected thousands of hours of Welsh speech, rather than hundreds, the WER scores, after fine-tuning, will remain very high. The following WERs are from tests on a Welsh Common Voice test set as well as a [second set of YouTube videos with corrected transcriptions](https://git.techiaith.bangor.ac.uk/data-porth-technolegau-iaith/corpws-profi-adnabod-lleferydd/-/tree/master/data/trawsgrifio).
-
-| Test Set | WER   | CER   | WER (+LM) | CER (+LM) |
-| -------- | ----- | ----- | --------- | --------- |
-| CV CY 10 | 94.83 | 85.55 | 92.31     | 82.25     |
-| YouTube  | 95.43 | 90.26 | 93.60     | 89.33     |
+This base model has been pre-trained on approximately 4,000 hours of Welsh and English speech collected from various channels on YouTube. Only about 25% of the corpus is Welsh-language speech; the English speech includes Welsh-accented English and has therefore been retained for pre-training.
+
+Until we have collected many more hours of speech, this pre-trained model will be of limited use for fine-tuning on downstream tasks.
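The WER and CER figures removed from the README above are word- and character-level error rates, conventionally computed as Levenshtein edit distance divided by reference length. As an illustrative aside (this is not code from the repository's own evaluation scripts), a minimal pure-Python WER sketch:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (1-row DP)."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))  # row for the empty reference prefix
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[n]

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

# One dropped word out of five reference words:
print(wer("mae hi yn braf heddiw", "mae hi braf heddiw"))  # → 0.2
```

CER is the same computation over characters instead of words, which is why the CER columns in the removed table differ from the WER columns.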
 
config.json CHANGED
@@ -1,5 +1,6 @@
 {
   "activation_dropout": 0.0,
+  "adapter_attn_dim": null,
   "adapter_kernel_size": 3,
   "adapter_stride": 2,
   "add_adapter": false,
@@ -51,7 +52,6 @@
   "feat_proj_dropout": 0.0,
   "feat_quantizer_dropout": 0.0,
   "final_dropout": 0.0,
-  "gradient_checkpointing": false,
   "hidden_act": "gelu",
   "hidden_dropout": 0.0,
   "hidden_dropout_prob": 0.0,
@@ -101,7 +101,7 @@
     1
   ],
   "torch_dtype": "float32",
-  "transformers_version": "4.21.0",
+  "transformers_version": "4.38.2",
   "use_weighted_layer_sum": false,
   "vocab_size": 32,
   "xvector_output_dim": 512
model.safetensors ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:1567f48173b19ed1f1e2c9c76fad74d2a6ed662c39128a00be62cca1dd2c9ba7
+size 380246024
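The three added lines are not the model weights themselves but a Git LFS pointer file: the repository stores this small text stub while the actual safetensors blob (~380 MB, per the `size` field) lives in LFS storage. As an illustrative sketch (not part of the repository), such a pointer can be parsed into its key/value fields with a few lines of Python:

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file ("key value" lines) into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields

pointer = """\
version https://git-lfs.github.com/spec/v1
oid sha256:1567f48173b19ed1f1e2c9c76fad74d2a6ed662c39128a00be62cca1dd2c9ba7
size 380246024
"""

info = parse_lfs_pointer(pointer)
print(int(info["size"]) / 1e6)  # size of the real blob in MB, ≈ 380
```

The `oid` field is the SHA-256 of the actual weights file, which clients use to fetch and verify the blob from LFS storage.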