yangwang825 committed on
Commit 355f046
1 Parent(s): bb30cbe

Upload config

Files changed (3)
  1. README.md +199 -0
  2. config.json +130 -0
  3. configuration_wavlm_spkreg.py +333 -0
README.md ADDED
@@ -0,0 +1,199 @@
+ ---
+ library_name: transformers
+ tags: []
+ ---
+
+ # Model Card for Model ID
+
+ <!-- Provide a quick summary of what the model is/does. -->
+
+
+
+ ## Model Details
+
+ ### Model Description
+
+ <!-- Provide a longer summary of what this model is. -->
+
+ This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
+
+ - **Developed by:** [More Information Needed]
+ - **Funded by [optional]:** [More Information Needed]
+ - **Shared by [optional]:** [More Information Needed]
+ - **Model type:** [More Information Needed]
+ - **Language(s) (NLP):** [More Information Needed]
+ - **License:** [More Information Needed]
+ - **Finetuned from model [optional]:** [More Information Needed]
+
+ ### Model Sources [optional]
+
+ <!-- Provide the basic links for the model. -->
+
+ - **Repository:** [More Information Needed]
+ - **Paper [optional]:** [More Information Needed]
+ - **Demo [optional]:** [More Information Needed]
+
+ ## Uses
+
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+
+ ### Direct Use
+
+ <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+
+ [More Information Needed]
+
+ ### Downstream Use [optional]
+
+ <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+
+ [More Information Needed]
+
+ ### Out-of-Scope Use
+
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+
+ [More Information Needed]
+
+ ## Bias, Risks, and Limitations
+
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
+
+ [More Information Needed]
+
+ ### Recommendations
+
+ <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+
+ Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
+
+ ## How to Get Started with the Model
+
+ Use the code below to get started with the model.
+
+ [More Information Needed]
+
+ ## Training Details
+
+ ### Training Data
+
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+
+ [More Information Needed]
+
+ ### Training Procedure
+
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+
+ #### Preprocessing [optional]
+
+ [More Information Needed]
+
+
+ #### Training Hyperparameters
+
+ - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+
+ #### Speeds, Sizes, Times [optional]
+
+ <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+
+ [More Information Needed]
+
+ ## Evaluation
+
+ <!-- This section describes the evaluation protocols and provides the results. -->
+
+ ### Testing Data, Factors & Metrics
+
+ #### Testing Data
+
+ <!-- This should link to a Dataset Card if possible. -->
+
+ [More Information Needed]
+
+ #### Factors
+
+ <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+
+ [More Information Needed]
+
+ #### Metrics
+
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
+
+ [More Information Needed]
+
+ ### Results
+
+ [More Information Needed]
+
+ #### Summary
+
+
+
+ ## Model Examination [optional]
+
+ <!-- Relevant interpretability work for the model goes here -->
+
+ [More Information Needed]
+
+ ## Environmental Impact
+
+ <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+
+ - **Hardware Type:** [More Information Needed]
+ - **Hours used:** [More Information Needed]
+ - **Cloud Provider:** [More Information Needed]
+ - **Compute Region:** [More Information Needed]
+ - **Carbon Emitted:** [More Information Needed]
+
+ ## Technical Specifications [optional]
+
+ ### Model Architecture and Objective
+
+ [More Information Needed]
+
+ ### Compute Infrastructure
+
+ [More Information Needed]
+
+ #### Hardware
+
+ [More Information Needed]
+
+ #### Software
+
+ [More Information Needed]
+
+ ## Citation [optional]
+
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+
+ **BibTeX:**
+
+ [More Information Needed]
+
+ **APA:**
+
+ [More Information Needed]
+
+ ## Glossary [optional]
+
+ <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+
+ [More Information Needed]
+
+ ## More Information [optional]
+
+ [More Information Needed]
+
+ ## Model Card Authors [optional]
+
+ [More Information Needed]
+
+ ## Model Card Contact
+
+ [More Information Needed]
config.json ADDED
@@ -0,0 +1,130 @@
+ {
+   "_name_or_path": "./wavlm-base",
+   "activation_dropout": 0.0,
+   "adapter_kernel_size": 3,
+   "adapter_stride": 2,
+   "add_adapter": false,
+   "apply_spec_augment": true,
+   "architectures": [
+     "WavLMModel"
+   ],
+   "attention_dropout": 0.1,
+   "auto_map": {
+     "AutoConfig": "configuration_wavlm_spkreg.WavLMSpkRegConfig"
+   },
+   "bos_token_id": 1,
+   "classifier_proj_size": 256,
+   "codevector_dim": 256,
+   "contrastive_logits_temperature": 0.1,
+   "conv_bias": false,
+   "conv_dim": [
+     512,
+     512,
+     512,
+     512,
+     512,
+     512,
+     512
+   ],
+   "conv_kernel": [
+     10,
+     3,
+     3,
+     3,
+     3,
+     2,
+     2
+   ],
+   "conv_stride": [
+     5,
+     2,
+     2,
+     2,
+     2,
+     2,
+     2
+   ],
+   "ctc_loss_reduction": "sum",
+   "ctc_zero_infinity": false,
+   "diversity_loss_weight": 0.1,
+   "do_stable_layer_norm": false,
+   "easy_margin": false,
+   "eos_token_id": 2,
+   "feat_extract_activation": "gelu",
+   "feat_extract_norm": "group",
+   "feat_proj_dropout": 0.1,
+   "feat_quantizer_dropout": 0.0,
+   "final_dropout": 0.0,
+   "freeze_feat_extract_train": true,
+   "hidden_act": "gelu",
+   "hidden_dropout": 0.1,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "label_smoothing": 0.0,
+   "layer_norm_eps": 1e-05,
+   "layerdrop": 0.05,
+   "loss_fct": "cross_entropy",
+   "margin": 0.35,
+   "mask_channel_length": 10,
+   "mask_channel_min_space": 1,
+   "mask_channel_other": 0.0,
+   "mask_channel_prob": 0.0,
+   "mask_channel_selection": "static",
+   "mask_feature_length": 10,
+   "mask_feature_min_masks": 0,
+   "mask_feature_prob": 0.0,
+   "mask_time_length": 10,
+   "mask_time_min_masks": 2,
+   "mask_time_min_space": 1,
+   "mask_time_other": 0.0,
+   "mask_time_prob": 0.05,
+   "mask_time_selection": "static",
+   "max_bucket_distance": 800,
+   "model_type": "wavlm",
+   "no_mask_channel_overlap": false,
+   "no_mask_time_overlap": false,
+   "num_adapter_layers": 3,
+   "num_attention_heads": 12,
+   "num_buckets": 320,
+   "num_codevector_groups": 2,
+   "num_codevectors_per_group": 320,
+   "num_conv_pos_embedding_groups": 16,
+   "num_conv_pos_embeddings": 128,
+   "num_ctc_classes": 80,
+   "num_feat_extract_layers": 7,
+   "num_hidden_layers": 12,
+   "num_negatives": 100,
+   "output_hidden_size": 768,
+   "pad_token_id": 0,
+   "proj_codevector_dim": 256,
+   "reduction": "mean",
+   "scale": 30.0,
+   "tdnn_dilation": [
+     1,
+     2,
+     3,
+     1,
+     1
+   ],
+   "tdnn_dim": [
+     512,
+     512,
+     512,
+     512,
+     1500
+   ],
+   "tdnn_kernel": [
+     5,
+     3,
+     3,
+     1,
+     1
+   ],
+   "tokenizer_class": "Wav2Vec2CTCTokenizer",
+   "torch_dtype": "float32",
+   "transformers_version": "4.46.2",
+   "use_weighted_layer_sum": false,
+   "vocab_size": 32,
+   "xvector_output_dim": 512
+ }
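
The `auto_map` entry above registers `configuration_wavlm_spkreg.WavLMSpkRegConfig` as the class behind `AutoConfig` for this repository. A minimal loading sketch: the repository id below is a placeholder, and `trust_remote_code=True` is needed so the custom configuration module is downloaded and executed.

```python
from transformers import AutoConfig

# Placeholder repo id -- substitute the actual Hub repository this commit belongs to.
repo_id = "yangwang825/wavlm-spkreg"

# trust_remote_code=True lets transformers import configuration_wavlm_spkreg.py,
# which the "auto_map" field in config.json points AutoConfig at.
config = AutoConfig.from_pretrained(repo_id, trust_remote_code=True)

print(type(config).__name__)                          # WavLMSpkRegConfig
print(config.loss_fct, config.margin, config.scale)   # cross_entropy 0.35 30.0
```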
configuration_wavlm_spkreg.py ADDED
@@ -0,0 +1,333 @@
+ """WavLM model configuration"""
+
+ import functools
+ import operator
+
+ from transformers.configuration_utils import PretrainedConfig
+ from transformers.utils import logging
+
+ logger = logging.get_logger(__name__)
+
+
+ class WavLMSpkRegConfig(PretrainedConfig):
+     r"""
+     This is the configuration class to store the configuration of a [`WavLMModel`]. It is used to instantiate a WavLM
+     model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
+     defaults will yield a similar configuration to that of the WavLM
+     [microsoft/wavlm-base](https://huggingface.co/microsoft/wavlm-base) architecture.
+
+     Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+     documentation from [`PretrainedConfig`] for more information.
+
+     Args:
+         vocab_size (`int`, *optional*, defaults to 32):
+             Vocabulary size of the WavLM model. Defines the number of different tokens that can be represented by the
+             `inputs_ids` passed when calling [`WavLMModel`].
+         hidden_size (`int`, *optional*, defaults to 768):
+             Dimensionality of the encoder layers and the pooler layer.
+         num_hidden_layers (`int`, *optional*, defaults to 12):
+             Number of hidden layers in the Transformer encoder.
+         num_attention_heads (`int`, *optional*, defaults to 12):
+             Number of attention heads for each attention layer in the Transformer encoder.
+         intermediate_size (`int`, *optional*, defaults to 3072):
+             Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
+         hidden_act (`str` or `function`, *optional*, defaults to `"gelu"`):
+             The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
+             `"relu"`, `"selu"` and `"gelu_new"` are supported.
+         hidden_dropout (`float`, *optional*, defaults to 0.1):
+             The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
+         activation_dropout (`float`, *optional*, defaults to 0.1):
+             The dropout ratio for activations inside the fully connected layer.
+         attention_dropout (`float`, *optional*, defaults to 0.1):
+             The dropout ratio for the attention probabilities.
+         final_dropout (`float`, *optional*, defaults to 0.1):
+             The dropout probability for the final projection layer of [`WavLMForCTC`].
+         layerdrop (`float`, *optional*, defaults to 0.1):
+             The LayerDrop probability. See the [LayerDrop paper](https://arxiv.org/abs/1909.11556) for more details.
+         initializer_range (`float`, *optional*, defaults to 0.02):
+             The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
+         layer_norm_eps (`float`, *optional*, defaults to 1e-5):
+             The epsilon used by the layer normalization layers.
+         feat_extract_norm (`str`, *optional*, defaults to `"group"`):
+             The norm to be applied to 1D convolutional layers in the feature encoder. One of `"group"` for group
+             normalization of only the first 1D convolutional layer or `"layer"` for layer normalization of all 1D
+             convolutional layers.
+         feat_proj_dropout (`float`, *optional*, defaults to 0.0):
+             The dropout probability for the output of the feature encoder.
+         feat_extract_activation (`str`, *optional*, defaults to `"gelu"`):
+             The non-linear activation function (function or string) in the 1D convolutional layers of the feature
+             extractor. If string, `"gelu"`, `"relu"`, `"selu"` and `"gelu_new"` are supported.
+         conv_dim (`Tuple[int]` or `List[int]`, *optional*, defaults to `(512, 512, 512, 512, 512, 512, 512)`):
+             A tuple of integers defining the number of input and output channels of each 1D convolutional layer in the
+             feature encoder. The length of *conv_dim* defines the number of 1D convolutional layers.
+         conv_stride (`Tuple[int]` or `List[int]`, *optional*, defaults to `(5, 2, 2, 2, 2, 2, 2)`):
+             A tuple of integers defining the stride of each 1D convolutional layer in the feature encoder. The length
+             of *conv_stride* defines the number of convolutional layers and has to match the length of *conv_dim*.
+         conv_kernel (`Tuple[int]` or `List[int]`, *optional*, defaults to `(10, 3, 3, 3, 3, 2, 2)`):
+             A tuple of integers defining the kernel size of each 1D convolutional layer in the feature encoder. The
+             length of *conv_kernel* defines the number of convolutional layers and has to match the length of
+             *conv_dim*.
+         conv_bias (`bool`, *optional*, defaults to `False`):
+             Whether the 1D convolutional layers have a bias.
+         num_conv_pos_embeddings (`int`, *optional*, defaults to 128):
+             Number of convolutional positional embeddings. Defines the kernel size of the 1D convolutional positional
+             embeddings layer.
+         num_conv_pos_embedding_groups (`int`, *optional*, defaults to 16):
+             Number of groups of the 1D convolutional positional embeddings layer.
+         num_buckets (`int`, *optional*, defaults to 320):
+             Number of buckets used for the relative position embeddings.
+         max_bucket_distance (`int`, *optional*, defaults to 800):
+             Maximum distance used for the relative position embeddings.
+         do_stable_layer_norm (`bool`, *optional*, defaults to `False`):
+             Whether to apply the *stable* layer norm architecture of the Transformer encoder. `do_stable_layer_norm is
+             True` corresponds to applying layer norm before the attention layer, whereas `do_stable_layer_norm is
+             False` corresponds to applying layer norm after the attention layer.
+         apply_spec_augment (`bool`, *optional*, defaults to `True`):
+             Whether to apply *SpecAugment* data augmentation to the outputs of the feature encoder. For reference see
+             [SpecAugment: A Simple Data Augmentation Method for Automatic Speech
+             Recognition](https://arxiv.org/abs/1904.08779).
+         mask_time_prob (`float`, *optional*, defaults to 0.05):
+             Probability of each feature vector along the time axis to be chosen as the start of the vector span to be
+             masked. Approximately `mask_time_prob * sequence_length // mask_time_length` feature vectors will be masked
+             along the time axis. This is only relevant if `apply_spec_augment is True`.
+         mask_time_length (`int`, *optional*, defaults to 10):
+             Length of vector span along the time axis.
+         mask_time_min_masks (`int`, *optional*, defaults to 2):
+             The minimum number of masks of length `mask_time_length` generated along the time axis, each time step,
+             irrespectively of `mask_time_prob`. Only relevant if
+             `mask_time_prob * len(time_axis) / mask_time_length < mask_time_min_masks`.
+         mask_feature_prob (`float`, *optional*, defaults to 0.0):
+             Probability of each feature vector along the feature axis to be chosen as the start of the vector span to
+             be masked. Approximately `mask_feature_prob * hidden_size // mask_feature_length` feature vectors will be
+             masked along the feature axis. This is only relevant if `apply_spec_augment is True`.
+         mask_feature_length (`int`, *optional*, defaults to 10):
+             Length of vector span along the feature axis.
+         num_codevectors_per_group (`int`, *optional*, defaults to 320):
+             Number of entries in each quantization codebook (group).
+         num_codevector_groups (`int`, *optional*, defaults to 2):
+             Number of codevector groups for product codevector quantization.
+         contrastive_logits_temperature (`float`, *optional*, defaults to 0.1):
+             The temperature *kappa* in the contrastive loss.
+         num_negatives (`int`, *optional*, defaults to 100):
+             Number of negative samples for the contrastive loss.
+         codevector_dim (`int`, *optional*, defaults to 256):
+             Dimensionality of the quantized feature vectors.
+         proj_codevector_dim (`int`, *optional*, defaults to 256):
+             Dimensionality of the final projection of both the quantized and the transformer features.
+         diversity_loss_weight (`float`, *optional*, defaults to 0.1):
+             The weight of the codebook diversity loss component.
+         ctc_loss_reduction (`str`, *optional*, defaults to `"mean"`):
+             Specifies the reduction to apply to the output of `torch.nn.CTCLoss`. Only relevant when training an
+             instance of [`WavLMForCTC`].
+         ctc_zero_infinity (`bool`, *optional*, defaults to `False`):
+             Whether to zero infinite losses and the associated gradients of `torch.nn.CTCLoss`. Infinite losses mainly
+             occur when the inputs are too short to be aligned to the targets. Only relevant when training an instance
+             of [`WavLMForCTC`].
+         use_weighted_layer_sum (`bool`, *optional*, defaults to `False`):
+             Whether to use a weighted average of layer outputs with learned weights. Only relevant when using an
+             instance of [`WavLMForSequenceClassification`].
+         classifier_proj_size (`int`, *optional*, defaults to 256):
+             Dimensionality of the projection before token mean-pooling for classification.
+         tdnn_dim (`Tuple[int]` or `List[int]`, *optional*, defaults to `(512, 512, 512, 512, 1500)`):
+             A tuple of integers defining the number of output channels of each 1D convolutional layer in the *TDNN*
+             module of the *XVector* model. The length of *tdnn_dim* defines the number of *TDNN* layers.
+         tdnn_kernel (`Tuple[int]` or `List[int]`, *optional*, defaults to `(5, 3, 3, 1, 1)`):
+             A tuple of integers defining the kernel size of each 1D convolutional layer in the *TDNN* module of the
+             *XVector* model. The length of *tdnn_kernel* has to match the length of *tdnn_dim*.
+         tdnn_dilation (`Tuple[int]` or `List[int]`, *optional*, defaults to `(1, 2, 3, 1, 1)`):
+             A tuple of integers defining the dilation factor of each 1D convolutional layer in the *TDNN* module of the
+             *XVector* model. The length of *tdnn_dilation* has to match the length of *tdnn_dim*.
+         xvector_output_dim (`int`, *optional*, defaults to 512):
+             Dimensionality of the *XVector* embedding vectors.
+         add_adapter (`bool`, *optional*, defaults to `False`):
+             Whether a convolutional network should be stacked on top of the WavLM encoder. Can be very useful for
+             warm-starting the encoder for SpeechEncoderDecoder models.
+         adapter_kernel_size (`int`, *optional*, defaults to 3):
+             Kernel size of the convolutional layers in the adapter network. Only relevant if `add_adapter is True`.
+         adapter_stride (`int`, *optional*, defaults to 2):
+             Stride of the convolutional layers in the adapter network. Only relevant if `add_adapter is True`.
+         num_adapter_layers (`int`, *optional*, defaults to 3):
+             Number of convolutional layers that should be used in the adapter network. Only relevant if `add_adapter is
+             True`.
+         output_hidden_size (`int`, *optional*):
+             Dimensionality of the encoder output layer. If not defined, this defaults to *hidden_size*. Only relevant
+             if `add_adapter is True`.
+         loss_fct (`str`, *optional*, defaults to `"cross_entropy"`):
+             Classification loss used for speaker-recognition fine-tuning. One of `"cross_entropy"`,
+             `"additive_margin"` or `"additive_angular_margin"`.
+         label_smoothing (`float`, *optional*, defaults to 0.0):
+             Label smoothing factor applied to the classification loss.
+         scale (`float`, *optional*, defaults to 30.0):
+             Scale applied to the logits of the margin-based losses.
+         margin (`float`, *optional*, defaults to 0.35):
+             Margin used by the additive-margin and additive-angular-margin losses.
+         easy_margin (`bool`, *optional*, defaults to `False`):
+             Whether to use the easy-margin variant of the additive-angular-margin loss.
+         reduction (`str`, *optional*, defaults to `"mean"`):
+             Reduction applied to the classification loss.
+
+     Example:
+
+     ```python
+     >>> from transformers import WavLMConfig, WavLMModel
+
+     >>> # Initializing a WavLM microsoft/wavlm-base style configuration
+     >>> configuration = WavLMConfig()
+
+     >>> # Initializing a model (with random weights) from the microsoft/wavlm-base style configuration
+     >>> model = WavLMModel(configuration)
+
+     >>> # Accessing the model configuration
+     >>> configuration = model.config
+     ```"""
+
+     model_type = "wavlm"
+
+     def __init__(
+         self,
+         vocab_size=32,
+         hidden_size=768,
+         num_hidden_layers=12,
+         num_attention_heads=12,
+         intermediate_size=3072,
+         hidden_act="gelu",
+         hidden_dropout=0.1,
+         activation_dropout=0.1,
+         attention_dropout=0.1,
+         feat_proj_dropout=0.0,
+         final_dropout=0.1,
+         layerdrop=0.1,
+         initializer_range=0.02,
+         layer_norm_eps=1e-5,
+         feat_extract_norm="group",
+         feat_extract_activation="gelu",
+         conv_dim=(512, 512, 512, 512, 512, 512, 512),
+         conv_stride=(5, 2, 2, 2, 2, 2, 2),
+         conv_kernel=(10, 3, 3, 3, 3, 2, 2),
+         conv_bias=False,
+         num_conv_pos_embeddings=128,
+         num_conv_pos_embedding_groups=16,
+         num_buckets=320,
+         max_bucket_distance=800,
+         do_stable_layer_norm=False,
+         apply_spec_augment=True,
+         mask_time_prob=0.05,
+         mask_time_length=10,
+         mask_time_min_masks=2,
+         mask_feature_prob=0.0,
+         mask_feature_length=10,
+         num_codevectors_per_group=320,
+         num_codevector_groups=2,
+         contrastive_logits_temperature=0.1,
+         num_negatives=100,
+         codevector_dim=256,
+         proj_codevector_dim=256,
+         diversity_loss_weight=0.1,
+         ctc_loss_reduction="mean",
+         ctc_zero_infinity=False,
+         use_weighted_layer_sum=False,
+         classifier_proj_size=256,
+         tdnn_dim=(512, 512, 512, 512, 1500),
+         tdnn_kernel=(5, 3, 3, 1, 1),
+         tdnn_dilation=(1, 2, 3, 1, 1),
+         xvector_output_dim=512,
+         num_ctc_classes=80,
+         pad_token_id=0,
+         bos_token_id=1,
+         eos_token_id=2,
+         add_adapter=False,
+         adapter_kernel_size=3,
+         adapter_stride=2,
+         num_adapter_layers=3,
+         output_hidden_size=None,
+         loss_fct: str = 'cross_entropy',  # cross_entropy, additive_margin, additive_angular_margin
+         label_smoothing: float = 0.0,
+         scale: float = 30.0,
+         margin: float = 0.35,
+         easy_margin: bool = False,
+         reduction: str = "mean",
+         **kwargs,
+     ):
+         super().__init__(**kwargs, pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id)
+         self.hidden_size = hidden_size
+         self.feat_extract_norm = feat_extract_norm
+         self.feat_extract_activation = feat_extract_activation
+         self.conv_dim = list(conv_dim)
+         self.conv_stride = list(conv_stride)
+         self.conv_kernel = list(conv_kernel)
+         self.conv_bias = conv_bias
+         self.num_buckets = num_buckets
+         self.max_bucket_distance = max_bucket_distance
+         self.num_conv_pos_embeddings = num_conv_pos_embeddings
+         self.num_conv_pos_embedding_groups = num_conv_pos_embedding_groups
+         self.num_feat_extract_layers = len(self.conv_dim)
+         self.num_hidden_layers = num_hidden_layers
+         self.intermediate_size = intermediate_size
+         self.hidden_act = hidden_act
+         self.num_attention_heads = num_attention_heads
+         self.hidden_dropout = hidden_dropout
+         self.attention_dropout = attention_dropout
+         self.activation_dropout = activation_dropout
+         self.feat_proj_dropout = feat_proj_dropout
+         self.final_dropout = final_dropout
+         self.layerdrop = layerdrop
+         self.layer_norm_eps = layer_norm_eps
+         self.initializer_range = initializer_range
+         self.num_ctc_classes = num_ctc_classes
+         self.vocab_size = vocab_size
+         self.do_stable_layer_norm = do_stable_layer_norm
+         self.use_weighted_layer_sum = use_weighted_layer_sum
+         self.classifier_proj_size = classifier_proj_size
+
+         if (
+             (len(self.conv_stride) != self.num_feat_extract_layers)
+             or (len(self.conv_kernel) != self.num_feat_extract_layers)
+             or (len(self.conv_dim) != self.num_feat_extract_layers)
+         ):
+             raise ValueError(
+                 "Configuration for convolutional layers is incorrect. It is required that `len(config.conv_dim)` =="
+                 " `len(config.conv_stride)` == `len(config.conv_kernel)`, but is `len(config.conv_dim) ="
+                 f" {len(self.conv_dim)}`, `len(config.conv_stride) = {len(self.conv_stride)}`,"
+                 f" `len(config.conv_kernel) = {len(self.conv_kernel)}`."
+             )
+
+         # fine-tuning config parameters for SpecAugment: https://arxiv.org/abs/1904.08779
+         self.apply_spec_augment = apply_spec_augment
+         self.mask_time_prob = mask_time_prob
+         self.mask_time_length = mask_time_length
+         self.mask_time_min_masks = mask_time_min_masks
+         self.mask_feature_prob = mask_feature_prob
+         self.mask_feature_length = mask_feature_length
+
+         # parameters for pretraining with codevector quantized representations
+         self.num_codevectors_per_group = num_codevectors_per_group
+         self.num_codevector_groups = num_codevector_groups
+         self.contrastive_logits_temperature = contrastive_logits_temperature
+         self.num_negatives = num_negatives
+         self.codevector_dim = codevector_dim
+         self.proj_codevector_dim = proj_codevector_dim
+         self.diversity_loss_weight = diversity_loss_weight
+
+         # ctc loss
+         self.ctc_loss_reduction = ctc_loss_reduction
+         self.ctc_zero_infinity = ctc_zero_infinity
+
+         # adapter
+         self.add_adapter = add_adapter
+         self.adapter_kernel_size = adapter_kernel_size
+         self.adapter_stride = adapter_stride
+         self.num_adapter_layers = num_adapter_layers
+         self.output_hidden_size = output_hidden_size or hidden_size
+
+         # SequenceClassification-specific parameter. Feel free to ignore for other classes.
+         self.classifier_proj_size = classifier_proj_size
+
+         # XVector-specific parameters. Feel free to ignore for other classes.
+         self.tdnn_dim = list(tdnn_dim)
+         self.tdnn_kernel = list(tdnn_kernel)
+         self.tdnn_dilation = list(tdnn_dilation)
+         self.xvector_output_dim = xvector_output_dim
+
+         # Loss function parameters. Feel free to ignore for other classes.
+         self.loss_fct = loss_fct
+         self.label_smoothing = label_smoothing
+         self.scale = scale
+         self.margin = margin
+         self.easy_margin = easy_margin
+         self.reduction = reduction
+
+     @property
+     def inputs_to_logits_ratio(self):
+         return functools.reduce(operator.mul, self.conv_stride, 1)
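
The speaker-recognition fields (`loss_fct`, `label_smoothing`, `scale`, `margin`, `easy_margin`, `reduction`) are only stored by this configuration class; the modeling code that consumes them is not part of this commit. The sketch below is illustrative only, not the repository's implementation: it shows how an additive-angular-margin (AAM-softmax / ArcFace-style) classification head typically uses these values, assuming `configuration_wavlm_spkreg.py` is importable from the working directory.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

from configuration_wavlm_spkreg import WavLMSpkRegConfig


class AdditiveAngularMarginLoss(nn.Module):
    """Illustrative AAM-softmax head driven by the config fields; not the repository's code."""

    def __init__(self, config: WavLMSpkRegConfig, embedding_dim: int, num_classes: int):
        super().__init__()
        self.scale = config.scale              # logit scale s
        self.margin = config.margin            # additive angular margin m
        self.easy_margin = config.easy_margin
        self.label_smoothing = config.label_smoothing
        self.reduction = config.reduction
        self.weight = nn.Parameter(torch.randn(num_classes, embedding_dim))

    def forward(self, embeddings: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Cosine similarity between L2-normalised embeddings and class weights.
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1.0 + 1e-7, 1.0 - 1e-7))
        target = torch.cos(theta + self.margin)
        if self.easy_margin:
            # Easy-margin variant: only add the margin where the cosine is already positive.
            target = torch.where(cosine > 0, target, cosine)
        one_hot = F.one_hot(labels, num_classes=cosine.size(-1)).to(cosine.dtype)
        logits = self.scale * (one_hot * target + (1.0 - one_hot) * cosine)
        return F.cross_entropy(
            logits, labels, label_smoothing=self.label_smoothing, reduction=self.reduction
        )


config = WavLMSpkRegConfig(loss_fct="additive_angular_margin")
head = AdditiveAngularMarginLoss(config, embedding_dim=config.xvector_output_dim, num_classes=10)
loss = head(torch.randn(4, config.xvector_output_dim), torch.randint(0, 10, (4,)))

# `inputs_to_logits_ratio` is the product of the convolutional strides:
# 5 * 2**6 = 320 input samples per encoder frame (20 ms of 16 kHz audio).
assert config.inputs_to_logits_ratio == 320
```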