---
license: mit
base_model:
- openai/whisper-large-v3-turbo
tags:
- asr
- optimizer
- speech
- audio
- frequency
---

**Proof of concept**, in beta... or theta.

This optimizer is designed specifically for ASR-type models, but it also works well with FAM disabled. FAM is turned on by step count via `fam_start_step` (e.g., `fam_start_step=100`).

An experimental approach specifically designed for speech recognition tasks, FAM adapts momentum based on the frequency characteristics of gradient updates.

### Frequency-Adaptive Momentum (FAM)

#### Core Concept

- Speech signals possess an inherent frequency structure, and different parts of a speech model end up responding to different frequency bands. That structure is preserved, albeit transformed, when audio is converted to log-mel spectrograms, and model parameters adapt to capture it.
- The Chain of Frequency Information: Original Audio → Log-Mel Spectrogram → Encoder Parameters → Gradient Updates.
- Empirical observations reveal that transformer-based speech models develop:
  - Lower encoder layers with filters responsive to specific frequency bands in the mel spectrogram.
  - Attention heads tracking particular acoustic patterns over time.
  - A hierarchical representation from acoustic features to phonetic units to words.
- FAM aims to integrate a momentum scheme that adapts based on the "frequency signature" of gradient updates; a sketch of how such a signature might be measured follows this list.
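
A minimal sketch, in PyTorch and not the repository's code, of how a gradient's frequency signature could be measured: flatten the tensor, take a one-sided FFT, and sum the power falling into each of `n_bands` contiguous bands.

```python
import torch

def gradient_band_energies(grad: torch.Tensor, n_bands: int = 10) -> torch.Tensor:
    """Normalized power per frequency band of a flattened gradient."""
    flat = grad.detach().flatten().float()
    spectrum = torch.fft.rfft(flat)      # one-sided spectrum of the update
    power = spectrum.abs() ** 2          # power per frequency bin
    bands = torch.tensor_split(power, n_bands)
    energies = torch.stack([b.sum() for b in bands])
    return energies / energies.sum().clamp_min(1e-12)
```

Under the flattened parameter ordering, low bands reflect slowly varying components of the update and high bands reflect rapidly varying, often noisier, ones.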

#### Why This Optimizer Makes Sense

FAM acknowledges the frequency structure within the optimization process itself, recognizing that:
- **Gradient Frequencies Matter:** The Fourier transform of gradient updates reveals patterns linked to the model's current learning phase.
- **Different Parameters Process Different Bands:** Similar to how our ears have frequency-specific receptors, different parts of the model specialize in various acoustic frequencies.
- **Temporal Structure in Learning:** Speech learning progresses through stages - from basic acoustics to phonetic patterns to linguistic structures.

By applying distinct momentum factors to different frequency bands in parameter space, FAM provides the optimizer with domain-specific audio information that it otherwise wouldn't have.
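
To make that concrete, here is a hypothetical sketch of per-band momentum (the names and semantics are illustrative assumptions, not the FAMOptimizer internals): the momentum buffer lives in the frequency domain, and each band gets its own coefficient.

```python
import torch

def fam_style_momentum_step(param, grad, state, betas, lr=1e-3):
    """One illustrative update: per-band momentum in the frequency domain.
    `betas` holds one momentum coefficient per band (illustrative API)."""
    spec = torch.fft.rfft(grad.flatten().float())
    if "buf" not in state:
        state["buf"] = torch.zeros_like(spec)
    buf_bands = torch.tensor_split(state["buf"], len(betas))
    spec_bands = torch.tensor_split(spec, len(betas))
    # Exponential moving average with a distinct coefficient per band.
    state["buf"] = torch.cat([b * beta + g * (1.0 - beta)
                              for b, g, beta in zip(buf_bands, spec_bands, betas)])
    update = torch.fft.irfft(state["buf"], n=grad.numel()).view_as(grad)
    param.data.add_(update, alpha=-lr)
```

A real optimizer would fold this into `Optimizer.step()` and add weight decay plus the `fam_start_step` gate shown in the usage below.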

Download and test it for free! :D

https://github.com/sine2pi/FAMOptimizer

#### Usage example
```python
# Import FAMOptimizer, FAMScheduler, and get_parameter_groups from the
# repository file (the module path depends on how you vendor it).

# Build parameter groups for the model being trained.
param_groups = get_parameter_groups(model=model, lr=0.001, weight_decay=1e-6)

optimizer = FAMOptimizer(
    params=param_groups,
    beta=0.99,            # momentum coefficient
    n_bands=10,           # number of frequency bands analyzed per update
    fam_start_step=100,   # plain updates until this step, then FAM engages
    layer_boost=True,
    min_size=128,         # presumably the smallest tensor FAM analyzes
    debug=True,
    weight_decay=0.0025,
    lr=0.001,
)

scheduler = FAMScheduler(
    optimizer=optimizer,
    warmup_steps=100,
    total_steps=10000,
    decay_start_step=100,
)
```
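
A typical training loop around these objects, assuming standard PyTorch `step()` semantics (`dataloader` and `compute_loss` are placeholders):

```python
for step, batch in enumerate(dataloader):
    optimizer.zero_grad()
    loss = compute_loss(model, batch)  # e.g., a CTC or cross-entropy ASR loss
    loss.backward()
    optimizer.step()    # FAM behavior engages once step >= fam_start_step
    scheduler.step()    # 100 warmup steps, then decay per FAMScheduler
```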