ecker committed on
Commit fd4aef5 · verified · 1 Parent(s): e626434

Update README.md

RIP the DAC dream

Files changed (1)
  1. README.md +4 -2
README.md CHANGED
@@ -57,13 +57,15 @@ Some additional configurations have been explored with, but experiments have not
  + the 44KHz model was erroneously assumed to be an even 44KHz, when in reality it's 44.1KHz. *All* of my audio has to be requantized, as there's some stuttering in it.
  + Because of this, training losses are high and it's having a hard time trying to converge.
  + It has *sub-servicable* output for the first 4 RVQ levels, but it's massive cope to try and use it as a model.
- + I believe there's hope to use it when I requantize my audio properly.
+ + ~~I believe there's hope to use it when I requantize my audio properly.~~
+ + Addendum: even after properly processing my audio, the loss is actually *worse* than before. I imagine DAC just cannot be used as an intermediary for an LM.
  * a model with a causal size >1 (sampling more than one token for the AR):
- + re-using an exisitng model or training from scratch does not have fruitful results.
+ + re-using an existing model or training from scratch does not have fruitful results.
  + there's an inherent periodic stutter that doesn't seem to be able to be trained out, but it might require exotic sampling methods.
  + unfortunately it requires:
  + either something similar to Medusa heads, where there's additional parameters to perform speculative sampling,
  + a solution similar to what VALL-E 2 uses with group token embeddings or whatever, which *will* harm the NAR tasks in an AR+NAR model.
+ + I just don't understand where the issue lies, since parallel decoding does work, as evidence with the NAR.

  Some current "achitectural features" are in-use, but their effects need to be experimented with further:
  * `split_classifier_heads` is still a mystery whether it's truly helpful or not (each RVQ level gets its own output head).