ecker committed (verified)
Commit 5f3cd68 · Parent(s): 343402b

Update README.md

Files changed (1):
1. README.md +6 -7

README.md CHANGED
@@ -52,6 +52,11 @@ This repo contains the following configurations under `./models/`:
  + Additional training on the AR will ~~see huge diminishing returns, so I don't know if it's worth doing so.~~ see slight improvements over additional epochs with different training/sampling paradigms.
  + Seems to be a decent foundation for "distillation", at the very least for LoRA training.
  - Addendum: it seems to serve fine for patch-training a few extra tweaks: non-unified position IDs, split classifier heads, and para-parallel decoding for the AR.
+ + Addendum: This received a lot of additional training (~60k more steps).
+ + This post-training *was* intended to teach the model a pure NAR RVQ-level-0 task for parallel decoding, but an error instead turned it into decent additional AR training.
+ + Classifier-free-guidance-aware training was also performed, which really helps prompt adherence even at ar-temperature=1.0 (a rough sketch of CFG sampling follows after the diff).
+ + Regression tests are still needed in case something was botched, but it seems really nice so far.
+ + The old weights are saved as `ar+nar-old-llama-8` in the event of a nasty regression, but I doubt that will be necessary.

  * ~~`config.llama-tts+stt.yaml` / `ar+nar-tts+stt-llama-8`~~: The above, but partially trained for STT.
  + These weights use the above weights but with additional training for the default `tts` task and a new `stt` task (at a 3:1 ratio; see the task-mixing sketch after the diff).
@@ -92,15 +97,9 @@ This repo contains the following configurations under `./models/`:
  * Throughput and memory usage should be constant between inferencing steps.
  * The model only needs to be invoked about 5+25+7 times (duration inferencing + RVQ level 0 inferencing + remaining RVQ levels) instead (see the step-count sketch after the diff).
  * Unlike the base model, this is trained on the current dataset without iteratively dripfeeding additional sources (like tacking on Emilia afterwards).
+ * ...except STT; this received no STT training out of fear of botching the model.
  * Weights will be added as the model is trained.

- * `config.llama[experimental].yaml` / `ar+nar-experimental-llama-8`: A salvaged experiment of `ar+nar-llama-8`.
- * These weights came from an oversight while trying to train a fully non-autoregressive model.
- * Demasking was trained autoregressively instead of non-autoregressively, making this error possibly salvageable for the base model.
- * This *might* give better output by accounting for possible errors from prior tokens, making it more robust, in theory.
- * The theory is that training was done on tokens being randomly masked off.
- * These weights currently need to be "fixed" with proper, normal training before replacing the original reference model.
-
  Some additional configurations have been explored, but experiments have not been fruitful:
  * Exotic wrappers like `BitNet` seemed to yield little gains in inferencing, somehow. The memory savings are pretty much unnecessary as the models are already manageable at ~200M parameters.
  * Mamba / Mamba2-based models have shown that it's ***really*** hard to have an AR+NAR model. I really do not want to bother throwing the compute at another ~~meme~~ arch where I can't easily make use of all the other tech I'd like to throw at it.
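
The new changelog entry in the first hunk mentions classifier-free-guidance-aware training improving prompt adherence even at ar-temperature=1.0. As a rough illustration of why that helps at sampling time, here is a minimal, hypothetical sketch of standard CFG logit blending; `model`, `cond_ids`, `uncond_ids`, and `cfg_scale` are illustrative assumptions, not the repo's actual API.

```python
import torch

def cfg_logits(model, cond_ids: torch.Tensor, uncond_ids: torch.Tensor,
               cfg_scale: float = 3.0) -> torch.Tensor:
    """Blend conditioned and unconditioned logits (classifier-free guidance).

    A model trained with its conditioning (text + audio prompt) randomly
    dropped can be run twice per step (once with the prompt, once with it
    dropped), and the difference is amplified by `cfg_scale`.
    """
    logits_cond = model(cond_ids)      # forward pass with the full prompt
    logits_uncond = model(uncond_ids)  # forward pass with the prompt dropped
    return logits_uncond + cfg_scale * (logits_cond - logits_uncond)
```

Because the guided logits amplify whatever the prompt contributes, sampling tends to stay on-prompt even at a full temperature of 1.0, which is consistent with the behaviour described above.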
 
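The `ar+nar-tts+stt-llama-8` entry trains the default `tts` task and a new `stt` task at a 3:1 ratio. Below is a small, self-contained sketch of weighted task sampling at that ratio; the sampler and the `TASK_WEIGHTS` name are assumptions for illustration, not the repo's actual dataloader.

```python
import random

# Hypothetical 3:1 tts:stt task mix described in the changelog; the real
# dataloader may weight tasks differently (e.g. per batch or per epoch).
TASK_WEIGHTS = {"tts": 3, "stt": 1}

def sample_task(rng: random.Random) -> str:
    tasks, weights = zip(*TASK_WEIGHTS.items())
    return rng.choices(tasks, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {task: 0 for task in TASK_WEIGHTS}
for _ in range(10_000):
    counts[sample_task(rng)] += 1
print(counts)  # roughly 7,500 'tts' draws to 2,500 'stt' draws
```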
 
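The second hunk notes the model only needs about 5+25+7 invocations (duration inferencing + RVQ level 0 inferencing + remaining RVQ levels), independent of the output length. The tiny sketch below just makes that arithmetic explicit; the step counts come from the text above, while the constant names are purely illustrative.

```python
# Hypothetical breakdown of the roughly constant invocation budget described
# above; the step counts are taken from the changelog text.
DURATION_STEPS = 5          # steps spent inferring the output duration
LEVEL0_STEPS = 25           # iterative inferencing steps for RVQ level 0
REMAINING_RVQ_LEVELS = 7    # one pass for each remaining RVQ level

def total_invocations() -> int:
    return DURATION_STEPS + LEVEL0_STEPS + REMAINING_RVQ_LEVELS

print(total_invocations())  # 37, regardless of how long the utterance is
```

A fixed budget like this is consistent with the claim above that throughput and memory usage stay roughly constant between inferencing steps.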