Update README.md
README.md CHANGED
@@ -52,6 +52,11 @@ This repo contains the following configurations under `./models/`:
   + Additional training on the AR will ~~see huge diminishing returns, so I don't know if it's worth doing so.~~ see slight improvements over additional epochs with different training/sampling paradigms.
   + Seems to be a decent foundation for "distillation", at the very least for LoRA training.
   - Addendum: it seems to serve fine for patch-training a few extra tweaks: non-unified position IDs, split classifier heads, and para-parallel decoding for the AR.
+  + Addendum: This received a lot of additional training (~60k more steps).
+  + This post-training *was* intended to teach the model a pure NAR RVQ-level-0 task for parallel decoding, but an error turned it into decent AR training instead.
+  + Classifier-free-guidance-aware training was also performed, which really helps prompt adherence even at `ar-temperature=1.0`.
+  + Regression tests are needed in case I did botch something, but it seems really nice so far.
+  + The old weights are saved as `ar+nar-old-llama-8` in the event of a nasty regression, but I doubt that will be necessary.
 
 * ~~`config.llama-tts+stt.yaml` / `ar+nar-tts+stt-llama-8`~~: The above, but partially trained for STT.
   + These weights use the above weights but with additional training for the default `tts` task and a new `stt` task (at a 3:1 ratio).
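The classifier-free-guidance-aware training mentioned in the hunk above amounts to sometimes dropping the conditioning during training so that, at inference, a conditioned and an unconditioned forward pass can be blended. A minimal sketch of the inference side, assuming a generic decoder that returns per-position logits; `model`, `cond_ids`, and `null_ids` are illustrative names, not this repo's actual API:

```python
import torch

def cfg_logits(model, cond_ids, null_ids, cfg_scale: float = 3.0):
    # Two forward passes: one with the prompt, one with it dropped/masked.
    # (`model`, `cond_ids`, `null_ids` are placeholders, not this repo's API.)
    logits_cond = model(cond_ids)
    logits_null = model(null_ids)
    # Blend toward what the conditioning specifically implies;
    # cfg_scale == 1.0 recovers the plain conditional logits.
    return logits_null + cfg_scale * (logits_cond - logits_null)

def sample_next(logits, temperature: float = 1.0):
    # Plain temperature sampling over the final position; the improved
    # prompt adherence from CFG is what makes ar-temperature=1.0 viable.
    probs = torch.softmax(logits[..., -1, :] / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```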
@@ -92,15 +97,9 @@ This repo contains the following configurations under `./models/`:
   * Throughput and memory usage should be constant between inferencing steps.
   * The model only needs to be invoked about 5+25+7 times (duration inferencing + RVQ level 0 inferencing + remaining RVQ levels) instead.
   * Unlike the base model, this is trained on the current dataset without iteratively dripfeeding additional sources (like tacking on Emilia afterwards).
+    * ...except STT; this received no STT training, out of fear of botching the model.
   * Weights will be added as the model is trained.
 
-* `config.llama[experimental].yaml` / `ar+nar-experimental-llama-8`: A salvaged experiment of `ar+nar-llama-8`.
-  * These weights came from an oversight in trying to train a fully non-autoregressive model.
-  * Demasking was trained autoregressively instead of non-autoregressively, making this error possibly salvageable for the base model.
-  * This *might* have better output by accounting for possible errors from prior tokens, making it more robust, in theory.
-  * The theory is that training was on tokens being randomly masked off.
-  * These weights currently need to be "fixed" with proper, normal training before replacing the original reference model.
-
 Some additional configurations have been explored, but the experiments have not been fruitful:
 * Exotic wrappers like `BitNet` seemed to yield little gain in inferencing, somehow. The memory savings are pretty much unnecessary, as the models are already manageable at ~200M parameters.
 * Mamba / Mamba2-based models have shown that it's ***really*** hard to have an AR+NAR model. I really do not want to bother throwing compute at another ~~meme~~ arch where I can't easily make use of all the other tech to throw at it.
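For context on the removed experimental notes: "training was on tokens being randomly masked off" describes the usual MaskGIT-style demasking objective. A rough sketch under generic assumptions (the names and the uniform per-sequence ratio schedule are illustrative, not this repo's internals):

```python
import torch
import torch.nn.functional as F

def demasking_loss(model, tokens: torch.Tensor, mask_id: int) -> torch.Tensor:
    """MaskGIT-style demasking objective: mask a random subset of target
    tokens and train the model to recover them in parallel.

    `model`, `tokens` (batch, seq), and `mask_id` are generic placeholders.
    """
    # Sample a masking ratio per sequence, then a Bernoulli mask from it.
    ratio = torch.rand(tokens.shape[0], 1, device=tokens.device)
    mask = torch.rand(tokens.shape, device=tokens.device) < ratio
    inputs = tokens.masked_fill(mask, mask_id)

    # The intended NAR setup runs this with full (non-causal) attention;
    # running it with a causal mask instead is the "trained autoregressively"
    # oversight the removed notes describe.
    logits = model(inputs)  # (batch, seq, vocab)

    # Supervise only the masked positions.
    return F.cross_entropy(logits[mask], tokens[mask])
```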
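The 5+25+7 figure in the second hunk works out as follows; the point is that the total stays fixed, unlike AR decoding, whose pass count grows with output length:

```python
# Invocation budget for the pure-NAR configuration, per the note above.
# The breakdown is quoted from the README; only the arithmetic is added here.
duration_steps = 5    # passes to infer the output duration
rvq0_steps     = 25   # iterative demasking passes for RVQ level 0
rvq_rest_steps = 7    # one pass per remaining RVQ level (presumably levels 1..7)

total_passes = duration_steps + rvq0_steps + rvq_rest_steps
print(total_passes)   # 37 forward passes, independent of utterance length
```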
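As for the 3:1 `tts`:`stt` mix noted in the first hunk, weighted task sampling could look like the following sketch; the actual trainer's sampler is not shown in this diff, so this is purely hypothetical:

```python
import random

def pick_task(rng: random.Random) -> str:
    # Draw the default `tts` task three times as often as the new `stt`
    # task, matching the 3:1 ratio noted above. Illustrative only.
    return rng.choices(["tts", "stt"], weights=[3, 1], k=1)[0]

rng = random.Random(0)
counts = {"tts": 0, "stt": 0}
for _ in range(10_000):
    counts[pick_task(rng)] += 1
print(counts)  # roughly 7500 tts / 2500 stt
```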