Update README.md
README.md CHANGED
@@ -52,6 +52,11 @@ This repo contains the following configurations under `./models/`:
   + Additional training on the AR will ~~see huge diminishing returns, so I don't know if it's worth doing so.~~ see slight improvements over additional epochs with different training/sampling paradigms.
   + Seems to be a decent foundation for "distillation", at the very least for LoRA training.
   - Addendum: it seems to serve fine for patch-training a few extra tweaks: non-unified position IDs, split classifier heads, and para-parallel decoding for the AR.
+  + Addendum: This received a lot of additional training (~60k more steps).
+  + This post-training *was* intended to teach the model a pure NAR RVQ-level-0 task for parallel decoding, but an error turned it into decent AR training instead.
+  + Classifier-free-guidance-aware training was also performed, which really helps prompt adherence even at `ar-temperature=1.0`.
+  + Regression tests are needed in case I did botch something, but it seems really nice so far.
+  + The old weights are saved as `ar+nar-old-llama-8` in the event of a nasty regression, but I doubt that will be necessary.
 
 * ~~`config.llama-tts+stt.yaml` / `ar+nar-tts+stt-llama-8`~~: The above, but partially trained for STT.
   + These weights use the above weights but with additional training for the default `tts` task and a new `stt` task (at a 3:1 ratio).
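The classifier-free-guidance-aware training mentioned in the hunk above amounts to sometimes dropping the conditioning during training so that, at inference, a conditioned and an unconditioned forward pass can be blended. A minimal sketch of the inference side, assuming a generic decoder that returns per-position logits; `model`, `cond_ids`, and `null_ids` are illustrative names, not this repo's actual API:

```python
import torch

def cfg_logits(model, cond_ids, null_ids, cfg_scale: float = 3.0):
    # Two forward passes: one with the prompt, one with it dropped/masked.
    # (`model`, `cond_ids`, `null_ids` are placeholders, not this repo's API.)
    logits_cond = model(cond_ids)
    logits_null = model(null_ids)
    # Blend toward what the conditioning specifically implies;
    # cfg_scale == 1.0 recovers the plain conditional logits.
    return logits_null + cfg_scale * (logits_cond - logits_null)

def sample_next(logits, temperature: float = 1.0):
    # Plain temperature sampling over the final position; the improved
    # prompt adherence from CFG is what makes ar-temperature=1.0 viable.
    probs = torch.softmax(logits[..., -1, :] / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```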
@@ -92,15 +97,9 @@ This repo contains the following configurations under `./models/`:
   * Throughput and memory usage should be constant between inferencing steps.
   * The model only needs to be invoked about 5+25+7 times (duration inferencing + RVQ level 0 inferencing + remaining RVQ levels) instead.
   * Unlike the base model, this is trained on the current dataset without iteratively dripfeeding additional sources (like tacking on Emilia afterwards).
+    * ...except STT; this received no STT training, out of fear of botching the model.
   * Weights will be added as the model is trained.
 
-* `config.llama[experimental].yaml` / `ar+nar-experimental-llama-8`: A salvaged experiment of `ar+nar-llama-8`.
-  * These weights came from an oversight in trying to train a fully non-autoregressive model.
-  * Demasking was trained autoregressively instead of non-autoregressively, making this error possibly salvageable for the base model.
-  * This *might* have better output by accounting for possible errors from prior tokens, making it more robust, in theory.
-  * The theory is that training was on tokens being randomly masked off.
-  * These weights currently need to be "fixed" with proper, normal training before replacing the original reference model.
-
 Some additional configurations have been explored, but the experiments have not been fruitful:
 * Exotic wrappers like `BitNet` seemed to yield little gain in inferencing, somehow. The memory savings are pretty much unnecessary, as the models are already manageable at ~200M parameters.
 * Mamba / Mamba2-based models have shown that it's ***really*** hard to have an AR+NAR model. I really do not want to bother throwing compute at another ~~meme~~ arch where I can't easily make use of all the other tech to throw at it.
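For context on the removed experimental notes: "training was on tokens being randomly masked off" describes the usual MaskGIT-style demasking objective. A rough sketch under generic assumptions (the names and the uniform per-sequence ratio schedule are illustrative, not this repo's internals):

```python
import torch
import torch.nn.functional as F

def demasking_loss(model, tokens: torch.Tensor, mask_id: int) -> torch.Tensor:
    """MaskGIT-style demasking objective: mask a random subset of target
    tokens and train the model to recover them in parallel.

    `model`, `tokens` (batch, seq), and `mask_id` are generic placeholders.
    """
    # Sample a masking ratio per sequence, then a Bernoulli mask from it.
    ratio = torch.rand(tokens.shape[0], 1, device=tokens.device)
    mask = torch.rand(tokens.shape, device=tokens.device) < ratio
    inputs = tokens.masked_fill(mask, mask_id)

    # The intended NAR setup runs this with full (non-causal) attention;
    # running it with a causal mask instead is the "trained autoregressively"
    # oversight the removed notes describe.
    logits = model(inputs)  # (batch, seq, vocab)

    # Supervise only the masked positions.
    return F.cross_entropy(logits[mask], tokens[mask])
```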
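The 5+25+7 figure in the second hunk works out as follows; the point is that the total stays fixed, unlike AR decoding, whose pass count grows with output length:

```python
# Invocation budget for the pure-NAR configuration, per the note above.
# The breakdown is quoted from the README; only the arithmetic is added here.
duration_steps = 5    # passes to infer the output duration
rvq0_steps     = 25   # iterative demasking passes for RVQ level 0
rvq_rest_steps = 7    # one pass per remaining RVQ level (presumably levels 1..7)

total_passes = duration_steps + rvq0_steps + rvq_rest_steps
print(total_passes)   # 37 forward passes, independent of utterance length
```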
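As for the 3:1 `tts`:`stt` mix noted in the first hunk, weighted task sampling could look like the following sketch; the actual trainer's sampler is not shown in this diff, so this is purely hypothetical:

```python
import random

def pick_task(rng: random.Random) -> str:
    # Draw the default `tts` task three times as often as the new `stt`
    # task, matching the 3:1 ratio noted above. Illustrative only.
    return rng.choices(["tts", "stt"], weights=[3, 1], k=1)[0]

rng = random.Random(0)
counts = {"tts": 0, "stt": 0}
for _ in range(10_000):
    counts[pick_task(rng)] += 1
print(counts)  # roughly 7500 tts / 2500 stt
```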