ecker committed on
Commit
796c86d
·
verified ·
1 Parent(s): 35058f2

Update README.md

Files changed (1)
  1. README.md +12 -5
README.md CHANGED
@@ -4,9 +4,7 @@ license: agpl-3.0
4
 
5
  This repo catalogs my weights for use with my [VALL-E](https://github.com/e-c-k-e-r/vall-e) implementation as I try and iron out the kinks.
6
 
7
- The model currently is in a *semi-usable* state, and I'm releasing them now in hopes that it also helps jumpstart anyone else that wants to use them.
8
-
9
- To reiterate, this is ***by no means*** complete. I am not passing this off as competitive.
10
 
11
  ## Models
12
 
@@ -23,12 +21,15 @@ This repo contains the following configurations under `./models/`:
23
  + Prior testing showed that longer prompt durations result in better utterances.
24
  + *Can* benefit from additional training, but I recall the average loss being around `1.9` to `2.1`.
25
  + However, due to regressions (or bias from working under `llama`), I don't think I can optimally train with a RetNet again (both in terms of VRAM consumption and throughput).
 
26
  + Currently does not seem to work anymore due to regressions in the code.
27
 
28
  * `config.llama.yaml` / `ar+nar-llama-8`: The most recent-ishly trained weights after learning from my mistakes.
29
  + This configuration utilizes Llama's attention-based transformer as the underlying architecture, making use of creature comforts like RoPE, GQA, and memory-efficient attention (trained under `xformers`, shouldn't really affect things).
30
  + Prompt and response embeddings ARE summed (half the model was trained without summing, but enabling it seemed to make the most sense, and it didn't affect anything to do so).
 
31
  + Utilizes a HF tokenizer for "optimal" vocab.
 
32
  + The current RVQ level is included as a token as well to help guide NAR tasks better.
33
  + This model received a few days of training on my 4xV100s, stepping up the duration window to *try* and better make the model inference for longer utterances.
34
  + Some sessions end up training the current duration window for a few epochs, but I don't know how much it affected things.
@@ -67,13 +68,19 @@ This repo contains the following configurations under `./models/`:
67
  * `config.llama[layerskip].yaml` / `ar+nar-layerskip-llama-8`: The above, but with very brief training for LayerSkip:
68
  + Post-trained on a small English subset of Emilia and a small private corpus, and Japanese+French+German from Emilia.
69
  + Using shuffled batches (where each batch has the same durations) and a modified `rvq_levels_p` to help the NAR.
 
70
  + This model received LayerSkip-aware training, with layer dropout and early-exit loss to help try and bolster the model and enable self-speculation sampling.
71
- + I *need* to do heavy evaluation against the base model to ensure output quality does not drop before considering replacing the base model with this.
72
  + Goal is to utilize self-speculation sampling to enable speedups when possible.
73
- + Current implementation will early-exit if the entropy/varentropy of the logits are low enough
 
74
  + Training is a pain.
75
  + LayerSkip-aware training does *not* like to train under ROCm.
76
  + Training under float16+AMP with loss scaling will fry the model with a large enough de facto batch size (>512 samples/update step) and/or too low of a loss scale (<=8K).
 
 
 
 
 
77
 
78
  Some additional configurations have been explored, but the experiments have not been fruitful:
79
  * Exotic wrappers like `BitNet` seemed to yield little gains in inferencing, somehow. The memory savings are pretty much unnecessary as the models are already manageable at ~200M parameters.
 
4
 
5
  This repo catalogs my weights for use with my [VALL-E](https://github.com/e-c-k-e-r/vall-e) implementation as I try and iron out the kinks.
6
 
7
+ The model is currently in a *usable* state under `ar+nar-llama-8` (the default model that's downloaded).
 
 
8
 
9
  ## Models
10
 
 
21
  + Prior testing showed that longer prompt durations result in better utterances.
22
  + *Can* benefit from additional training, but I recall the average loss being around `1.9` to `2.1`.
23
  + However, due to regressions (or bias from working under `llama`), I don't think I can optimally train with a RetNet again (both in terms of VRAM consumption and throughput).
24
+ + I would love to revisit this with my more-better-er training paradigms.
25
  + Currently does not seem to work anymore due to regressions in the code.
26
 
27
  * `config.llama.yaml` / `ar+nar-llama-8`: The most recent-ishly trained weights after learning from my mistakes.
28
  + This configuration utilizes Llama's attention-based transformer as the underlying architecture, making use of creature comforts like RoPE, GQA, and memory-efficient attention (trained under `xformers`, shouldn't really affect things).
29
  + Prompt and response embeddings ARE summed (half the model was trained without summing, but enabling it seemed to make the most sense, and it didn't affect anything to do so).
30
+ + However, the opposite is not true: a model trained with summed embeddings does not function after disabling this.
31
  + Utilizes a HF tokenizer for "optimal" vocab.
32
+ + Optimal in the sense that it uses the remaining portion of the 256 indices for merged phonemes (although I imagine it would be better NOT to merge, as the model's focus isn't on phoneme output).
33
  + The current RVQ level is included as a token as well to help guide NAR tasks better.
34
  + This model received a few days of training on my 4xV100s, stepping up the duration window to *try* and better make the model inference for longer utterances.
35
  + Some sessions end up training the current duration window for a few epochs, but I don't know how much it affected things.
 
68
  * `config.llama[layerskip].yaml` / `ar+nar-layerskip-llama-8`: The above, but with very brief training for LayerSkip:
69
  + Post-trained on a small English subset of Emilia and a small private corpus, and Japanese+French+German from Emilia.
70
  + Using shuffled batches (where each batch has the same durations) and a modified `rvq_levels_p` to help the NAR.
71
+ + Initially trained with LayerSkip hyperparameters `R=4` and `e_scale=0.2`, but midway through swapped to `R=2` and `e_scale=0.1` to maintain stability.
72
  + This model received LayerSkip-aware training, with layer dropout and early-exit loss to help try and bolster the model and enable self-speculation sampling.
 
73
  + Goal is to utilize self-speculation sampling to enable speedups when possible.
74
+ + Current implementation will early-exit if the entropy/varentropy of the logits are low enough (<0.1).
75
+ + Speedups seem to shave off a second of inference time.
76
  + Training is a pain.
77
  + LayerSkip-aware training does *not* like to train under ROCm.
78
  + Training under float16+AMP with loss scaling will fry the model with a large enough de facto batch size (>512 samples/update step) and/or too low of a loss scale (<=8K).
79
+ + LayerSkip-aware training seems to degrade the model: the more it trains, the more it harms the model's ability to sound similar to the reference prompt.
80
+ + I imagine this technique only really works for "large" enough models (be it wide and/or deep enough) that may cause it to second-guess in the later levels.
81
+ + The current size of VALL-E doesn't seem to necessitate LayerSkip, as it seems to dumb the model down to ~9 layers instead of 12 (it typically exits early at layer 9, and the remaining layers offer little additional benefit).
82
+ + This *does* seem to prove a nice way to shrink models, and perhaps even grow them? I remember finding that trying to grow a model left the extra layers useless.
83
+ * Barring a revelation, this experiment is bunk unless it can magically live through a LoRA.
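
The entropy/varentropy early-exit check mentioned in the LayerSkip notes can be sketched roughly as below. This is only an illustration in plain Python, not the code from the VALL-E repository: the function names are hypothetical, and requiring *both* values to fall under the `0.1` threshold is an assumption (the README only says they must be "low enough").

```python
import math

def softmax(logits):
    # Numerically stable softmax over a plain list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_and_varentropy(logits):
    # Entropy is the expected surprisal (-log p) under the distribution;
    # varentropy is the variance of that surprisal.
    probs = softmax(logits)
    surprisals = [-math.log(p) if p > 0.0 else 0.0 for p in probs]
    entropy = sum(p * s for p, s in zip(probs, surprisals))
    varentropy = sum(p * (s - entropy) ** 2 for p, s in zip(probs, surprisals))
    return entropy, varentropy

def should_exit_early(logits, threshold=0.1):
    # Assumption: exit only when both statistics are below the threshold.
    entropy, varentropy = entropy_and_varentropy(logits)
    return entropy < threshold and varentropy < threshold
```

A sharply peaked distribution (the intermediate layer is already confident) passes the check, while a flat one does not, so the remaining layers would still run in that case.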
84
 
85
  Some additional configurations have been explored, but the experiments have not been fruitful:
86
  * Exotic wrappers like `BitNet` seemed to yield little gains in inferencing, somehow. The memory savings are pretty much unnecessary as the models are already manageable at ~200M parameters.