ecker committed on
Commit 5e1ff4a · verified · 1 Parent(s): 822037a

Update README.md

Files changed (1)
  1. README.md +11 -0
README.md CHANGED
@@ -45,6 +45,17 @@ This repo contains the following configurations under `./models/`:
  + Seems to be a decent foundation for "distillation", at the very least for LoRA training.
  - Addendum: it seems to serve fine for patch-training a few extra tweaks: non-unified position IDs, split classifier heads, and para-parallel decoding for the AR.
 
+ * `config.llama-tts+stt.yaml` / `ar+nar-tts+stt-llama-8`: The above, but partially trained for STT.
+ + These weights start from the above weights, with additional training on the default `tts` task and a new `stt` task (at a 3:1 ratio).
+ + Was initially trained with `duration_range: [3.0, 60.0]` and `sample_shuffle: True` for a few hours, then pivoted to my standard `duration_range: [3.0, 12.0]` and `sample_shuffle: False`.
+ + Will need the former training to "undo" any issues with durations, as those have usually come up before.
+ + The `stt` task simply takes a piece of audio and outputs a transcription in IPA phonemes (which the model is already trained against for its text inputs).
+ + Can be done with `--task=stt` and an empty (`""`) text input through the CLI, or through the `Speech-to-Text` tab in the web UI.
+ + This mainly serves as a stepping stone before pivoting towards SpeechX tasks.
+ + I first need a good mechanism to make sure I *can* extend existing weights with additional tasks, starting with a simple enough task.
+ + This also *maybe* helps bolster the initial TTS task by giving the model a better internal state (or something to that tune).
+ + STT is not perfect against voices that aren't close to a normal speaking voice (as per the dataset), unlike TTS, where "sounds close enough" is acceptable and there is room for error.
+
  Some additional configurations have been explored, but the experiments have not been fruitful:
  * Exotic wrappers like `BitNet` seemed to yield little gain in inferencing, somehow. The memory savings are pretty much unnecessary as the models are already manageable at ~200M parameters.
  * Mamba / Mamba2-based models have shown that it's ***really*** hard to have an AR+NAR model. I really do not want to bother throwing the compute at another ~~meme~~ arch that I can't easily apply all the other tech to.
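
For readers skimming the diff, the two training knobs it mentions (the 3:1 task ratio and the `duration_range` window) can be pictured with a rough Python sketch. This is not the repository's code: `TASK_WEIGHTS`, `pick_task`, and `in_duration_range` are hypothetical names made up for illustration, the ratio is assumed to mean three `tts` samples per `stt` sample, and the duration bounds are assumed to be inclusive.

```python
import random
from collections import Counter

# Hypothetical names for illustration only; none of these come from the repo.
# Assumption: "3:1" means three `tts` samples for every `stt` sample.
TASK_WEIGHTS = {"tts": 3, "stt": 1}


def pick_task(rng: random.Random) -> str:
    """Draw one task name with probability proportional to its weight."""
    tasks = list(TASK_WEIGHTS)
    weights = [TASK_WEIGHTS[task] for task in tasks]
    return rng.choices(tasks, weights=weights, k=1)[0]


def in_duration_range(seconds: float, duration_range=(3.0, 12.0)) -> bool:
    """Keep an utterance only if its length is inside the window.

    Assumption: both bounds are inclusive.
    """
    low, high = duration_range
    return low <= seconds <= high


if __name__ == "__main__":
    rng = random.Random(0)
    counts = Counter(pick_task(rng) for _ in range(10_000))
    print(counts)  # roughly three "tts" draws per "stt" draw

    durations = [1.5, 3.0, 8.2, 12.0, 45.0]
    # The narrower window drops the 45 s clip; the wider one keeps it.
    print([d for d in durations if in_duration_range(d)])
    print([d for d in durations if in_duration_range(d, (3.0, 60.0))])
```

Passing `(3.0, 60.0)` mimics the initial run described in the diff, while the default window mimics the later `[3.0, 12.0]` setting.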