jbetker committed on
Commit 07a6edc
1 Parent(s): 78398eb
.models/classifier.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:95ab946010be0a963b5039e8fca74bbb8a6eebcf366c761db21ae7e94cd6ada3
+ size 60938957
README.md CHANGED
@@ -1,77 +1,193 @@
- # Tortoise-TTS

- Tortoise TTS is an experimental text-to-speech program that uses recent machine learning techniques to generate
- high-quality speech samples.

  This repo contains all the code needed to run Tortoise TTS in inference mode.

  ## What's in a name?

  I'm naming my speech-related repos after Mojave desert flora and fauna. Tortoise is a bit tongue in cheek: this model
- is insanely slow. It leverages both an autoregressive speech alignment model and a diffusion model, both of which
- are known for their slow inference. It also performs CLIP sampling, which slows things down even further. You can
- expect ~5 seconds of speech to take ~30 seconds to produce on the latest hardware. Still, the results are pretty cool.
-
- ## What the heck is this?

- Tortoise TTS is inspired by OpenAI's DALLE, applied to speech data. It is made up of 4 separate models that work together.
- These models are all derived from different repositories which are all linked. All the models have been modified
- for this use case (some substantially so).

- First, an autoregressive transformer stack predicts discrete speech "tokens" given a text prompt. This model is very
- similar to the GPT model used by DALLE, except it operates on speech data.
- Based on: [GPT2 from Transformers](https://huggingface.co/docs/transformers/model_doc/gpt2)

- Next, a CLIP model judges a batch of outputs from the autoregressive transformer against the provided text and stack
- ranks the outputs according to most probable. You could use greedy or beam-search decoding but in my experience CLIP
- decoding creates considerably better results.
- Based on [CLIP from lucidrains](https://github.com/lucidrains/DALLE-pytorch/blob/main/dalle_pytorch/dalle_pytorch.py)

- Next, the speech "tokens" are decoded into a low-quality MEL spectrogram using a VQVAE.
- Based on [VQVAE2 by rosinality](https://github.com/rosinality/vq-vae-2-pytorch)

- Finally, the output of the VQVAE is further decoded by a UNet diffusion model into raw audio, which can be placed in
- a wav file.
- Based on [ImprovedDiffusion by openai](https://github.com/openai/improved-diffusion)

- ## How do I use this?

- Check out the colab: https://colab.research.google.com/drive/1wVVqUPqwiDBUVeWWOUNglpGhU3hg_cbR?usp=sharing

- Or on a computer with a GPU (with >=16GB of VRAM):
  ```shell
  git clone https://github.com/neonbjb/tortoise-tts.git
  cd tortoise-tts
  pip install -r requirements.txt
- python do_tts.py
  ```

- ## Hand-picked TTS samples

- I generated ~250 samples from 23 text prompts and 8 voices. The text prompts have never been seen by the model. The
- voices were pulled from the training set.

- All of the samples can be found in the results/ folder of this repo. I handpicked a few to show what the model is capable of:

- - [Atkins - Road not taken](results/favorites/atkins_road_not_taken.wav)
- - [Dotrice - Rolling Stone interview](results/favorites/dotrice_rollingstone.wav)
- - [Dotrice - 'Ornaments' from tacotron test set](results/favorites/dotrice_tacotron_samp1.wav)
- - [Kennard - 'Acute emotional intelligence' from tacotron test set](results/favorites/kennard_tacotron_samp2.wav)
- - [Mol - Because I could not stop for death](results/favorites/mol_dickenson.wav)
- - [Mol - Obama](results/favorites/mol_obama.wav)

- Prosody is remarkably good for poetry, despite the fact that it was never trained on poetry.

- ## How do I train this?

- Frankly - you don't. Building this model has been a labor of love for me, consuming most of my 6 RTX3090s worth of
- resources for the better part of 6 months. It uses a dataset I've gathered, refined and transcribed that consists of
- a lot of audio data which I cannot distribute because of copyright or no open licenses.

- With that said, I'm willing to help you out if you really want to give it a shot. DM me.

  ## Looking forward

- I'm not satisfied with this yet. Treat this as a "sneak peek" and check back in a couple of months. I think the concept
- is sound, but there are a few hurdles to overcome to get sample quality up. I have been doing major tweaks to the
- diffusion model and should have something new and much better soon.
 
+ # TorToiSe

+ Tortoise is a text-to-speech program built with the following priorities:
+
+ 1. Strong multi-voice capabilities.
+ 2. Highly realistic prosody and intonation.

  This repo contains all the code needed to run Tortoise TTS in inference mode.

  ## What's in a name?

  I'm naming my speech-related repos after Mojave desert flora and fauna. Tortoise is a bit tongue in cheek: this model
+ is insanely slow. It leverages both an autoregressive decoder **and** a diffusion decoder, both known for slow
+ sampling. On a K80, expect to generate a medium-sized sentence every 2 minutes.

+ ## Demos

+ See [this page](http://nonint.com/static/tortoise_v2_examples.html) for a large list of example outputs.

+ ## Usage guide

+ ### Colab

+ Colab is the easiest way to try this out. I've put together a notebook you can use here:
+ https://colab.research.google.com/drive/1wVVqUPqwiDBUVeWWOUNglpGhU3hg_cbR?usp=sharing

+ ### Installation

+ If you want to use this on your own computer, you must have an NVIDIA GPU. Installation:

  ```shell
  git clone https://github.com/neonbjb/tortoise-tts.git
  cd tortoise-tts
  pip install -r requirements.txt
  ```

+ ### do_tts.py
+
+ This script allows you to speak a single phrase with one or more voices.
+ ```shell
+ python do_tts.py --text "I'm going to speak this" --voice dotrice --preset fast
+ ```
+
+ ### read.py
+
+ This script provides tools for reading large amounts of text.
+
+ ```shell
+ python read.py --textfile <your text to be read> --voice dotrice
+ ```
+
+ This will break up the textfile into sentences, and then convert them to speech one at a time. It will output a series
+ of spoken clips as they are generated. Once all the clips are generated, it will combine them into a single file and
+ output that as well.
+
+ Sometimes Tortoise screws up an output. You can re-generate any bad clips by re-running `read.py` with the `--regenerate`
+ argument.
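The read.py behaviour described above (split the text into sentences, speak them one at a time, then join the clips) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not read.py's code: the regex sentence splitter, the `synthesize` callback (which in practice would wrap something like `tts.tts_with_preset`), and the names `split_sentences`, `read_text` and `combine` are all hypothetical, and clips are modeled as plain lists of samples.

```python
import re
from typing import Callable, Dict, List, Optional, Set

Clip = List[float]  # one mono clip as a list of samples (simplification)


def split_sentences(text: str) -> List[str]:
    # Hypothetical splitter: break after '.', '!' or '?' followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def read_text(text: str, synthesize: Callable[[str], Clip],
              cache: Optional[Dict[int, Clip]] = None,
              regenerate: Optional[Set[int]] = None) -> List[Clip]:
    # Speak sentences one at a time. Clips already in `cache` are reused unless
    # their index is in `regenerate` (loosely mirroring the --regenerate idea).
    if cache is None:
        cache = {}
    regenerate = regenerate or set()
    clips = []
    for i, sentence in enumerate(split_sentences(text)):
        if i not in cache or i in regenerate:
            cache[i] = synthesize(sentence)
        clips.append(cache[i])
    return clips


def combine(clips: List[Clip]) -> Clip:
    # Concatenate the per-sentence clips into a single clip.
    combined: Clip = []
    for clip in clips:
        combined.extend(clip)
    return combined
```

Keeping the per-sentence clips around is what makes fixing one bad clip cheap: only the indices listed for regeneration are synthesized again.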
+
+ ### API
+
+ Tortoise can be used programmatically, like so:
+
+ ```python
+ import api
+ import utils.audio
+
+ # clips_paths: a list of paths to your reference clips.
+ reference_clips = [utils.audio.load_audio(p, 22050) for p in clips_paths]
+ tts = api.TextToSpeech()
+ pcm_audio = tts.tts_with_preset("your text here", reference_clips, preset='fast')
+ ```
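`tts_with_preset()` is a thin wrapper around `tts()`: the api.py diff further down shows the method just before `tts()` (presumably `tts_with_preset()`) ending with `kwargs.update(presets[preset])` followed by `return self.tts(text, voice_samples, **kwargs)`. Because the preset dictionary is applied last, a preset value replaces any caller-supplied keyword with the same name. The sketch below reproduces only that merge order; `PRESETS`, its numbers, and `resolve_preset_kwargs` are hypothetical placeholders, not the values defined in api.py.

```python
from typing import Any, Dict

# Hypothetical preset table; the names and numbers are placeholders.
PRESETS: Dict[str, Dict[str, Any]] = {
    "fast": {"num_autoregressive_samples": 96, "diffusion_iterations": 32},
    "standard": {"num_autoregressive_samples": 256, "diffusion_iterations": 128},
}


def resolve_preset_kwargs(preset: str, **kwargs: Any) -> Dict[str, Any]:
    # Same order as `kwargs.update(presets[preset])`: the preset is applied
    # last, so it wins for any key that both the caller and the preset set.
    if preset not in PRESETS:
        raise ValueError(f"unknown preset {preset!r}; expected one of {sorted(PRESETS)}")
    merged = dict(kwargs)
    merged.update(PRESETS[preset])
    return merged
```

With this merge order, passing `diffusion_iterations=200` alongside a preset that also sets it has no effect; to control such a knob yourself, call `tts()` directly, which is where the README's Advanced Usage section points.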
+
+ ## Voice customization guide
+
+ Tortoise was specifically trained to be a multi-speaker model. It accomplishes this by consulting reference clips.
+
+ These reference clips are recordings of a speaker that you provide to guide speech generation. These clips are used to determine many properties of the output, such as the pitch and tone of the voice, speaking speed, and even speaking defects like a lisp or stuttering. The reference clip is also used to determine non-voice-related aspects of the audio output like volume, background noise, recording quality and reverb.
+
+ ### Provided voices
+
+ This repo comes with several pre-packaged voices. You will be familiar with many of them. :)
+
+ Most of the provided voices were not found in the training set. Experimentally, it seems that voices from the training set
+ produce more realistic outputs than those outside of the training set. Any voice whose name starts with "train" came from the
+ training set.
+
+ ### Adding a new voice
+
+ To add new voices to Tortoise, you will need to do the following:

+ 1. Gather audio clips of your speaker(s). Good sources are YouTube interviews (you can use youtube-dl to fetch the audio), audiobooks or podcasts. Guidelines for good clips are in the next section.
+ 2. Cut your clips into ~10 second segments. You want at least 3 clips. More is better, but I only experimented with up to 5 in my testing.
+ 3. Save the clips as WAV files in floating point format with a 22,050 Hz sample rate.
+ 4. Create a subdirectory in voices/.
+ 5. Put your clips in that subdirectory.
+ 6. Run tortoise utilities with --voice=<your_subdirectory_name>.
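Steps 2 and 3 above come down to simple arithmetic: at 22,050 Hz, a ~10 second segment is 220,500 samples. A minimal sketch of cutting one long mono recording into such segments and checking the "at least 3 clips" guideline follows. The helper names are hypothetical, dropping the trailing partial segment is an arbitrary choice, and decoding or writing the floating-point WAV files is left to whatever audio library you already use.

```python
from typing import List, Sequence

SAMPLE_RATE = 22050   # Hz, the rate the README asks reference clips to use
SEGMENT_SECONDS = 10  # "~10 second segments"
MIN_CLIPS = 3         # "You want at least 3 clips"


def cut_into_segments(samples: Sequence[float], sample_rate: int = SAMPLE_RATE,
                      seconds: float = SEGMENT_SECONDS) -> List[List[float]]:
    # Consecutive, non-overlapping segments; a trailing remainder shorter
    # than one segment is dropped.
    seg_len = int(sample_rate * seconds)
    if seg_len <= 0:
        raise ValueError("segment length must be positive")
    return [list(samples[start:start + seg_len])
            for start in range(0, len(samples) - seg_len + 1, seg_len)]


def enough_clips(segments: List[List[float]], minimum: int = MIN_CLIPS) -> bool:
    # True when the guideline's minimum clip count is met.
    return len(segments) >= minimum
```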

+ ### Picking good reference clips

+ As mentioned above, your reference clips have a profound impact on the output of Tortoise. Following are some tips for picking
+ good clips:

+ 1. Avoid clips with background music, noise or reverb. These clips were removed from the training dataset. Tortoise is unlikely to do well with them.
+ 2. Avoid speeches. These generally have distortion caused by the amplification system.
+ 3. Avoid clips from phone calls.
+ 4. Avoid clips that have excessive stuttering, stammering or words like "uh" or "like" in them.
+ 5. Try to find clips that are spoken the way you want your output to sound. For example, if you want to hear your target voice read an audiobook, try to find clips of them reading a book.
+ 6. The text being spoken in the clips does not matter, but diverse text does seem to perform better.

+ ## Advanced Usage

+ ### Generation settings

+ Tortoise is primarily an autoregressive decoder model combined with a diffusion model. Both of these have a lot of knobs
+ that I've abstracted away for the sake of ease of use. I did this by generating thousands of clips using
+ various permutations of the settings and using a metric for voice realism and intelligibility to measure their effects. I've
+ set the defaults to the best overall settings I was able to find. For specific use cases, it might be effective to play with
+ these settings (and it's very likely that I missed something!).
+
+ These settings are not available in the normal scripts packaged with Tortoise. They are available, however, in the API. See
+ `api.tts` for a full list.
+
+ ### Playing with the voice latent
+
+ Tortoise ingests reference clips by feeding them individually through a small submodel that produces a point latent, then taking the mean of all of the produced latents. The experimentation I have done has indicated that these point latents are quite expressive, affecting
+ everything from tone to speaking rate to speech abnormalities.
+
+ This lends itself to some neat tricks. For example, you can feed two different voices to Tortoise and it will output what it thinks the "average" of those two voices sounds like. You could also theoretically build a small extension to Tortoise that gradually shifts the
+ latent from one speaker to another, then apply it across a bit of spoken text (something I haven't implemented yet, but might
+ get to soon!). I am sure there are other interesting things that can be done here. Please let me know what you find!
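The averaging and blending ideas above can be illustrated with plain vectors. This is a minimal sketch, assuming each clip's point latent is a fixed-length list of floats (in Tortoise it is a tensor produced by a submodel). `mean_latent` mirrors the mean-of-latents step the README describes; `latent_schedule` is a hypothetical version of the speaker-to-speaker shift, which the README says is not implemented.

```python
from typing import List, Sequence

Vector = List[float]


def mean_latent(latents: Sequence[Sequence[float]]) -> Vector:
    # Element-wise mean of per-clip point latents.
    if not latents:
        raise ValueError("need at least one latent")
    dim = len(latents[0])
    if any(len(v) != dim for v in latents):
        raise ValueError("all latents must have the same dimension")
    return [sum(v[i] for v in latents) / len(latents) for i in range(dim)]


def lerp(a: Sequence[float], b: Sequence[float], t: float) -> Vector:
    # Linear interpolation: t=0 gives a, t=1 gives b.
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]


def latent_schedule(a: Sequence[float], b: Sequence[float], steps: int) -> List[Vector]:
    # Latents moving from speaker a to speaker b in equal steps,
    # e.g. one per sentence of the text being spoken.
    if steps < 2:
        raise ValueError("steps must be >= 2")
    return [lerp(a, b, i / (steps - 1)) for i in range(steps)]
```

One design consequence: averaging over every clip weights each speaker by how many clips they contribute, whereas averaging each speaker's mean first (`mean_latent([mean_latent(a_clips), mean_latent(b_clips)])`) gives both voices equal weight.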
+
+ ### Send me feedback!
+
+ Probabilistic models like Tortoise are best thought of as an "augmented search" - in this case, through the space of possible
+ utterances of a specific string of text. The impact of community involvement in perusing these spaces (such as is being done with
+ GPT-3 or CLIP) has really surprised me. If you find something neat that you can do with Tortoise that isn't documented here,
+ please report it to me! I would be glad to publish it to this page.
+
+ ## Model architecture
+
+ Tortoise TTS is inspired by OpenAI's DALLE, applied to speech data and using a better decoder. It is made up of 5 separate
+ models that work together. I've assembled a write-up of the system architecture here:
+ [https://nonint.com/2022/04/25/tortoise-architectural-design-doc/](https://nonint.com/2022/04/25/tortoise-architectural-design-doc/)
+
+ ## Training
+
+ These models were trained on my "homelab" server with 8 RTX 3090s over the course of several months. They were trained on a dataset consisting of
+ ~50k hours of speech data, most of which was transcribed by [ocotillo](http://www.github.com/neonbjb/ocotillo). Training was done on my own
+ [DLAS](https://github.com/neonbjb/DL-Art-School) trainer.
+
+ I currently do not have plans to release the training configurations or methodology. See the next section.
+
+ ## Ethical Considerations
+
+ Tortoise v2 works considerably better than I had planned. When I began hearing some of the outputs of the last few versions, I began
+ wondering whether or not I had an ethically unsound project on my hands. The ways in which a voice-cloning text-to-speech system
+ could be misused are many. It doesn't take much creativity to think up how.
+
+ After some thought, I have decided to go forward with releasing this. Following are the reasons for this choice:
+
+ 1. It is primarily good at reading books and speaking poetry. Other forms of speech do not work well.
+ 2. It was trained on a dataset which does not have the voices of public figures. While it will attempt to mimic these voices if they are provided as references, it does not do so in such a way that most humans would be fooled.
+ 3. The above points could likely be resolved by scaling up the model and the dataset. For this reason, I am currently withholding details on how I trained the model, pending community feedback.
+ 4. I am releasing a separate classifier model which will tell you whether a given audio clip was generated by Tortoise or not. See `tortoise-detect` above.
+ 5. If I, a tinkerer with a BS in computer science and a ~$15k computer, can build this, then any motivated corporation or state can as well. I would prefer that it be in the open and everyone know the kinds of things ML can do.
+
+ ### Diversity
+
+ The diversity expressed by ML models is strongly tied to the datasets they were trained on.
+
+ Tortoise was trained primarily on a dataset consisting of audiobooks. I made no effort to
+ balance diversity in this dataset. For this reason, Tortoise will be particularly poor at generating the voices of minorities
+ or of people who speak with strong accents.

  ## Looking forward

+ Tortoise v2 is about as good as I think I can do in the TTS world with the resources I have access to. A phenomenon that happens when
+ training very large models is that as parameter count increases, the communication bandwidth needed to support distributed training
+ of the model increases multiplicatively. On enterprise-grade hardware, this is not an issue: GPUs are attached together with
+ exceptionally wide buses that can accommodate this bandwidth. I cannot afford enterprise hardware, though, so I am stuck.
+
+ I want to mention here
+ that I think Tortoise could be a **lot** better. The three major components of Tortoise are either vanilla Transformer Encoder stacks
+ or Decoder stacks. Both of these types of models have a rich experimental history with scaling in the NLP realm. I see no reason
+ to believe that the same is not true of TTS.
+
+ The largest model in Tortoise v2 is considerably smaller than GPT-2 large. It is 20x smaller than the original DALLE transformer.
+ Imagine what a TTS model trained at or near GPT-3 or DALLE scale could achieve.
+
+ If you are an ethical organization with computational resources to spare and are interested in seeing what this model could do
+ if properly scaled out, please reach out to me! I would love to collaborate on this.
+
+ ## Notice
+
+ Tortoise was built entirely by me using my own hardware. My employer was not involved in any facet of Tortoise's development.
+
+ If you use this repo or the ideas therein for your research, please cite it! A bibtex entry can be found in the right pane on GitHub.
api.py CHANGED
@@ -6,6 +6,7 @@ from urllib import request
  import torch
  import torch.nn.functional as F
  import progressbar

  from models.cvvp import CVVP
  from models.diffusion_decoder import DiffusionTts
@@ -21,13 +22,19 @@ from utils.tokenizer import VoiceBpeTokenizer, lev_distance


  pbar = None
  def download_models():
      MODELS = {
-         'autoregressive.pth': 'https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/autoregressive.pth',
-         'clvp.pth': 'https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/clip.pth',
-         'cvvp.pth': 'https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/cvvp.pth',
-         'diffusion_decoder.pth': 'https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/diffusion_decoder.pth',
-         'vocoder.pth': 'https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/vocoder.pth',
      }
      os.makedirs('.models', exist_ok=True)
      def show_progress(block_num, block_size, total_size):
@@ -51,6 +58,9 @@ def download_models():


  def pad_or_truncate(t, length):
      if t.shape[-1] == length:
          return t
      elif t.shape[-1] < length:
@@ -68,7 +78,10 @@ def load_discrete_vocoder_diffuser(trained_diffusion_steps=4000, desired_diffusi
                             conditioning_free=cond_free, conditioning_free_k=cond_free_k)


- def load_conditioning(clip, cond_length=132300):
      gap = clip.shape[-1] - cond_length
      if gap < 0:
          clip = F.pad(clip, pad=(0, abs(gap)))
@@ -79,29 +92,6 @@ def load_conditioning(clip, cond_length=132300):
      return mel_clip.unsqueeze(0).cuda()


- def clip_guided_generation(autoregressive_model, clip_model, conditioning_input, text_input, num_batches, stop_mel_token,
-                            tokens_per_clip_inference=10, clip_results_to_reduce_to=8, **generation_kwargs):
-     """
-     Uses a CLVP model trained to associate full text with **partial** audio clips to pick the best generation candidates
-     every few iterations. The top results are then propagated forward through the generation process. Rinse and repeat.
-     This is a hybrid between beam search and sampling.
-     """
-     token_goal = tokens_per_clip_inference
-     finished = False
-     while not finished and token_goal < autoregressive_model.max_mel_tokens:
-         samples = []
-         for b in tqdm(range(num_batches)):
-             codes = autoregressive_model.inference_speech(conditioning_input, text_input, **generation_kwargs)
-             samples.append(codes)
-         for batch in samples:
-             for i in range(batch.shape[0]):
-                 batch[i] = fix_autoregressive_output(batch[i], stop_mel_token, complain=False)
-             clip_results.append(clip_model(text_input.repeat(batch.shape[0], 1), batch, return_loss=False))
-         clip_results = torch.cat(clip_results, dim=0)
-         samples = torch.cat(samples, dim=0)
-         best_results = samples[torch.topk(clip_results, k=clip_results_to_reduce_to).indices]
-
-
  def fix_autoregressive_output(codes, stop_token, complain=True):
      """
      This function performs some padding on coded audio that fixes a mismatch issue between what the diffusion model was
@@ -130,29 +120,37 @@ def fix_autoregressive_output(codes, stop_token, complain=True):
      return codes


- def do_spectrogram_diffusion(diffusion_model, diffuser, mel_codes, conditioning_samples, temperature=1):
      """
      Uses the specified diffusion model to convert discrete codes into a spectrogram.
      """
      with torch.no_grad():
          cond_mels = []
          for sample in conditioning_samples:
              sample = pad_or_truncate(sample, 102400)
-             cond_mel = wav_to_univnet_mel(sample.to(mel_codes.device), do_normalization=False)
              cond_mels.append(cond_mel)
          cond_mels = torch.stack(cond_mels, dim=1)

-         output_seq_len = mel_codes.shape[1]*4*24000//22050 # This diffusion model converts from 22kHz spectrogram codes to a 24kHz spectrogram signal.
-         output_shape = (mel_codes.shape[0], 100, output_seq_len)
-         precomputed_embeddings = diffusion_model.timestep_independent(mel_codes, cond_mels, output_seq_len, False)

-         noise = torch.randn(output_shape, device=mel_codes.device) * temperature
          mel = diffuser.p_sample_loop(diffusion_model, output_shape, noise=noise,
-                                      model_kwargs={'precomputed_aligned_embeddings': precomputed_embeddings})
          return denormalize_tacotron_mel(mel)[:,:,:output_seq_len]


  class TextToSpeech:
      def __init__(self, autoregressive_batch_size=16):
          self.autoregressive_batch_size = autoregressive_batch_size
          self.tokenizer = VoiceBpeTokenizer()
@@ -207,14 +205,59 @@ class TextToSpeech:
          kwargs.update(presets[preset])
          return self.tts(text, voice_samples, **kwargs)

-     def tts(self, text, voice_samples, k=1,
              # autoregressive generation parameters follow
              num_autoregressive_samples=512, temperature=.8, length_penalty=1, repetition_penalty=2.0, top_p=.8, max_mel_tokens=500,
              # CLVP & CVVP parameters
              clvp_cvvp_slider=.5,
              # diffusion generation parameters follow
              diffusion_iterations=100, cond_free=True, cond_free_k=2, diffusion_temperature=1.0,
              **hf_generate_kwargs):
          text = torch.IntTensor(self.tokenizer.encode(text)).unsqueeze(0).cuda()
          text = F.pad(text, (0, 1)) # This may not be necessary.

@@ -222,7 +265,7 @@ class TextToSpeech:
          if not isinstance(voice_samples, list):
              voice_samples = [voice_samples]
          for vs in voice_samples:
-             conds.append(load_conditioning(vs))
          conds = torch.stack(conds, dim=1)

          diffuser = load_discrete_vocoder_diffuser(desired_diffusion_steps=diffusion_iterations, cond_free=cond_free, cond_free_k=cond_free_k)
@@ -233,7 +276,9 @@ class TextToSpeech:
              stop_mel_token = self.autoregressive.stop_mel_token
              calm_token = 83 # This is the token for coding silence, which is fixed in place with "fix_autoregressive_output"
              self.autoregressive = self.autoregressive.cuda()
-             for b in tqdm(range(num_batches)):
                  codes = self.autoregressive.inference_speech(conds, text,
                                                               do_sample=True,
                                                               top_p=top_p,
@@ -251,7 +296,9 @@ class TextToSpeech:
              clip_results = []
              self.clvp = self.clvp.cuda()
              self.cvvp = self.cvvp.cuda()
-             for batch in samples:
                  for i in range(batch.shape[0]):
                      batch[i] = fix_autoregressive_output(batch[i], stop_mel_token)
                  clvp = self.clvp(text.repeat(batch.shape[0], 1), batch, return_loss=False)
@@ -276,7 +323,8 @@ class TextToSpeech:
                                                 return_latent=True, clip_inputs=False)
              self.autoregressive = self.autoregressive.cpu()

-             print("Performing vocoding..")
              wav_candidates = []
              self.diffusion = self.diffusion.cuda()
              self.vocoder = self.vocoder.cuda()
@@ -295,7 +343,7 @@ class TextToSpeech:
                          latents = latents[:, :k]
                          break

-                 mel = do_spectrogram_diffusion(self.diffusion, diffuser, latents, voice_samples, temperature=diffusion_temperature)
                  wav = self.vocoder.inference(mel)
                  wav_candidates.append(wav.cpu())
              self.diffusion = self.diffusion.cpu()
 
  import torch
  import torch.nn.functional as F
  import progressbar
+ import torchaudio

  from models.cvvp import CVVP
  from models.diffusion_decoder import DiffusionTts
 


  pbar = None
+
+
  def download_models():
+     """
+     Call to download all the models that Tortoise uses.
+     """
      MODELS = {
+         'autoregressive.pth': 'https://huggingface.co/jbetker/tortoise-tts-v2/resolve/hf/.models/autoregressive.pth',
+         'classifier.pth': 'https://huggingface.co/jbetker/tortoise-tts-v2/resolve/hf/.models/classifier.pth',
+         'clvp.pth': 'https://huggingface.co/jbetker/tortoise-tts-v2/resolve/hf/.models/clvp.pth',
+         'cvvp.pth': 'https://huggingface.co/jbetker/tortoise-tts-v2/resolve/hf/.models/cvvp.pth',
+         'diffusion_decoder.pth': 'https://huggingface.co/jbetker/tortoise-tts-v2/resolve/hf/.models/diffusion_decoder.pth',
+         'vocoder.pth': 'https://huggingface.co/jbetker/tortoise-tts-v2/resolve/hf/.models/vocoder.pth',
      }
      os.makedirs('.models', exist_ok=True)
      def show_progress(block_num, block_size, total_size):
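This hunk only shows `download_models()` creating `.models` and defining `show_progress(block_num, block_size, total_size)`. That signature matches the `reporthook` callback of `urllib.request.urlretrieve` (the hunk header above shows `from urllib import request`), which is called with a block count, a block size, and the total size (which can be -1 when unknown). A minimal sketch of that pattern follows; skipping files that already exist and the names `missing_models`, `make_progress_hook` and `download_missing` are assumptions, not the body of `download_models()`.

```python
import os
import urllib.request
from typing import Callable, Dict, List


def missing_models(models: Dict[str, str], dest: str = ".models") -> List[str]:
    # Names from `models` whose files are not yet present in `dest`.
    os.makedirs(dest, exist_ok=True)
    return [name for name in models if not os.path.exists(os.path.join(dest, name))]


def make_progress_hook(report: Callable[[int], None]) -> Callable[[int, int, int], None]:
    # urlretrieve calls hook(block_num, block_size, total_size) once when the
    # connection opens and once after each block; `report` gets whole percents.
    def hook(block_num: int, block_size: int, total_size: int) -> None:
        if total_size <= 0:  # size unknown, nothing meaningful to report
            return
        downloaded = min(block_num * block_size, total_size)
        report(downloaded * 100 // total_size)
    return hook


def download_missing(models: Dict[str, str], dest: str = ".models") -> None:
    # Fetch only the files that are not already on disk.
    for name in missing_models(models, dest):
        print(f"Downloading {name}...")
        hook = make_progress_hook(lambda pct: print(f"\r{pct}%", end="", flush=True))
        urllib.request.urlretrieve(models[name], os.path.join(dest, name), reporthook=hook)
        print()
```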
 


  def pad_or_truncate(t, length):
+     """
+     Utility function for forcing <t> to have the specified sequence length, whether by clipping it or padding it with 0s.
+     """
      if t.shape[-1] == length:
          return t
      elif t.shape[-1] < length:
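The docstring added here describes `pad_or_truncate()` as forcing a sequence to an exact length by clipping it or padding it with zeros; the hunk is cut off before the padding and clipping branches. A plain-list sketch of that contract follows, assuming both padding and clipping happen at the end of the sequence. The sample counts used elsewhere in this diff are easy to sanity-check with it: `format_conditioning()` pads clips shorter than `cond_length=132300` samples (6 s at 22,050 Hz), and `do_spectrogram_diffusion()` uses a 102,400-sample window at 24 kHz (about 4.27 s).

```python
from typing import List, Sequence


def pad_or_truncate(t: Sequence[float], length: int) -> List[float]:
    # Plain-list analogue; assumes padding and clipping both act on the end.
    if length < 0:
        raise ValueError("length must be non-negative")
    if len(t) >= length:
        return list(t[:length])                   # clip (unchanged if equal)
    return list(t) + [0.0] * (length - len(t))    # zero-pad on the right


def seconds(num_samples: int, sample_rate: int) -> float:
    # Duration of a clip with the given number of samples.
    return num_samples / sample_rate
```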
 
                             conditioning_free=cond_free, conditioning_free_k=cond_free_k)


+ def format_conditioning(clip, cond_length=132300):
+     """
+     Converts the given conditioning signal to a MEL spectrogram and clips it as expected by the models.
+     """
      gap = clip.shape[-1] - cond_length
      if gap < 0:
          clip = F.pad(clip, pad=(0, abs(gap)))
 
      return mel_clip.unsqueeze(0).cuda()


  def fix_autoregressive_output(codes, stop_token, complain=True):
      """
      This function performs some padding on coded audio that fixes a mismatch issue between what the diffusion model was
 
      return codes


+ def do_spectrogram_diffusion(diffusion_model, diffuser, latents, conditioning_samples, temperature=1, verbose=True):
      """
      Uses the specified diffusion model to convert discrete codes into a spectrogram.
      """
      with torch.no_grad():
          cond_mels = []
          for sample in conditioning_samples:
+             # The diffuser operates at a sample rate of 24000 (except for the latent inputs)
+             sample = torchaudio.functional.resample(sample, 22050, 24000)
              sample = pad_or_truncate(sample, 102400)
+             cond_mel = wav_to_univnet_mel(sample.to(latents.device), do_normalization=False)
              cond_mels.append(cond_mel)
          cond_mels = torch.stack(cond_mels, dim=1)

+         output_seq_len = latents.shape[1] * 4 * 24000 // 22050 # This diffusion model converts from 22kHz spectrogram codes to a 24kHz spectrogram signal.
+         output_shape = (latents.shape[0], 100, output_seq_len)
+         precomputed_embeddings = diffusion_model.timestep_independent(latents, cond_mels, output_seq_len, False)

+         noise = torch.randn(output_shape, device=latents.device) * temperature
          mel = diffuser.p_sample_loop(diffusion_model, output_shape, noise=noise,
+                                      model_kwargs={'precomputed_aligned_embeddings': precomputed_embeddings},
+                                      progress=verbose)
          return denormalize_tacotron_mel(mel)[:,:,:output_seq_len]
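Two lines in this hunk are plain arithmetic worth checking by hand. `output_seq_len = latents.shape[1] * 4 * 24000 // 22050` multiplies the latent length by 4 and rescales it by 24000/22050 (the code comment describes this as converting 22 kHz spectrogram codes to a 24 kHz spectrogram signal), and `noise = torch.randn(...) * temperature` scales standard Gaussian noise by the diffusion temperature. The sketch below reproduces both with the standard library, with `random.Random.gauss` standing in for `torch.randn`; the function names are hypothetical. It also encodes the `tts()` docstring's rule of thumb that each `max_mel_tokens` unit is about 1/20 of a second.

```python
import random
from typing import List


def diffusion_output_frames(num_latent_frames: int) -> int:
    # Same integer arithmetic as output_seq_len in do_spectrogram_diffusion().
    return num_latent_frames * 4 * 24000 // 22050


def scaled_noise(n: int, temperature: float, seed: int = 0) -> List[float]:
    # Standard normal noise scaled by the temperature; at 0 it is all zeros.
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) * temperature for _ in range(n)]


def approx_seconds(max_mel_tokens: int) -> float:
    # Docstring rule of thumb: each mel token is roughly 1/20 of a second.
    return max_mel_tokens / 20
```

With the default `max_mel_tokens=500`, this puts the output cap at roughly 25 seconds of speech.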


  class TextToSpeech:
+     """
+     Main entry point into Tortoise.
+     :param autoregressive_batch_size: Specifies how many samples to generate per batch. Lower this if you are seeing
+         GPU OOM errors. Larger numbers generate slightly faster.
+     """
      def __init__(self, autoregressive_batch_size=16):
          self.autoregressive_batch_size = autoregressive_batch_size
          self.tokenizer = VoiceBpeTokenizer()
 
205
  kwargs.update(presets[preset])
206
  return self.tts(text, voice_samples, **kwargs)
207
 
208
+ def tts(self, text, voice_samples, k=1, verbose=True,
209
  # autoregressive generation parameters follow
210
  num_autoregressive_samples=512, temperature=.8, length_penalty=1, repetition_penalty=2.0, top_p=.8, max_mel_tokens=500,
211
+ typical_sampling=False, typical_mass=.9,
212
  # CLVP & CVVP parameters
213
  clvp_cvvp_slider=.5,
214
  # diffusion generation parameters follow
215
  diffusion_iterations=100, cond_free=True, cond_free_k=2, diffusion_temperature=1.0,
216
  **hf_generate_kwargs):
217
+ """
218
+ Produces an audio clip of the given text being spoken with the given reference voice.
219
+ :param text: Text to be spoken.
220
+ :param voice_samples: List of 2 or more ~10 second reference clips which should be torch tensors containing 22.05kHz waveform data.
221
+ :param k: The number of returned clips. The most likely (as determined by Tortoises' CLVP and CVVP models) clips are returned.
222
+ :param verbose: Whether or not to print log messages indicating the progress of creating a clip. Default=true.
223
+ ~~AUTOREGRESSIVE KNOBS~~
224
+ :param num_autoregressive_samples: Number of samples taken from the autoregressive model, all of which are filtered using CLVP+CVVP.
225
+ As Tortoise is a probabilistic model, more samples means a higher probability of creating something "great".
226
+ :param temperature: The softmax temperature of the autoregressive model.
227
+ :param length_penalty: A length penalty applied to the autoregressive decoder. Higher settings causes the model to produce more terse outputs.
228
+ :param repetition_penalty: A penalty that prevents the autoregressive decoder from repeating itself during decoding. Can be used to reduce the incidence
229
+ of long silences or "uhhhhhhs", etc.
230
+ :param top_p: P value used in nucleus sampling. (0,1]. Lower values mean the decoder produces more "likely" (aka boring) outputs.
231
+ :param max_mel_tokens: Restricts the output length. (0,600] integer. Each unit is 1/20 of a second.
232
+ :param typical_sampling: Turns typical sampling on or off. This sampling mode is discussed in this paper: https://arxiv.org/abs/2202.00666
233
+ I was interested in the premise, but the results were not as good as I was hoping. This is off by default, but
234
+ could use some tuning.
235
+ :param typical_mass: The typical_mass parameter from the typical_sampling algorithm.
236
+ ~~CLVP-CVVP KNOBS~~
237
+ :param clvp_cvvp_slider: Controls the influence of the CLVP and CVVP models in selecting the best output from the autoregressive model.
238
+ [0,1]. Values closer to 1 will cause Tortoise to emit clips that follow the text more. Values closer to
239
+ 0 will cause Tortoise to emit clips that more closely follow the reference clip (e.g. the voice sounds more
240
+ similar).
241
+ ~~DIFFUSION KNOBS~~
242
+ :param diffusion_iterations: Number of diffusion steps to perform. [0,4000]. More steps means the network has more chances to iteratively refine
243
+ the output, which should theoretically mean a higher quality output. Generally a value above 250 is not noticeably better,
244
+ however.
245
+ :param cond_free: Whether or not to perform conditioning-free diffusion. Conditioning-free diffusion performs two forward passes for
246
+ each diffusion step: one with the outputs of the autoregressive model and one with no conditioning priors. The output
247
+ of the two is blended according to the cond_free_k value below. Conditioning-free diffusion is the real deal, and
248
+ dramatically improves realism.
249
+ :param cond_free_k: Knob that determines how to balance the conditioning free signal with the conditioning-present signal. [0,inf].
250
+ As cond_free_k increases, the output becomes dominated by the conditioning-free signal.
251
+ Formula is: output=cond_present_output*(cond_free_k+1)-cond_absenct_output*cond_free_k
252
+ :param diffusion_temperature: Controls the variance of the noise fed into the diffusion model. [0,1]. Values at 0
253
+ are the "mean" prediction of the diffusion network and will sound bland and smeared.
254
+ ~~OTHER STUFF~~
255
+ :param hf_generate_kwargs: The huggingface Transformers generate API is used for the autoregressive transformer.
256
+ Extra keyword args fed to this function get forwarded directly to that API. Documentation
257
+ here: https://huggingface.co/docs/transformers/internal/generation_utils
258
+ :return: Generated audio clip(s) as a torch tensor. Shape (1,S) if k=1, else (k,1,S), where S is the sample length.
259
+ Sample rate is 24kHz.
260
+ """
261
  text = torch.IntTensor(self.tokenizer.encode(text)).unsqueeze(0).cuda()
262
  text = F.pad(text, (0, 1)) # This may not be necessary.
263
 
 
265
  if not isinstance(voice_samples, list):
266
  voice_samples = [voice_samples]
267
  for vs in voice_samples:
268
+ conds.append(format_conditioning(vs))
269
  conds = torch.stack(conds, dim=1)
270
 
271
  diffuser = load_discrete_vocoder_diffuser(desired_diffusion_steps=diffusion_iterations, cond_free=cond_free, cond_free_k=cond_free_k)
 
276
  stop_mel_token = self.autoregressive.stop_mel_token
277
  calm_token = 83 # This is the token for coding silence, which is fixed in place with "fix_autoregressive_output"
278
  self.autoregressive = self.autoregressive.cuda()
279
+ if verbose:
280
+ print("Generating autoregressive samples..")
281
+ for b in tqdm(range(num_batches), disable=not verbose):
282
  codes = self.autoregressive.inference_speech(conds, text,
283
  do_sample=True,
284
  top_p=top_p,
 
296
  clip_results = []
297
  self.clvp = self.clvp.cuda()
298
  self.cvvp = self.cvvp.cuda()
299
+ if verbose:
300
+ print("Computing best candidates using CLVP and CVVP")
301
+ for batch in tqdm(samples, disable=not verbose):
302
  for i in range(batch.shape[0]):
303
  batch[i] = fix_autoregressive_output(batch[i], stop_mel_token)
304
  clvp = self.clvp(text.repeat(batch.shape[0], 1), batch, return_loss=False)
 
323
  return_latent=True, clip_inputs=False)
324
  self.autoregressive = self.autoregressive.cpu()
325
 
326
+ if verbose:
327
+ print("Transforming autoregressive outputs into audio..")
328
  wav_candidates = []
329
  self.diffusion = self.diffusion.cuda()
330
  self.vocoder = self.vocoder.cuda()
 
343
  latents = latents[:, :k]
344
  break
345
 
346
+ mel = do_spectrogram_diffusion(self.diffusion, diffuser, latents, voice_samples, temperature=diffusion_temperature, verbose=verbose)
347
  wav = self.vocoder.inference(mel)
348
  wav_candidates.append(wav.cpu())
349
  self.diffusion = self.diffusion.cpu()
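The `cond_free_k` docstring in this hunk gives the conditioning-free blend as `output=cond_present_output*(cond_free_k+1)-cond_absent_output*cond_free_k`, which rearranges to `cond_present_output + cond_free_k*(cond_present_output - cond_absent_output)`: each increase in `cond_free_k` pushes the output further along the difference between the two forward passes. The sketch below applies that formula element-wise to plain Python lists standing in for the two diffusion outputs, and converts `max_mel_tokens` to seconds using the stated 1/20 s per token. Both helper names (`blend_cond_free`, `mel_tokens_to_seconds`) are hypothetical and not part of `api.py`.

```python
from typing import List


def blend_cond_free(cond_present: List[float], cond_absent: List[float], cond_free_k: float) -> List[float]:
    """Element-wise blend from the docstring (hypothetical helper):
    output = cond_present * (cond_free_k + 1) - cond_absent * cond_free_k
    """
    if len(cond_present) != len(cond_absent):
        raise ValueError("both diffusion outputs must have the same length")
    return [p * (cond_free_k + 1) - a * cond_free_k for p, a in zip(cond_present, cond_absent)]


def mel_tokens_to_seconds(num_tokens: int) -> float:
    """Each autoregressive mel token covers 1/20 of a second, per the max_mel_tokens docstring."""
    return num_tokens / 20


# k=0 returns the conditioned output unchanged; larger k extrapolates away from the unconditioned one.
print(blend_cond_free([1.0, 0.5], [0.5, 0.5], 0))  # [1.0, 0.5]
print(blend_cond_free([1.0, 0.5], [0.5, 0.5], 2))  # [2.0, 0.5]
print(mel_tokens_to_seconds(600))  # 30.0, the upper bound of max_mel_tokens
```

Where both passes agree (the second element), the blend leaves the value alone regardless of `cond_free_k`; only the disagreement is amplified.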
do_tts.py CHANGED
@@ -10,7 +10,7 @@ if __name__ == '__main__':
10
  parser = argparse.ArgumentParser()
11
  parser.add_argument('--text', type=str, help='Text to speak.', default="I am a language model that has learned to speak.")
12
  parser.add_argument('--voice', type=str, help='Selects the voice to use for generation. See options in voices/ directory (and add your own!) '
13
- 'Use the & character to join two voices together. Use a comma to perform inference on multiple voices.', default='patrick_stewart')
14
  parser.add_argument('--preset', type=str, help='Which voice preset to use.', default='standard')
15
  parser.add_argument('--voice_diversity_intelligibility_slider', type=float,
16
  help='How to balance vocal diversity with the quality/intelligibility of the spoken text. 0 means highly diverse voice (not recommended), 1 means maximize intelligibility',
 
10
  parser = argparse.ArgumentParser()
11
  parser.add_argument('--text', type=str, help='Text to speak.', default="I am a language model that has learned to speak.")
12
  parser.add_argument('--voice', type=str, help='Selects the voice to use for generation. See options in voices/ directory (and add your own!) '
13
+ 'Use the & character to join two voices together. Use a comma to perform inference on multiple voices.', default='pat')
14
  parser.add_argument('--preset', type=str, help='Which voice preset to use.', default='standard')
15
  parser.add_argument('--voice_diversity_intelligibility_slider', type=float,
16
  help='How to balance vocal diversity with the quality/intelligibility of the spoken text. 0 means highly diverse voice (not recommended), 1 means maximize intelligibility',
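The `--voice` help text describes a small grammar: `&` joins several voices into one conditioning set, and `,` separates voices that should each get their own inference run. The diff only shows `args.voice.split(',')` (in `read.py`), so the following is a sketch of that grammar under the assumption that surrounding whitespace should be ignored; `parse_voice_arg` is a hypothetical helper, not the script's actual code.

```python
from typing import List


def parse_voice_arg(voice_arg: str) -> List[List[str]]:
    """Split a --voice value into inference runs (',') of joined voices ('&')."""
    runs = []
    for run in voice_arg.split(','):
        # Each run is one generation; its members are combined into one conditioning set.
        members = [name.strip() for name in run.split('&') if name.strip()]
        if members:
            runs.append(members)
    return runs


print(parse_voice_arg('pat'))               # [['pat']]
print(parse_voice_arg('pat&william,emma'))  # [['pat', 'william'], ['emma']]
```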
models/autoregressive.py CHANGED
@@ -356,7 +356,7 @@ class UnifiedVoice(nn.Module):
356
  preformatting to create a working TTS model.
357
  """
358
  # Set padding areas within MEL (currently it is coded with the MEL code for <zero>).
359
- mel_lengths = wav_lengths // self.mel_length_compression
360
  for b in range(len(mel_lengths)):
361
  actual_end = mel_lengths[b] + 1 # Due to the convolutional nature of how these tokens are generated, it would be best if the model predicts a token past the actual last token.
362
  if actual_end < mel_input_tokens.shape[-1]:
 
356
  preformatting to create a working TTS model.
357
  """
358
  # Set padding areas within MEL (currently it is coded with the MEL code for <zero>).
359
+ mel_lengths = torch.div(wav_lengths, self.mel_length_compression, rounding_mode='trunc')
360
  for b in range(len(mel_lengths)):
361
  actual_end = mel_lengths[b] + 1 # Due to the convolutional nature of how these tokens are generated, it would be best if the model predicts a token past the actual last token.
362
  if actual_end < mel_input_tokens.shape[-1]:
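This hunk replaces `wav_lengths // self.mel_length_compression` with `torch.div(..., rounding_mode='trunc')`. Truncation rounds toward zero and floor division rounds toward negative infinity, so the two only disagree when the quotient is negative; waveform lengths are non-negative, so the computed mel lengths should be unchanged. The explicit rounding mode (available since PyTorch 1.8) is most likely there to silence the deprecation warning PyTorch emitted for `//` on tensors at the time. The pure-Python sketch below, with a hypothetical `trunc_div` helper and an illustrative compression factor of 1024, shows the difference without requiring PyTorch.

```python
def trunc_div(a: int, b: int) -> int:
    """Integer division that rounds toward zero, like rounding_mode='trunc'."""
    if b == 0:
        raise ZeroDivisionError("division by zero")
    q = abs(a) // abs(b)
    # Negate the magnitude when exactly one operand is negative.
    return q if (a >= 0) == (b > 0) else -q


# Hypothetical compression factor and waveform lengths, for illustration only.
mel_length_compression = 1024
wav_lengths = [44100, 66150]
print([trunc_div(n, mel_length_compression) for n in wav_lengths])  # [43, 64]
print([n // mel_length_compression for n in wav_lengths])           # [43, 64]
print(trunc_div(-7, 2), -7 // 2)                                     # -3 -4
```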
models/classifier.py ADDED
@@ -0,0 +1,153 @@
1
+ import torch
2
+
3
+
4
+ class ResBlock(nn.Module):
5
+ def __init__(
6
+ self,
7
+ channels,
8
+ dropout,
9
+ out_channels=None,
10
+ use_conv=False,
11
+ use_scale_shift_norm=False,
12
+ dims=2,
13
+ up=False,
14
+ down=False,
15
+ kernel_size=3,
16
+ do_checkpoint=True,
17
+ ):
18
+ super().__init__()
19
+ self.channels = channels
20
+ self.dropout = dropout
21
+ self.out_channels = out_channels or channels
22
+ self.use_conv = use_conv
23
+ self.use_scale_shift_norm = use_scale_shift_norm
24
+ self.do_checkpoint = do_checkpoint
25
+ padding = 1 if kernel_size == 3 else 2
26
+
27
+ self.in_layers = nn.Sequential(
28
+ normalization(channels),
29
+ nn.SiLU(),
30
+ conv_nd(dims, channels, self.out_channels, kernel_size, padding=padding),
31
+ )
32
+
33
+ self.updown = up or down
34
+
35
+ if up:
36
+ self.h_upd = Upsample(channels, False, dims)
37
+ self.x_upd = Upsample(channels, False, dims)
38
+ elif down:
39
+ self.h_upd = Downsample(channels, False, dims)
40
+ self.x_upd = Downsample(channels, False, dims)
41
+ else:
42
+ self.h_upd = self.x_upd = nn.Identity()
43
+
44
+ self.out_layers = nn.Sequential(
45
+ normalization(self.out_channels),
46
+ nn.SiLU(),
47
+ nn.Dropout(p=dropout),
48
+ zero_module(
49
+ conv_nd(dims, self.out_channels, self.out_channels, kernel_size, padding=padding)
50
+ ),
51
+ )
52
+
53
+ if self.out_channels == channels:
54
+ self.skip_connection = nn.Identity()
55
+ elif use_conv:
56
+ self.skip_connection = conv_nd(
57
+ dims, channels, self.out_channels, kernel_size, padding=padding
58
+ )
59
+ else:
60
+ self.skip_connection = conv_nd(dims, channels, self.out_channels, 1)
61
+
62
+ def forward(self, x):
63
+ if self.do_checkpoint:
64
+ return checkpoint(
65
+ self._forward, x
66
+ )
67
+ else:
68
+ return self._forward(x)
69
+
70
+ def _forward(self, x):
71
+ if self.updown:
72
+ in_rest, in_conv = self.in_layers[:-1], self.in_layers[-1]
73
+ h = in_rest(x)
74
+ h = self.h_upd(h)
75
+ x = self.x_upd(x)
76
+ h = in_conv(h)
77
+ else:
78
+ h = self.in_layers(x)
79
+ h = self.out_layers(h)
80
+ return self.skip_connection(x) + h
81
+
82
+
83
+ class AudioMiniEncoder(nn.Module):
84
+ def __init__(self,
85
+ spec_dim,
86
+ embedding_dim,
87
+ base_channels=128,
88
+ depth=2,
89
+ resnet_blocks=2,
90
+ attn_blocks=4,
91
+ num_attn_heads=4,
92
+ dropout=0,
93
+ downsample_factor=2,
94
+ kernel_size=3):
95
+ super().__init__()
96
+ self.init = nn.Sequential(
97
+ conv_nd(1, spec_dim, base_channels, 3, padding=1)
98
+ )
99
+ ch = base_channels
100
+ res = []
101
+ self.layers = depth
102
+ for l in range(depth):
103
+ for r in range(resnet_blocks):
104
+ res.append(ResBlock(ch, dropout, dims=1, do_checkpoint=False, kernel_size=kernel_size))
105
+ res.append(Downsample(ch, use_conv=True, dims=1, out_channels=ch*2, factor=downsample_factor))
106
+ ch *= 2
107
+ self.res = nn.Sequential(*res)
108
+ self.final = nn.Sequential(
109
+ normalization(ch),
110
+ nn.SiLU(),
111
+ conv_nd(1, ch, embedding_dim, 1)
112
+ )
113
+ attn = []
114
+ for a in range(attn_blocks):
115
+ attn.append(AttentionBlock(embedding_dim, num_attn_heads, do_checkpoint=False))
116
+ self.attn = nn.Sequential(*attn)
117
+ self.dim = embedding_dim
118
+
119
+ def forward(self, x):
120
+ h = self.init(x)
121
+ h = sequential_checkpoint(self.res, self.layers, h)
122
+ h = self.final(h)
123
+ for blk in self.attn:
124
+ h = checkpoint(blk, h)
125
+ return h[:, :, 0]
126
+
127
+
128
+ class AudioMiniEncoderWithClassifierHead(nn.Module):
129
+ def __init__(self, classes, distribute_zero_label=True, **kwargs):
130
+ super().__init__()
131
+ self.enc = AudioMiniEncoder(**kwargs)
132
+ self.head = nn.Linear(self.enc.dim, classes)
133
+ self.num_classes = classes
134
+ self.distribute_zero_label = distribute_zero_label
135
+
136
+ def forward(self, x, labels=None):
137
+ h = self.enc(x)
138
+ logits = self.head(h)
139
+ if labels is None:
140
+ return logits
141
+ else:
142
+ if self.distribute_zero_label:
143
+ oh_labels = nn.functional.one_hot(labels, num_classes=self.num_classes)
144
+ zeros_indices = (labels == 0).unsqueeze(-1)
145
+ # When the label is zero, move 20% of the probability mass from class 0 evenly onto the other classes, to compensate for dataset noise.
146
+ zero_extra_mass = torch.full_like(oh_labels, dtype=torch.float, fill_value=.2/(self.num_classes-1))
147
+ zero_extra_mass[:, 0] = -.2
148
+ zero_extra_mass = zero_extra_mass * zeros_indices
149
+ oh_labels = oh_labels + zero_extra_mass
150
+ else:
151
+ oh_labels = labels
152
+ loss = nn.functional.cross_entropy(logits, oh_labels)
153
+ return loss
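With `distribute_zero_label` enabled, `AudioMiniEncoderWithClassifierHead` turns a label of 0 into a soft target: class 0 keeps 0.8 of the probability mass and the remaining 0.2 is split evenly across the other classes, while every other label stays one-hot. Passing such probability targets to `nn.functional.cross_entropy` requires PyTorch 1.10 or later. Below is a dependency-free sketch of the target construction and the matching soft cross-entropy; `soft_target` and `soft_cross_entropy` are hypothetical names, not functions in the repo.

```python
import math
from typing import List


def soft_target(label: int, num_classes: int, zero_mass: float = 0.2) -> List[float]:
    """Mirror the classifier's target: label 0 gives up `zero_mass` to the other classes."""
    if label == 0:
        spread = zero_mass / (num_classes - 1)
        target = [spread] * num_classes
        target[0] = 1.0 - zero_mass
        return target
    target = [0.0] * num_classes
    target[label] = 1.0
    return target


def soft_cross_entropy(logits: List[float], target: List[float]) -> float:
    """-sum(target * log_softmax(logits)), computed stably via log-sum-exp."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return -sum(t * (x - log_z) for t, x in zip(target, logits))


print(soft_target(0, 5))  # class 0 keeps 0.8, each of the other four classes gets 0.05
print(soft_target(3, 5))  # [0.0, 0.0, 0.0, 1.0, 0.0]
```

For a one-hot target, `soft_cross_entropy` reduces to ordinary cross-entropy on the labelled class.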
models/xtransformers.py CHANGED
@@ -13,8 +13,6 @@ from einops.layers.torch import Rearrange
13
  from entmax import entmax15
14
  from torch.utils.checkpoint import checkpoint
15
 
16
- from x_transformers.autoregressive_wrapper import AutoregressiveWrapper
17
-
18
  DEFAULT_DIM_HEAD = 64
19
 
20
  Intermediates = namedtuple('Intermediates', [
 
13
  from entmax import entmax15
14
  from torch.utils.checkpoint import checkpoint
15
 
 
 
16
  DEFAULT_DIM_HEAD = 64
17
 
18
  Intermediates = namedtuple('Intermediates', [
read.py CHANGED
@@ -5,10 +5,11 @@ import torch
5
  import torch.nn.functional as F
6
  import torchaudio
7
 
8
- from api import TextToSpeech, load_conditioning
9
  from utils.audio import load_audio, get_voices
10
  from utils.tokenizer import VoiceBpeTokenizer
11
 
 
12
  def split_and_recombine_text(texts, desired_length=200, max_len=300):
13
  # TODO: also split across '!' and '?'. Attempt to keep quotations together.
14
  texts = [s.strip() + "." for s in texts.split('.')]
@@ -26,13 +27,15 @@ def split_and_recombine_text(texts, desired_length=200, max_len=300):
26
  texts.pop(i+1)
27
  return texts
28
 
 
29
  if __name__ == '__main__':
30
  parser = argparse.ArgumentParser()
31
  parser.add_argument('--textfile', type=str, help='A file containing the text to read.', default="data/riding_hood.txt")
32
  parser.add_argument('--voice', type=str, help='Selects the voice to use for generation. See options in voices/ directory (and add your own!) '
33
- 'Use the & character to join two voices together. Use a comma to perform inference on multiple voices.', default='patrick_stewart')
34
  parser.add_argument('--output_path', type=str, help='Where to store outputs.', default='results/longform/')
35
  parser.add_argument('--preset', type=str, help='Which voice preset to use.', default='standard')
 
36
  parser.add_argument('--voice_diversity_intelligibility_slider', type=float,
37
  help='How to balance vocal diversity with the quality/intelligibility of the spoken text. 0 means highly diverse voice (not recommended), 1 means maximize intelligibility',
38
  default=.5)
@@ -41,6 +44,9 @@ if __name__ == '__main__':
41
  outpath = args.output_path
42
  voices = get_voices()
43
  selected_voices = args.voice.split(',')
 
 
 
44
  for selected_voice in selected_voices:
45
  voice_outpath = os.path.join(outpath, selected_voice)
46
  os.makedirs(voice_outpath, exist_ok=True)
@@ -67,7 +73,15 @@ if __name__ == '__main__':
67
  for cond_path in cond_paths:
68
  c = load_audio(cond_path, 22050)
69
  conds.append(c)
 
70
  for j, text in enumerate(texts):
 
 
 
71
  gen = tts.tts_with_preset(text, conds, preset=args.preset, clvp_cvvp_slider=args.voice_diversity_intelligibility_slider)
72
- torchaudio.save(os.path.join(voice_outpath, f'{j}.wav'), gen.squeeze(0).cpu(), 24000)
 
 
 
 
73
 
 
5
  import torch.nn.functional as F
6
  import torchaudio
7
 
8
+ from api import TextToSpeech, format_conditioning
9
  from utils.audio import load_audio, get_voices
10
  from utils.tokenizer import VoiceBpeTokenizer
11
 
12
+
13
  def split_and_recombine_text(texts, desired_length=200, max_len=300):
14
  # TODO: also split across '!' and '?'. Attempt to keep quotations together.
15
  texts = [s.strip() + "." for s in texts.split('.')]
 
27
  texts.pop(i+1)
28
  return texts
29
 
30
+
31
  if __name__ == '__main__':
32
  parser = argparse.ArgumentParser()
33
  parser.add_argument('--textfile', type=str, help='A file containing the text to read.', default="data/riding_hood.txt")
34
  parser.add_argument('--voice', type=str, help='Selects the voice to use for generation. See options in voices/ directory (and add your own!) '
35
+ 'Use the & character to join two voices together. Use a comma to perform inference on multiple voices.', default='pat')
36
  parser.add_argument('--output_path', type=str, help='Where to store outputs.', default='results/longform/')
37
  parser.add_argument('--preset', type=str, help='Which voice preset to use.', default='standard')
38
+ parser.add_argument('--regenerate', type=str, help='Comma-separated list of clip numbers to re-generate, or nothing.', default=None)
39
  parser.add_argument('--voice_diversity_intelligibility_slider', type=float,
40
  help='How to balance vocal diversity with the quality/intelligibility of the spoken text. 0 means highly diverse voice (not recommended), 1 means maximize intelligibility',
41
  default=.5)
 
44
  outpath = args.output_path
45
  voices = get_voices()
46
  selected_voices = args.voice.split(',')
47
+ regenerate = args.regenerate
48
+ if regenerate is not None:
49
+ regenerate = [int(e) for e in regenerate.split(',')]
50
  for selected_voice in selected_voices:
51
  voice_outpath = os.path.join(outpath, selected_voice)
52
  os.makedirs(voice_outpath, exist_ok=True)
 
73
  for cond_path in cond_paths:
74
  c = load_audio(cond_path, 22050)
75
  conds.append(c)
76
+ all_parts = []
77
  for j, text in enumerate(texts):
78
+ if regenerate is not None and j not in regenerate:
79
+ all_parts.append(load_audio(os.path.join(voice_outpath, f'{j}.wav'), 24000))
80
+ continue
81
  gen = tts.tts_with_preset(text, conds, preset=args.preset, clvp_cvvp_slider=args.voice_diversity_intelligibility_slider)
82
+ gen = gen.squeeze(0).cpu()
83
+ torchaudio.save(os.path.join(voice_outpath, f'{j}.wav'), gen, 24000)
84
+ all_parts.append(gen)
85
+ full_audio = torch.cat(all_parts, dim=-1)
86
+ torchaudio.save(os.path.join(voice_outpath, 'combined.wav'), full_audio, 24000)
87
 
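The new `--regenerate` flag lets `read.py` re-synthesize only selected clips: every other clip is reloaded from its existing `{j}.wav`, and all parts are then joined along the time axis (`torch.cat(all_parts, dim=-1)`) into `combined.wav`. The control flow can be sketched without audio or PyTorch by treating each clip as a list of samples; `load_cached` and `generate` are hypothetical callables standing in for `load_audio` and `tts.tts_with_preset`.

```python
from typing import Callable, List, Optional


def parse_regenerate(arg: Optional[str]) -> Optional[List[int]]:
    """'1,3' -> [1, 3]; None means every clip is generated."""
    if arg is None:
        return None
    return [int(e) for e in arg.split(',')]


def assemble(texts: List[str],
             regenerate: Optional[List[int]],
             load_cached: Callable[[int], List[float]],
             generate: Callable[[str], List[float]]) -> List[float]:
    """Reuse cached clips unless their index is listed, then concatenate in order."""
    all_parts = []
    for j, text in enumerate(texts):
        if regenerate is not None and j not in regenerate:
            all_parts.append(load_cached(j))
            continue
        all_parts.append(generate(text))
    combined: List[float] = []
    for part in all_parts:
        combined.extend(part)  # stands in for concatenation along the last (time) axis
    return combined


cache = {0: [0.1], 1: [0.2], 2: [0.3]}
print(assemble(['a', 'b', 'c'], parse_regenerate('1'), cache.__getitem__, lambda t: [9.9]))  # [0.1, 9.9, 0.3]
```

Because cached clips are read back from disk, this only works after a full run has already written every `{j}.wav`.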
requirements.txt CHANGED
@@ -6,4 +6,5 @@ tokenizers
6
  inflect
7
  progressbar
8
  einops
9
- unidecode
 
 
6
  inflect
7
  progressbar
8
  einops
9
+ unidecode
10
+ entmax
tortoise_tts.ipynb CHANGED
@@ -17,6 +17,19 @@
17
  "accelerator": "GPU"
18
  },
19
  "cells": [
20
  {
21
  "cell_type": "code",
22
  "execution_count": null,
@@ -38,16 +51,14 @@
38
  "import torchaudio\n",
39
  "import torch.nn as nn\n",
40
  "import torch.nn.functional as F\n",
41
- "from tqdm import tqdm\n",
42
  "\n",
43
- "from utils.tokenizer import VoiceBpeTokenizer\n",
44
- "from models.discrete_diffusion_vocoder import DiscreteDiffusionVocoder\n",
45
- "from models.text_voice_clip import VoiceCLIP\n",
46
- "from models.dvae import DiscreteVAE\n",
47
- "from models.autoregressive import UnifiedVoice\n",
48
  "\n",
49
- "# These have some fairly interesting code that is hidden in the colab. Consider checking it out.\n",
50
- "from do_tts import download_models, load_discrete_vocoder_diffuser, load_conditioning, fix_autoregressive_output, do_spectrogram_diffusion"
 
 
 
51
  ],
52
  "metadata": {
53
  "id": "Gen09NM4hONQ"
@@ -58,21 +69,10 @@
58
  {
59
  "cell_type": "code",
60
  "source": [
61
- "# Download pretrained models and set up pretrained voice bank. Feel free to upload and add your own voices here.\n",
62
- "# To do so, upload two WAV files cropped to 5-10 seconds of someone speaking.\n",
63
- "download_models()\n",
64
- "preselected_cond_voices = {\n",
65
- " # Male voices\n",
66
- " 'dotrice': ['voices/dotrice/1.wav', 'voices/dotrice/2.wav'],\n",
67
- " 'harris': ['voices/harris/1.wav', 'voices/harris/2.wav'],\n",
68
- " 'lescault': ['voices/lescault/1.wav', 'voices/lescault/2.wav'],\n",
69
- " 'otto': ['voices/otto/1.wav', 'voices/otto/2.wav'],\n",
70
- " # Female voices\n",
71
- " 'atkins': ['voices/atkins/1.wav', 'voices/atkins/2.wav'],\n",
72
- " 'grace': ['voices/grace/1.wav', 'voices/grace/2.wav'],\n",
73
- " 'kennard': ['voices/kennard/1.wav', 'voices/kennard/2.wav'],\n",
74
- " 'mol': ['voices/mol/1.wav', 'voices/mol/2.wav'],\n",
75
- " }"
76
  ],
77
  "metadata": {
78
  "id": "SSleVnRAiEE2"
@@ -84,12 +84,20 @@
84
  "cell_type": "code",
85
  "source": [
86
  "# This is the text that will be spoken.\n",
87
- "text = \"And took the other as just as fair, and having perhaps the better claim, because it was grassy and wanted wear.\"\n",
88
- "# This is the voice that will speak it.\n",
89
- "voice = 'atkins'\n",
90
- "# This is the number of samples we will generate from the DALLE-style model. More will produce better results, but will take longer to produce.\n",
91
- "# I don't recommend going less than 128.\n",
92
- "num_autoregressive_samples = 128"
93
  ],
94
  "metadata": {
95
  "id": "bt_aoxONjfL2"
@@ -100,70 +108,20 @@
100
  {
101
  "cell_type": "code",
102
  "source": [
103
- "# Prepare data.\n",
104
- "tokenizer = VoiceBpeTokenizer()\n",
105
- "text = torch.IntTensor(tokenizer.encode(text)).unsqueeze(0).cuda()\n",
106
- "text = F.pad(text, (0,1)) # This may not be necessary.\n",
107
- "cond_paths = preselected_cond_voices[voice]\n",
108
  "conds = []\n",
109
  "for cond_path in cond_paths:\n",
110
- " c, cond_wav = load_conditioning(cond_path)\n",
111
  " conds.append(c)\n",
112
- "conds = torch.stack(conds, dim=1) # And just use the last cond_wav for the diffusion model."
113
- ],
114
- "metadata": {
115
- "id": "KEXOKjIvn6NW"
116
- },
117
- "execution_count": null,
118
- "outputs": []
119
- },
120
- {
121
- "cell_type": "code",
122
- "source": [
123
- "# Load the autoregressive model.\n",
124
- "autoregressive = UnifiedVoice(max_mel_tokens=300, max_text_tokens=200, max_conditioning_inputs=2, layers=30, model_dim=1024,\n",
125
- " heads=16, number_text_tokens=256, start_text_token=255, checkpointing=False, train_solo_embeddings=False).cuda().eval()\n",
126
- "autoregressive.load_state_dict(torch.load('.models/autoregressive.pth'))\n",
127
- "stop_mel_token = autoregressive.stop_mel_token"
128
- ],
129
- "metadata": {
130
- "id": "Z15xFT_uhP8v"
131
- },
132
- "execution_count": null,
133
- "outputs": []
134
- },
135
- {
136
- "cell_type": "code",
137
- "source": [
138
- "# Perform inference with the autoregressive model, generating num_autoregressive_samples\n",
139
- "with torch.no_grad():\n",
140
- " samples = []\n",
141
- " for b in tqdm(range(num_autoregressive_samples // 16)):\n",
142
- " codes = autoregressive.inference_speech(conds, text, num_beams=1, repetition_penalty=1.0, do_sample=True, top_k=50, top_p=.95,\n",
143
- " temperature=.9, num_return_sequences=16, length_penalty=1)\n",
144
- " padding_needed = 250 - codes.shape[1]\n",
145
- " codes = F.pad(codes, (0, padding_needed), value=stop_mel_token)\n",
146
- " samples.append(codes)\n",
147
  "\n",
148
- "# Delete model weights to conserve memory.\n",
149
- "del autoregressive"
 
150
  ],
151
  "metadata": {
152
- "id": "xajqWiEik-j0"
153
- },
154
- "execution_count": null,
155
- "outputs": []
156
- },
157
- {
158
- "cell_type": "code",
159
- "source": [
160
- "# Load the CLIP model.\n",
161
- "clip = VoiceCLIP(dim_text=512, dim_speech=512, dim_latent=512, num_text_tokens=256, text_enc_depth=8, text_seq_len=120, text_heads=8,\n",
162
- " num_speech_tokens=8192, speech_enc_depth=10, speech_heads=8, speech_seq_len=250).cuda().eval()\n",
163
- "clip.load_state_dict(torch.load('.models/clip.pth'))"
164
- ],
165
- "metadata": {
166
- "id": "KNgYSyuyliMs"
167
  },
168
  "execution_count": null,
169
  "outputs": []
@@ -171,75 +129,24 @@
171
  {
172
  "cell_type": "code",
173
  "source": [
174
- "# Use the CLIP model to select the best autoregressive output to match the given text.\n",
175
- "clip_results = []\n",
176
- "with torch.no_grad():\n",
177
- " for batch in samples:\n",
178
- " for i in range(batch.shape[0]):\n",
179
- " batch[i] = fix_autoregressive_output(batch[i], stop_mel_token)\n",
180
- " text = text[:, :120] # Ugly hack to fix the fact that I didn't train CLIP to handle long enough text.\n",
181
- " clip_results.append(clip(text.repeat(batch.shape[0], 1),\n",
182
- " torch.full((batch.shape[0],), fill_value=text.shape[1]-1, dtype=torch.long, device='cuda'),\n",
183
- " batch, torch.full((batch.shape[0],), fill_value=batch.shape[1]*1024, dtype=torch.long, device='cuda'),\n",
184
- " return_loss=False))\n",
185
- " clip_results = torch.cat(clip_results, dim=0)\n",
186
- " samples = torch.cat(samples, dim=0)\n",
187
- " best_results = samples[torch.topk(clip_results, k=1).indices]\n",
188
  "\n",
189
- "# Save samples to CPU memory, delete clip to conserve memory.\n",
190
- "samples = samples.cpu()\n",
191
- "del clip"
192
- ],
193
- "metadata": {
194
- "id": "DDXkM0lclp4U"
195
- },
196
- "execution_count": null,
197
- "outputs": []
198
- },
199
- {
200
- "cell_type": "code",
201
- "source": [
202
- "# Load the DVAE and diffusion model.\n",
203
- "dvae = DiscreteVAE(positional_dims=1, channels=80, hidden_dim=512, num_resnet_blocks=3, codebook_dim=512, num_tokens=8192, num_layers=2,\n",
204
- " record_codes=True, kernel_size=3, use_transposed_convs=False).cuda().eval()\n",
205
- "dvae.load_state_dict(torch.load('.models/dvae.pth'), strict=False)\n",
206
- "diffusion = DiscreteDiffusionVocoder(model_channels=128, dvae_dim=80, channel_mult=[1, 1, 1.5, 2, 3, 4, 6, 8, 8, 8, 8], num_res_blocks=[1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1],\n",
207
- " spectrogram_conditioning_resolutions=[2,512], attention_resolutions=[512,1024], num_heads=4, kernel_size=3, scale_factor=2,\n",
208
- " conditioning_inputs_provided=True, time_embed_dim_multiplier=4).cuda().eval()\n",
209
- "diffusion.load_state_dict(torch.load('.models/diffusion.pth'))\n",
210
- "diffuser = load_discrete_vocoder_diffuser(desired_diffusion_steps=100)"
211
- ],
212
- "metadata": {
213
- "id": "97acSnBal8Q2"
214
- },
215
- "execution_count": null,
216
- "outputs": []
217
- },
218
- {
219
- "cell_type": "code",
220
- "source": [
221
- "# Decode the (best) discrete sequence created by the autoregressive model.\n",
222
- "with torch.no_grad():\n",
223
- " for b in range(best_results.shape[0]):\n",
224
- " code = best_results[b].unsqueeze(0)\n",
225
- " wav = do_spectrogram_diffusion(diffusion, dvae, diffuser, code, cond_wav, spectrogram_compression_factor=256, mean=True)\n",
226
- " torchaudio.save(f'{voice}_{b}.wav', wav.squeeze(0).cpu(), 22050)"
227
- ],
228
- "metadata": {
229
- "id": "HEDABTrdl_kM"
230
- },
231
- "execution_count": null,
232
- "outputs": []
233
- },
234
- {
235
- "cell_type": "code",
236
- "source": [
237
- "# Listen to your text! (told you that'd take a long time..)\n",
238
- "from IPython.display import Audio\n",
239
- "Audio(data=wav.squeeze(0).cpu().numpy(), rate=22050)"
240
  ],
241
  "metadata": {
242
- "id": "EyHmcdqBmSvf"
243
  },
244
  "execution_count": null,
245
  "outputs": []
 
17
  "accelerator": "GPU"
18
  },
19
  "cells": [
20
+ {
21
+ "cell_type": "markdown",
22
+ "source": [
23
+ "Welcome to Tortoise! 🐢🐢🐢🐢\n",
24
+ "\n",
25
+ "Before you begin, I **strongly** recommend you turn on a GPU runtime.\n",
26
+ "\n",
27
+ "There's a reason this is called \"Tortoise\" - this model takes up to a minute to perform inference for a single sentence on a GPU. Expect waits on the order of hours on a CPU."
28
+ ],
29
+ "metadata": {
30
+ "id": "_pIZ3ZXNp7cf"
31
+ }
32
+ },
33
  {
34
  "cell_type": "code",
35
  "execution_count": null,
 
51
  "import torchaudio\n",
52
  "import torch.nn as nn\n",
53
  "import torch.nn.functional as F\n",
 
54
  "\n",
55
+ "import IPython\n",
 
 
 
 
56
  "\n",
57
+ "from api import TextToSpeech\n",
58
+ "from utils.audio import load_audio, get_voices\n",
59
+ "\n",
60
+ "# This will download all the models used by Tortoise from the HF hub.\n",
61
+ "tts = TextToSpeech()"
62
  ],
63
  "metadata": {
64
  "id": "Gen09NM4hONQ"
 
69
  {
70
  "cell_type": "code",
71
  "source": [
72
+ "# List all the voices available. These are just some random clips I've gathered\n",
73
+ "# from the internet as well as a few voices from the training dataset.\n",
74
+ "# Feel free to add your own clips to the voices/ folder.\n",
75
+ "%ls voices"
76
  ],
77
  "metadata": {
78
  "id": "SSleVnRAiEE2"
 
84
  "cell_type": "code",
85
  "source": [
86
  "# This is the text that will be spoken.\n",
87
+ "text = \"Joining two modalities results in a surprising increase in generalization! What would happen if we combined them all?\"\n",
88
+ "\n",
89
+ "# Here's something for the poetically inclined.. (set text=)\n",
90
+ "\"\"\"\n",
91
+ "Then took the other, as just as fair,\n",
92
+ "And having perhaps the better claim,\n",
93
+ "Because it was grassy and wanted wear;\n",
94
+ "Though as for that the passing there\n",
95
+ "Had worn them really about the same,\"\"\"\n",
96
+ "\n",
97
+ "# Pick one of the voices from above\n",
98
+ "voice = 'train_dotrice'\n",
99
+ "# Pick a \"preset mode\" to determine quality. Options: {\"ultra_fast\", \"fast\" (default), \"standard\", \"high_quality\"}. See docs in api.py\n",
100
+ "preset = \"fast\""
101
  ],
102
  "metadata": {
103
  "id": "bt_aoxONjfL2"
 
108
  {
109
  "cell_type": "code",
110
  "source": [
111
+ "# Fetch the voice references and forward execute!\n",
112
+ "voices = get_voices()\n",
113
+ "cond_paths = voices[voice]\n",
 
 
114
  "conds = []\n",
115
  "for cond_path in cond_paths:\n",
116
+ " c = load_audio(cond_path, 22050)\n",
117
  " conds.append(c)\n",
118
  "\n",
119
+ "gen = tts.tts_with_preset(text, conds, preset)\n",
120
+ "torchaudio.save('generated.wav', gen.squeeze(0).cpu(), 24000)\n",
121
+ "IPython.display.Audio('generated.wav')"
122
  ],
123
  "metadata": {
124
+ "id": "KEXOKjIvn6NW"
125
  },
126
  "execution_count": null,
127
  "outputs": []
 
129
  {
130
  "cell_type": "code",
131
  "source": [
132
+ "# You can combine as many conditioning voices as you want. Combining\n",
133
+ "# clips from multiple voices takes the mean of the latent space for all\n",
134
+ "# voices. This creates a novel voice that is a combination of the inputs.\n",
135
+ "#\n",
136
+ "# Let's see what it would sound like if Picard and Kirk had a kid with a penchant for philosophy:\n",
137
+ "conds = []\n",
138
+ "for v in ['pat', 'william']:\n",
139
+ " cond_paths = voices[v]\n",
140
+ " for cond_path in cond_paths:\n",
141
+ " c = load_audio(cond_path, 22050)\n",
142
+ " conds.append(c)\n",
 
 
 
143
  "\n",
144
+ "gen = tts.tts_with_preset(\"They used to say that if man was meant to fly, he’d have wings. But he did fly. He discovered he had to.\", conds, preset)\n",
145
+ "torchaudio.save('captain_kirkard.wav', gen.squeeze(0).cpu(), 24000)\n",
146
+ "IPython.display.Audio('captain_kirkard.wav')"
147
  ],
148
  "metadata": {
149
+ "id": "fYTk8KUezUr5"
150
  },
151
  "execution_count": null,
152
  "outputs": []
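The notebook's last cell says that combining clips from several voices takes the mean of the latent space across them. The diff does not show how Tortoise computes or averages its conditioning latents, so the following is only a toy illustration of an element-wise mean over equal-length latent vectors; `mean_latent` and the two-dimensional latents are hypothetical. It also shows a consequence worth keeping in mind under the assumption that the mean is taken per clip: a voice that contributes more clips carries proportionally more weight.

```python
from typing import List, Sequence


def mean_latent(latents: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of equal-length latent vectors (one per conditioning clip)."""
    if not latents:
        raise ValueError("need at least one latent")
    dim = len(latents[0])
    if any(len(v) != dim for v in latents):
        raise ValueError("all latents must have the same length")
    return [sum(v[i] for v in latents) / len(latents) for i in range(dim)]


# Hypothetical latents: two clips per voice, as in the 'pat' + 'william' cell.
pat = [[1.0, 0.0], [1.0, 0.0]]
william = [[0.0, 1.0], [0.0, 1.0]]
print(mean_latent(pat + william))      # [0.5, 0.5]
print(mean_latent(pat + william[:1]))  # roughly [0.667, 0.333]: the voice with more clips dominates
```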
tortoise_v2_examples.html ADDED
@@ -0,0 +1,61 @@
1
+ <html><head><title>These words were never spoken.</title></head><body><h1>Handpicked results</h1><audio controls="" style="width: 600px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/favorites/atkins_mha.mp3" type="audio/mp3"></audio><br>
2
+ <audio controls="" style="width: 600px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/favorites/atkins_omicron.mp3" type="audio/mp3"></audio><br>
3
+ <audio controls="" style="width: 600px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/favorites/atkins_value.mp3" type="audio/mp3"></audio><br>
4
+ <audio controls="" style="width: 600px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/favorites/daniel_craig_dumbledore.mp3" type="audio/mp3"></audio><br>
5
+ <audio controls="" style="width: 600px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/favorites/daniel_craig_training_ethics.mp3" type="audio/mp3"></audio><br>
6
+ <audio controls="" style="width: 600px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/favorites/dotrice_stop_for_death.mp3" type="audio/mp3"></audio><br>
7
+ <audio controls="" style="width: 600px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/favorites/emma_stone_courage.mp3" type="audio/mp3"></audio><br>
8
+ <audio controls="" style="width: 600px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/favorites/emma_stone_training_ethics.mp3" type="audio/mp3"></audio><br>
9
+ <audio controls="" style="width: 600px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/favorites/halle_barry_dumbledore.mp3" type="audio/mp3"></audio><br>
10
+ <audio controls="" style="width: 600px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/favorites/halle_barry_oar_to_oar.mp3" type="audio/mp3"></audio><br>
11
+ <audio controls="" style="width: 600px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/favorites/henry_cavill_metallic_hydrogen.mp3" type="audio/mp3"></audio><br>
12
+ <audio controls="" style="width: 600px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/favorites/kennard_road_not_taken.mp3" type="audio/mp3"></audio><br>
13
+ <audio controls="" style="width: 600px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/favorites/morgan_freeman_metallic_hydrogen.mp3" type="audio/mp3"></audio><br>
14
+ <audio controls="" style="width: 600px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/favorites/myself_gatsby.mp3" type="audio/mp3"></audio><br>
15
+ <audio controls="" style="width: 600px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/favorites/patrick_stewart_omicron.mp3" type="audio/mp3"></audio><br>
16
+ <audio controls="" style="width: 600px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/favorites/patrick_stewart_secret_of_life.mp3" type="audio/mp3"></audio><br>
17
+ <audio controls="" style="width: 600px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/favorites/robert_deniro_review.mp3" type="audio/mp3"></audio><br>
+ <audio controls="" style="width: 600px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/favorites/william_shatner_spacecraft_interview.mp3" type="audio/mp3"></audio><br>
+ <h1>Handpicked longform result:</h1><audio controls="" style="width: 600px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/favorite_riding_hood.mp3" type="audio/mp3"></audio><br>
+ <h1>Compared to Tacotron2 (with the LJSpeech voice):</h1><table><th>Tacotron2+Waveglow</th><th>TorToiSe</th><tr><td><audio controls="" style="width: 300px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/tacotron_comparison/2-tacotron2.mp3" type="audio/mp3"></audio><br>
+ </td><td><audio controls="" style="width: 300px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/tacotron_comparison/2-tortoise.mp3" type="audio/mp3"></audio><br>
+ </td></tr><tr><td><audio controls="" style="width: 300px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/tacotron_comparison/3-tacotron2.mp3" type="audio/mp3"></audio><br>
+ </td><td><audio controls="" style="width: 300px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/tacotron_comparison/3-tortoise.mp3" type="audio/mp3"></audio><br>
+ </td></tr><tr><td><audio controls="" style="width: 300px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/tacotron_comparison/4-tacotron2.mp3" type="audio/mp3"></audio><br>
+ </td><td><audio controls="" style="width: 300px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/tacotron_comparison/4-tortoise.mp3" type="audio/mp3"></audio><br>
+ </td></tr></table><h1>Various spoken texts for all voices:</h1><table><th>text</th><th>angie</th><th>daniel</th><th>deniro</th><th>emma</th><th>freeman</th><th>geralt</th><th>halle</th><th>jlaw</th><th>lj</th><th>myself</th><th>pat</th><th>snakes</th><th>tom</th><th>train_atkins</th><th>train_dotrice</th><th>train_kennard</th><th>weaver</th><th>william</th>
+ <tr><td>autoregressive_ml</td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/autoregressive_ml/angie.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/autoregressive_ml/daniel.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/autoregressive_ml/deniro.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/autoregressive_ml/emma.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/autoregressive_ml/freeman.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/autoregressive_ml/geralt.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/autoregressive_ml/halle.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/autoregressive_ml/jlaw.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/autoregressive_ml/lj.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/autoregressive_ml/myself.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source 
src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/autoregressive_ml/pat.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/autoregressive_ml/snakes.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/autoregressive_ml/tom.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/autoregressive_ml/train_atkins.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/autoregressive_ml/train_dotrice.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/autoregressive_ml/train_kennard.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/autoregressive_ml/weaver.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/autoregressive_ml/william.mp3" type="audio/mp3"></audio></td></tr>
+ <tr><td>bengio_it_needs_to_know_what_is_bad</td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/bengio_it_needs_to_know_what_is_bad/angie.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/bengio_it_needs_to_know_what_is_bad/daniel.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/bengio_it_needs_to_know_what_is_bad/deniro.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/bengio_it_needs_to_know_what_is_bad/emma.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/bengio_it_needs_to_know_what_is_bad/freeman.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/bengio_it_needs_to_know_what_is_bad/geralt.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/bengio_it_needs_to_know_what_is_bad/halle.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/bengio_it_needs_to_know_what_is_bad/jlaw.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/bengio_it_needs_to_know_what_is_bad/lj.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source 
src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/bengio_it_needs_to_know_what_is_bad/myself.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/bengio_it_needs_to_know_what_is_bad/pat.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/bengio_it_needs_to_know_what_is_bad/snakes.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/bengio_it_needs_to_know_what_is_bad/tom.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/bengio_it_needs_to_know_what_is_bad/train_atkins.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/bengio_it_needs_to_know_what_is_bad/train_dotrice.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/bengio_it_needs_to_know_what_is_bad/train_kennard.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/bengio_it_needs_to_know_what_is_bad/weaver.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/bengio_it_needs_to_know_what_is_bad/william.mp3" type="audio/mp3"></audio></td></tr>
+ <tr><td>dickinson_stop_for_death</td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/dickinson_stop_for_death/angie.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/dickinson_stop_for_death/daniel.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/dickinson_stop_for_death/deniro.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/dickinson_stop_for_death/emma.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/dickinson_stop_for_death/freeman.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/dickinson_stop_for_death/geralt.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/dickinson_stop_for_death/halle.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/dickinson_stop_for_death/jlaw.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/dickinson_stop_for_death/lj.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/dickinson_stop_for_death/myself.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source 
src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/dickinson_stop_for_death/pat.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/dickinson_stop_for_death/snakes.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/dickinson_stop_for_death/tom.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/dickinson_stop_for_death/train_atkins.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/dickinson_stop_for_death/train_dotrice.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/dickinson_stop_for_death/train_kennard.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/dickinson_stop_for_death/weaver.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/dickinson_stop_for_death/william.mp3" type="audio/mp3"></audio></td></tr>
+ <tr><td>espn_basketball</td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/espn_basketball/angie.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/espn_basketball/daniel.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/espn_basketball/deniro.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/espn_basketball/emma.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/espn_basketball/freeman.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/espn_basketball/geralt.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/espn_basketball/halle.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/espn_basketball/jlaw.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/espn_basketball/lj.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/espn_basketball/myself.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/espn_basketball/pat.mp3" 
type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/espn_basketball/snakes.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/espn_basketball/tom.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/espn_basketball/train_atkins.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/espn_basketball/train_dotrice.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/espn_basketball/train_kennard.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/espn_basketball/weaver.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/espn_basketball/william.mp3" type="audio/mp3"></audio></td></tr>
+ <tr><td>frost_oar_to_oar</td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/frost_oar_to_oar/angie.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/frost_oar_to_oar/daniel.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/frost_oar_to_oar/deniro.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/frost_oar_to_oar/emma.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/frost_oar_to_oar/freeman.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/frost_oar_to_oar/geralt.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/frost_oar_to_oar/halle.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/frost_oar_to_oar/jlaw.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/frost_oar_to_oar/lj.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/frost_oar_to_oar/myself.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/frost_oar_to_oar/pat.mp3" 
type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/frost_oar_to_oar/snakes.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/frost_oar_to_oar/tom.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/frost_oar_to_oar/train_atkins.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/frost_oar_to_oar/train_dotrice.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/frost_oar_to_oar/train_kennard.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/frost_oar_to_oar/weaver.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/frost_oar_to_oar/william.mp3" type="audio/mp3"></audio></td></tr>
+ <tr><td>frost_road_not_taken</td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/frost_road_not_taken/angie.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/frost_road_not_taken/daniel.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/frost_road_not_taken/deniro.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/frost_road_not_taken/emma.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/frost_road_not_taken/freeman.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/frost_road_not_taken/geralt.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/frost_road_not_taken/halle.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/frost_road_not_taken/jlaw.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/frost_road_not_taken/lj.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/frost_road_not_taken/myself.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source 
src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/frost_road_not_taken/pat.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/frost_road_not_taken/snakes.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/frost_road_not_taken/tom.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/frost_road_not_taken/train_atkins.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/frost_road_not_taken/train_dotrice.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/frost_road_not_taken/train_kennard.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/frost_road_not_taken/weaver.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/frost_road_not_taken/william.mp3" type="audio/mp3"></audio></td></tr>
+ <tr><td>gatsby_and_so_we_beat_on</td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/gatsby_and_so_we_beat_on/angie.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/gatsby_and_so_we_beat_on/daniel.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/gatsby_and_so_we_beat_on/deniro.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/gatsby_and_so_we_beat_on/emma.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/gatsby_and_so_we_beat_on/freeman.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/gatsby_and_so_we_beat_on/geralt.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/gatsby_and_so_we_beat_on/halle.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/gatsby_and_so_we_beat_on/jlaw.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/gatsby_and_so_we_beat_on/lj.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/gatsby_and_so_we_beat_on/myself.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source 
src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/gatsby_and_so_we_beat_on/pat.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/gatsby_and_so_we_beat_on/snakes.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/gatsby_and_so_we_beat_on/tom.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/gatsby_and_so_we_beat_on/train_atkins.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/gatsby_and_so_we_beat_on/train_dotrice.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/gatsby_and_so_we_beat_on/train_kennard.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/gatsby_and_so_we_beat_on/weaver.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/gatsby_and_so_we_beat_on/william.mp3" type="audio/mp3"></audio></td></tr>
+ <tr><td>harrypotter_differences_of_habit_and_language</td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/harrypotter_differences_of_habit_and_language/angie.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/harrypotter_differences_of_habit_and_language/daniel.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/harrypotter_differences_of_habit_and_language/deniro.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/harrypotter_differences_of_habit_and_language/emma.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/harrypotter_differences_of_habit_and_language/freeman.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/harrypotter_differences_of_habit_and_language/geralt.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/harrypotter_differences_of_habit_and_language/halle.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/harrypotter_differences_of_habit_and_language/jlaw.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/harrypotter_differences_of_habit_and_language/lj.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 
150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/harrypotter_differences_of_habit_and_language/myself.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/harrypotter_differences_of_habit_and_language/pat.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/harrypotter_differences_of_habit_and_language/snakes.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/harrypotter_differences_of_habit_and_language/tom.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/harrypotter_differences_of_habit_and_language/train_atkins.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/harrypotter_differences_of_habit_and_language/train_dotrice.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/harrypotter_differences_of_habit_and_language/train_kennard.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/harrypotter_differences_of_habit_and_language/weaver.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/harrypotter_differences_of_habit_and_language/william.mp3" type="audio/mp3"></audio></td></tr>
+ <tr><td>i_am_a_language_model</td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/i_am_a_language_model/angie.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/i_am_a_language_model/daniel.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/i_am_a_language_model/deniro.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/i_am_a_language_model/emma.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/i_am_a_language_model/freeman.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/i_am_a_language_model/geralt.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/i_am_a_language_model/halle.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/i_am_a_language_model/jlaw.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/i_am_a_language_model/lj.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/i_am_a_language_model/myself.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source 
src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/i_am_a_language_model/pat.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/i_am_a_language_model/snakes.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/i_am_a_language_model/tom.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/i_am_a_language_model/train_atkins.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/i_am_a_language_model/train_dotrice.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/i_am_a_language_model/train_kennard.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/i_am_a_language_model/weaver.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/i_am_a_language_model/william.mp3" type="audio/mp3"></audio></td></tr>
+ <tr><td>melodie_kao</td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/melodie_kao/angie.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/melodie_kao/daniel.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/melodie_kao/deniro.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/melodie_kao/emma.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/melodie_kao/freeman.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/melodie_kao/geralt.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/melodie_kao/halle.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/melodie_kao/jlaw.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/melodie_kao/lj.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/melodie_kao/myself.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/melodie_kao/pat.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 
150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/melodie_kao/snakes.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/melodie_kao/tom.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/melodie_kao/train_atkins.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/melodie_kao/train_dotrice.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/melodie_kao/train_kennard.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/melodie_kao/weaver.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/melodie_kao/william.mp3" type="audio/mp3"></audio></td></tr>
+ <tr><td>nyt_covid</td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/nyt_covid/angie.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/nyt_covid/daniel.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/nyt_covid/deniro.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/nyt_covid/emma.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/nyt_covid/freeman.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/nyt_covid/geralt.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/nyt_covid/halle.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/nyt_covid/jlaw.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/nyt_covid/lj.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/nyt_covid/myself.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/nyt_covid/pat.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source 
src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/nyt_covid/snakes.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/nyt_covid/tom.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/nyt_covid/train_atkins.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/nyt_covid/train_dotrice.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/nyt_covid/train_kennard.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/nyt_covid/weaver.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/nyt_covid/william.mp3" type="audio/mp3"></audio></td></tr>
+ <tr><td>real_courage_is_when_you_know_your_licked</td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/real_courage_is_when_you_know_your_licked/angie.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/real_courage_is_when_you_know_your_licked/daniel.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/real_courage_is_when_you_know_your_licked/deniro.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/real_courage_is_when_you_know_your_licked/emma.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/real_courage_is_when_you_know_your_licked/freeman.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/real_courage_is_when_you_know_your_licked/geralt.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/real_courage_is_when_you_know_your_licked/halle.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/real_courage_is_when_you_know_your_licked/jlaw.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/real_courage_is_when_you_know_your_licked/lj.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source 
src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/real_courage_is_when_you_know_your_licked/myself.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/real_courage_is_when_you_know_your_licked/pat.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/real_courage_is_when_you_know_your_licked/snakes.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/real_courage_is_when_you_know_your_licked/tom.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/real_courage_is_when_you_know_your_licked/train_atkins.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/real_courage_is_when_you_know_your_licked/train_dotrice.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/real_courage_is_when_you_know_your_licked/train_kennard.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/real_courage_is_when_you_know_your_licked/weaver.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/real_courage_is_when_you_know_your_licked/william.mp3" type="audio/mp3"></audio></td></tr>
+ <tr><td>rolling_stone_review</td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/rolling_stone_review/angie.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/rolling_stone_review/daniel.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/rolling_stone_review/deniro.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/rolling_stone_review/emma.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/rolling_stone_review/freeman.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/rolling_stone_review/geralt.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/rolling_stone_review/halle.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/rolling_stone_review/jlaw.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/rolling_stone_review/lj.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/rolling_stone_review/myself.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source 
src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/rolling_stone_review/pat.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/rolling_stone_review/snakes.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/rolling_stone_review/tom.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/rolling_stone_review/train_atkins.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/rolling_stone_review/train_dotrice.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/rolling_stone_review/train_kennard.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/rolling_stone_review/weaver.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/rolling_stone_review/william.mp3" type="audio/mp3"></audio></td></tr>
+ <tr><td>spacecraft_interview</td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/spacecraft_interview/angie.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/spacecraft_interview/daniel.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/spacecraft_interview/deniro.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/spacecraft_interview/emma.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/spacecraft_interview/freeman.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/spacecraft_interview/geralt.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/spacecraft_interview/halle.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/spacecraft_interview/jlaw.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/spacecraft_interview/lj.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/spacecraft_interview/myself.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source 
src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/spacecraft_interview/pat.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/spacecraft_interview/snakes.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/spacecraft_interview/tom.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/spacecraft_interview/train_atkins.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/spacecraft_interview/train_dotrice.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/spacecraft_interview/train_kennard.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/spacecraft_interview/weaver.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/spacecraft_interview/william.mp3" type="audio/mp3"></audio></td></tr>
+ <tr><td>tacotron2_sample1</td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample1/angie.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample1/daniel.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample1/deniro.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample1/emma.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample1/freeman.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample1/geralt.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample1/halle.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample1/jlaw.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample1/lj.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample1/myself.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source 
src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample1/pat.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample1/snakes.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample1/tom.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample1/train_atkins.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample1/train_dotrice.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample1/train_kennard.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample1/weaver.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample1/william.mp3" type="audio/mp3"></audio></td></tr>
+ <tr><td>tacotron2_sample2</td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample2/angie.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample2/daniel.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample2/deniro.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample2/emma.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample2/freeman.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample2/geralt.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample2/halle.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample2/jlaw.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample2/lj.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample2/myself.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source 
src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample2/pat.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample2/snakes.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample2/tom.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample2/train_atkins.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample2/train_dotrice.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample2/train_kennard.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample2/weaver.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample2/william.mp3" type="audio/mp3"></audio></td></tr>
+ <tr><td>tacotron2_sample3</td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample3/angie.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample3/daniel.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample3/deniro.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample3/emma.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample3/freeman.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample3/geralt.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample3/halle.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample3/jlaw.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample3/lj.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample3/myself.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source 
src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample3/pat.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample3/snakes.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample3/tom.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample3/train_atkins.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample3/train_dotrice.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample3/train_kennard.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample3/weaver.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample3/william.mp3" type="audio/mp3"></audio></td></tr>
+ <tr><td>tacotron2_sample4</td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample4/angie.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample4/daniel.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample4/deniro.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample4/emma.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample4/freeman.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample4/geralt.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample4/halle.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample4/jlaw.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample4/lj.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample4/myself.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source 
src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample4/pat.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample4/snakes.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample4/tom.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample4/train_atkins.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample4/train_dotrice.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample4/train_kennard.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample4/weaver.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/tacotron2_sample4/william.mp3" type="audio/mp3"></audio></td></tr>
+ <tr><td>watts_this_is_the_real_secret_of_life</td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/watts_this_is_the_real_secret_of_life/angie.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/watts_this_is_the_real_secret_of_life/daniel.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/watts_this_is_the_real_secret_of_life/deniro.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/watts_this_is_the_real_secret_of_life/emma.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/watts_this_is_the_real_secret_of_life/freeman.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/watts_this_is_the_real_secret_of_life/geralt.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/watts_this_is_the_real_secret_of_life/halle.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/watts_this_is_the_real_secret_of_life/jlaw.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/watts_this_is_the_real_secret_of_life/lj.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source 
src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/watts_this_is_the_real_secret_of_life/myself.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/watts_this_is_the_real_secret_of_life/pat.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/watts_this_is_the_real_secret_of_life/snakes.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/watts_this_is_the_real_secret_of_life/tom.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/watts_this_is_the_real_secret_of_life/train_atkins.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/watts_this_is_the_real_secret_of_life/train_dotrice.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/watts_this_is_the_real_secret_of_life/train_kennard.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/watts_this_is_the_real_secret_of_life/weaver.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/watts_this_is_the_real_secret_of_life/william.mp3" type="audio/mp3"></audio></td></tr>
+ <tr><td>wilde_nowadays_people_know_the_price</td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/wilde_nowadays_people_know_the_price/angie.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/wilde_nowadays_people_know_the_price/daniel.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/wilde_nowadays_people_know_the_price/deniro.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/wilde_nowadays_people_know_the_price/emma.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/wilde_nowadays_people_know_the_price/freeman.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/wilde_nowadays_people_know_the_price/geralt.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/wilde_nowadays_people_know_the_price/halle.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/wilde_nowadays_people_know_the_price/jlaw.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/wilde_nowadays_people_know_the_price/lj.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source 
src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/wilde_nowadays_people_know_the_price/myself.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/wilde_nowadays_people_know_the_price/pat.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/wilde_nowadays_people_know_the_price/snakes.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/wilde_nowadays_people_know_the_price/tom.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/wilde_nowadays_people_know_the_price/train_atkins.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/wilde_nowadays_people_know_the_price/train_dotrice.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/wilde_nowadays_people_know_the_price/train_kennard.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/wilde_nowadays_people_know_the_price/weaver.mp3" type="audio/mp3"></audio></td><td><audio controls="" style="width: 150px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/various/wilde_nowadays_people_know_the_price/william.mp3" type="audio/mp3"></audio></td></tr></table><h1>Longform result for all voices:</h1><audio controls="" style="width: 600px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/riding_hood/angelina.mp3" type="audio/mp3"></audio><br>
+ <audio controls="" style="width: 600px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/riding_hood/craig.mp3" type="audio/mp3"></audio><br>
+ <audio controls="" style="width: 600px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/riding_hood/deniro.mp3" type="audio/mp3"></audio><br>
+ <audio controls="" style="width: 600px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/riding_hood/emma.mp3" type="audio/mp3"></audio><br>
+ <audio controls="" style="width: 600px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/riding_hood/freeman.mp3" type="audio/mp3"></audio><br>
+ <audio controls="" style="width: 600px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/riding_hood/geralt.mp3" type="audio/mp3"></audio><br>
+ <audio controls="" style="width: 600px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/riding_hood/halle.mp3" type="audio/mp3"></audio><br>
+ <audio controls="" style="width: 600px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/riding_hood/jlaw.mp3" type="audio/mp3"></audio><br>
+ <audio controls="" style="width: 600px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/riding_hood/lj.mp3" type="audio/mp3"></audio><br>
+ <audio controls="" style="width: 600px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/riding_hood/myself.mp3" type="audio/mp3"></audio><br>
+ <audio controls="" style="width: 600px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/riding_hood/pat.mp3" type="audio/mp3"></audio><br>
+ <audio controls="" style="width: 600px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/riding_hood/snakes.mp3" type="audio/mp3"></audio><br>
+ <audio controls="" style="width: 600px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/riding_hood/tom.mp3" type="audio/mp3"></audio><br>
+ <audio controls="" style="width: 600px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/riding_hood/weaver.mp3" type="audio/mp3"></audio><br>
+ <audio controls="" style="width: 600px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/results/riding_hood/william.mp3" type="audio/mp3"></audio><br>
+ </body></html>
utils/diffusion.py CHANGED
@@ -605,7 +605,7 @@ class GaussianDiffusion:
  img = th.randn(*shape, device=device)
  indices = list(range(self.num_timesteps))[::-1]
 
- for i in tqdm(indices):
+ for i in tqdm(indices, disable=not progress):
  t = th.tensor([i] * shape[0], device=device)
  with th.no_grad():
  out = self.p_sample(
@@ -774,7 +774,7 @@ class GaussianDiffusion:
  # Lazy import so that we don't depend on tqdm.
  from tqdm.auto import tqdm
 
- indices = tqdm(indices)
+ indices = tqdm(indices, disable=not progress)
 
  for i in indices:
  t = th.tensor([i] * shape[0], device=device)
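The diffusion change passes the sampler's `progress` flag to tqdm's `disable` argument instead of always wrapping `indices`, so a bar is drawn only when progress output is requested. A minimal standalone sketch of that pattern follows; the function name `reverse_timestep_loop` is hypothetical, the real per-step `p_sample` call is replaced by a stub, and a pass-through fallback is defined in case tqdm is not installed.

```python
try:
    from tqdm.auto import tqdm
except ImportError:
    # Hypothetical fallback so the sketch runs without tqdm: no bar, items pass through.
    def tqdm(iterable, disable=False, **kwargs):
        return iterable


def reverse_timestep_loop(num_timesteps, progress=False):
    """Visit timesteps from num_timesteps - 1 down to 0, as the sampling loops do.

    The bar is shown only when progress=True; with disable=True,
    tqdm simply yields the items without drawing anything.
    """
    indices = list(range(num_timesteps))[::-1]
    visited = []
    for i in tqdm(indices, disable=not progress):
        visited.append(i)  # stand-in for the real per-timestep p_sample() call
    return visited
```

For example, `reverse_timestep_loop(4)` returns `[3, 2, 1, 0]` without printing a bar. Routing the flag through `disable` keeps a single loop body for both cases rather than branching on `progress`.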
voices/.gitattributes ADDED
@@ -0,0 +1,83 @@
+ train_dotrice filter=lfs diff=lfs merge=lfs -text
+ train_lescault filter=lfs diff=lfs merge=lfs -text
+ kennard filter=lfs diff=lfs merge=lfs -text
+ myself filter=lfs diff=lfs merge=lfs -text
+ snakes filter=lfs diff=lfs merge=lfs -text
+ train_atkins filter=lfs diff=lfs merge=lfs -text
+ emma filter=lfs diff=lfs merge=lfs -text
+ geralt filter=lfs diff=lfs merge=lfs -text
+ jlaw filter=lfs diff=lfs merge=lfs -text
+ lj filter=lfs diff=lfs merge=lfs -text
+ pat filter=lfs diff=lfs merge=lfs -text
+ train_grace filter=lfs diff=lfs merge=lfs -text
+ weaver filter=lfs diff=lfs merge=lfs -text
+ william filter=lfs diff=lfs merge=lfs -text
+ angie filter=lfs diff=lfs merge=lfs -text
+ craig filter=lfs diff=lfs merge=lfs -text
+ halle filter=lfs diff=lfs merge=lfs -text
+ tom filter=lfs diff=lfs merge=lfs -text
+ deniro filter=lfs diff=lfs merge=lfs -text
+ freeman filter=lfs diff=lfs merge=lfs -text
+ mol filter=lfs diff=lfs merge=lfs -text
+ mol/1.wav filter=lfs diff=lfs merge=lfs -text
+ myself/2.wav filter=lfs diff=lfs merge=lfs -text
+ tom/4.wav filter=lfs diff=lfs merge=lfs -text
+ train_grace/1.wav filter=lfs diff=lfs merge=lfs -text
+ william/3.wav filter=lfs diff=lfs merge=lfs -text
+ freeman/3.wav filter=lfs diff=lfs merge=lfs -text
+ jlaw/4.wav filter=lfs diff=lfs merge=lfs -text
+ lj/1.wav filter=lfs diff=lfs merge=lfs -text
+ myself/1.wav filter=lfs diff=lfs merge=lfs -text
+ geralt/1.wav filter=lfs diff=lfs merge=lfs -text
+ lj/2.wav filter=lfs diff=lfs merge=lfs -text
+ pat/3.wav filter=lfs diff=lfs merge=lfs -text
+ tom/2.wav filter=lfs diff=lfs merge=lfs -text
+ deniro/2.wav filter=lfs diff=lfs merge=lfs -text
+ jlaw/3.wav filter=lfs diff=lfs merge=lfs -text
+ angie/2.wav filter=lfs diff=lfs merge=lfs -text
+ deniro/1.wav filter=lfs diff=lfs merge=lfs -text
+ deniro/3.wav filter=lfs diff=lfs merge=lfs -text
+ jlaw/1.wav filter=lfs diff=lfs merge=lfs -text
+ myself/3.wav filter=lfs diff=lfs merge=lfs -text
+ william/2.wav filter=lfs diff=lfs merge=lfs -text
+ pat/1.wav filter=lfs diff=lfs merge=lfs -text
+ snakes/2.wav filter=lfs diff=lfs merge=lfs -text
+ tom/1.wav filter=lfs diff=lfs merge=lfs -text
+ train_grace/2.wav filter=lfs diff=lfs merge=lfs -text
+ weaver/2.wav filter=lfs diff=lfs merge=lfs -text
+ craig/1.wav filter=lfs diff=lfs merge=lfs -text
+ emma/3.wav filter=lfs diff=lfs merge=lfs -text
+ freeman/1.wav filter=lfs diff=lfs merge=lfs -text
+ mol/2.wav filter=lfs diff=lfs merge=lfs -text
+ geralt/3.wav filter=lfs diff=lfs merge=lfs -text
+ kennard/2.wav filter=lfs diff=lfs merge=lfs -text
+ pat/4.wav filter=lfs diff=lfs merge=lfs -text
+ train_dotrice/2.wav filter=lfs diff=lfs merge=lfs -text
+ train_lescault/2.wav filter=lfs diff=lfs merge=lfs -text
+ william/1.wav filter=lfs diff=lfs merge=lfs -text
+ angie/3.wav filter=lfs diff=lfs merge=lfs -text
+ deniro/4.wav filter=lfs diff=lfs merge=lfs -text
+ emma/2.wav filter=lfs diff=lfs merge=lfs -text
+ halle/1.wav filter=lfs diff=lfs merge=lfs -text
+ halle/2.wav filter=lfs diff=lfs merge=lfs -text
+ weaver/3.wav filter=lfs diff=lfs merge=lfs -text
+ train_atkins/1.wav filter=lfs diff=lfs merge=lfs -text
+ weaver/1.wav filter=lfs diff=lfs merge=lfs -text
+ angie/1.wav filter=lfs diff=lfs merge=lfs -text
+ craig/3.wav filter=lfs diff=lfs merge=lfs -text
+ jlaw/2.wav filter=lfs diff=lfs merge=lfs -text
+ kennard/1.wav filter=lfs diff=lfs merge=lfs -text
+ snakes/3.wav filter=lfs diff=lfs merge=lfs -text
+ train_atkins/2.wav filter=lfs diff=lfs merge=lfs -text
+ snakes/1.wav filter=lfs diff=lfs merge=lfs -text
+ tom/3.wav filter=lfs diff=lfs merge=lfs -text
+ train_dotrice/1.wav filter=lfs diff=lfs merge=lfs -text
+ craig/2.wav filter=lfs diff=lfs merge=lfs -text
+ geralt/2.wav filter=lfs diff=lfs merge=lfs -text
+ halle/3.wav filter=lfs diff=lfs merge=lfs -text
+ emma/1.wav filter=lfs diff=lfs merge=lfs -text
+ train_lescault/1.wav filter=lfs diff=lfs merge=lfs -text
+ craig/4.wav filter=lfs diff=lfs merge=lfs -text
+ freeman/2.wav filter=lfs diff=lfs merge=lfs -text
+ pat/2.wav filter=lfs diff=lfs merge=lfs -text
+ william/4.wav filter=lfs diff=lfs merge=lfs -text
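The added `.gitattributes` entries route each voice clip through Git LFS individually. Per the gitattributes documentation, a pattern that names a directory (such as `train_dotrice`) does not apply to the files inside it, so the directory-name lines alone would not cover the `.wav` clips; the explicit `<voice>/<n>.wav` lines do. A single glob such as `*.wav filter=lfs diff=lfs merge=lfs -text` (the form `git lfs track "*.wav"` writes) would also cover them. As a hedged sketch, an explicit per-file list like this one could be generated from a list of relative paths; `lfs_lines` is a hypothetical helper, not part of the repo.

```python
from pathlib import PurePosixPath

# Attribute string used by every entry in the voices/.gitattributes diff.
LFS_ATTRS = "filter=lfs diff=lfs merge=lfs -text"


def lfs_lines(paths, suffix=".wav"):
    """Return sorted .gitattributes lines, one per unique path ending in `suffix`.

    `paths` are relative to the directory holding the .gitattributes file
    (e.g. "tom/4.wav"), written with forward slashes as git expects.
    """
    matches = {
        str(PurePosixPath(p)) for p in paths if PurePosixPath(p).suffix == suffix
    }
    return [f"{path} {LFS_ATTRS}" for path in sorted(matches)]
```

For example, `lfs_lines(["tom/4.wav", "angie/1.wav", "README.md"])` returns the `angie/1.wav` line followed by the `tom/4.wav` line. Sorting only makes the output deterministic; the order of these entries does not matter because each one names a distinct path.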