salmaniq's picture
Upload 152 files
a72b927

TODO

  • FastAPI

    • Generate:
      • other params? (temperature, rvc f0_method, min_eos_p)
  • Bark TTS

    • Set speech sentiment in request body, then prefix each sentence e.g. [Happy]<input_text>
    • Should I aim to combine sentences which will fit in the largest clip length? (14s?)
      • More consistent tone etc in bark output?
    • Should map the rvc model to a chosen bark voice (incl. default)?
      • Or set via request body?
      • Currently hardcoded for bark voice v2/en_speaker_9 (works well with all tested RVC models, regardless of gender etc)
  • RVC

    • Should I re-use the Config details? (GPU info etc)
      • Set CUDA_VISIBLE_DEVICES for bark
    • Should I load hubert model for each request? Precious VRAM
  • Smaller container image

    • Currently used to confirm app dependencies only, don't care that it's 6 GiB

Issues

  • Splits on sentences. So a single sentence which takes longer than ~14 seconds will be a mess
  • Significantly slower bark generation when ran via API vs directly in python script
    • Generation time roughly equals audio length (tested on 3090)