Voice Customization Guide
Tortoise was specifically trained to be a multi-speaker model. It accomplishes this by consulting reference clips.
These reference clips are recordings of a speaker that you provide to guide speech generation. They determine many properties of the output, such as the pitch and tone of the voice, the speaking speed, and even speech impediments like a lisp or a stutter. The reference clips also determine non-voice-related aspects of the audio output, such as volume, background noise, recording quality and reverb.
Provided voices
This repo comes with several pre-packaged voices. Voices prefixed with "train_" came from the training set and perform far better than the others. If your goal is high-quality speech, I recommend you pick one of them. If you want to see what Tortoise can do for zero-shot mimicking, take a look at the others.
Adding a new voice
To add new voices to Tortoise, you will need to do the following:
- Gather audio clips of your speaker(s). Good sources are YouTube interviews (you can use youtube-dl to fetch the audio), audiobooks or podcasts. Guidelines for good clips are in the next section.
- Cut your clips into segments of roughly 10 seconds. You want at least 3 clips. More is better, but I only tested with up to 5.
- Save each clip as a WAV file in floating-point format with a 22,050 Hz sample rate.
- Create a subdirectory in voices/
- Put your clips in that subdirectory.
- Run the Tortoise utilities with --voice= set to the name of that subdirectory.
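The cutting and saving steps above can be scripted. The following is a minimal sketch, not part of Tortoise itself: it assumes NumPy and SciPy are installed, that the source recording is already a WAV file (signed integer PCM or float), and that cutting at fixed 10-second boundaries is acceptable. The function names (to_float_mono, resample, split_clips, prepare_voice) and the numbered output file names are hypothetical choices made for illustration.

```python
from math import gcd
from pathlib import Path

import numpy as np
from scipy.io import wavfile
from scipy.signal import resample_poly

TARGET_SR = 22050     # sample rate named in the steps above
CLIP_SECONDS = 10.0   # approximate clip length named in the steps above


def to_float_mono(audio: np.ndarray) -> np.ndarray:
    """Convert signed-integer PCM or float audio to mono float32."""
    if np.issubdtype(audio.dtype, np.integer):
        # Assumption: signed PCM (e.g. int16/int32); scale into [-1, 1).
        scale = float(np.iinfo(audio.dtype).max) + 1.0
        audio = audio.astype(np.float32) / scale
    else:
        audio = audio.astype(np.float32)
    if audio.ndim == 2:
        # scipy returns multi-channel audio as (samples, channels).
        audio = audio.mean(axis=1)
    return audio


def resample(audio: np.ndarray, sr: int, target_sr: int = TARGET_SR) -> np.ndarray:
    """Resample with a polyphase filter; a no-op if the rate already matches."""
    if sr == target_sr:
        return audio
    g = gcd(sr, target_sr)
    return resample_poly(audio, target_sr // g, sr // g).astype(np.float32)


def split_clips(audio: np.ndarray, sr: int, seconds: float = CLIP_SECONDS) -> list:
    """Cut audio into consecutive fixed-length clips, dropping a short tail."""
    n = int(sr * seconds)
    return [audio[i:i + n] for i in range(0, len(audio) - n + 1, n)]


def prepare_voice(source_wav: str, voice_name: str, voices_dir: str = "voices") -> list:
    """Write float32, 22,050 Hz clips into voices_dir/voice_name/."""
    sr, audio = wavfile.read(source_wav)
    audio = resample(to_float_mono(audio), sr)
    out_dir = Path(voices_dir) / voice_name
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, clip in enumerate(split_clips(audio, TARGET_SR), start=1):
        path = out_dir / f"{i}.wav"
        # A float32 array is written by scipy as a floating-point WAV.
        wavfile.write(str(path), TARGET_SR, clip)
        paths.append(path)
    return paths
```

A call such as prepare_voice("interview.wav", "my_speaker") would fill voices/my_speaker/ with numbered clips; both names here are placeholders. Fixed-length cuts can land mid-word, so trimming by hand at natural pauses is likely to give cleaner reference clips.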
Picking good reference clips
As mentioned above, your reference clips have a profound impact on the output of Tortoise. Here are some tips for picking good clips:
- Avoid clips with background music, noise or reverb. Such clips were removed from the training dataset, so Tortoise is unlikely to do well with them.
- Avoid speeches. These generally have distortion caused by the amplification system.
- Avoid clips from phone calls.
- Avoid clips that have excessive stuttering, stammering or words like "uh" or "like" in them.
- Try to find clips spoken in the style you want your output to have. For example, if you want to hear your target voice read an audiobook, try to find clips of them reading a book.
- The text being spoken in the clips does not matter, but diverse text does seem to perform better.