TorToiSe is a text-to-speech program built in April 2022 by jbetker@. TorToiSe is open source, with trained model weights available at https://github.com/neonbjb/tortoise-tts
This page demonstrates some of the results of TorToiSe.
Following are several particularly good results generated by the model.
LJSpeech is a popular dataset used to train small-scale TTS models. TorToiSe is a multi-voice model; the following shows how it renders the LJSpeech voice with and without fine-tuning, compared with results for the same text from the popular Tacotron2 model paired with the Waveglow vocoder.
Tacotron2+Waveglow | TorToiSe | TorToiSe Finetuned |
---|---|---|
NaturalVoice is a SOTA TTS engine developed by Microsoft Research Asia in May 2022. It features realistic prosody and end-to-end generation with no need for a vocoder. While not much has actually been released about this model other than five samples, those samples are quite good and I would consider this the most competitive TTS engine out there right now.
NaturalVoice | TorToiSe Finetuned |
---|---|
It is not really fair to compare any of these models: Tortoise is a multi-voice probabilistic model trained on millions of hours of speech, with exceptionally slow inference. Tacotron and NaturalVoice are efficient, fast, single-voice models trained on 24 hours of speech. Unfortunately, there isn't much research that is directly comparable to Tortoise.
Following are all the results from which the hand-picked results were drawn. Also included is the reference audio that the program is trying to mimic. This will give you a better sense of how TorToiSe really performs.
text | angie | daniel | deniro | emma | freeman | geralt | halle | jlaw | lj | myself | pat | snakes | tom | train_atkins | train_dotrice | train_kennard | weaver | william |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
reference clip | ||||||||||||||||||
autoregressive_ml | ||||||||||||||||||
bengio_it_needs_to_know_what_is_bad | ||||||||||||||||||
dickinson_stop_for_death | ||||||||||||||||||
espn_basketball | ||||||||||||||||||
frost_oar_to_oar | ||||||||||||||||||
frost_road_not_taken | ||||||||||||||||||
gatsby_and_so_we_beat_on | ||||||||||||||||||
harrypotter_differences_of_habit_and_language | ||||||||||||||||||
i_am_a_language_model | ||||||||||||||||||
melodie_kao | ||||||||||||||||||
nyt_covid | ||||||||||||||||||
real_courage_is_when_you_know_your_licked | ||||||||||||||||||
rolling_stone_review | ||||||||||||||||||
spacecraft_interview | ||||||||||||||||||
tacotron2_sample1 | ||||||||||||||||||
tacotron2_sample2 | ||||||||||||||||||
tacotron2_sample3 | ||||||||||||||||||
tacotron2_sample4 | ||||||||||||||||||
watts_this_is_the_real_secret_of_life | ||||||||||||||||||
wilde_nowadays_people_know_the_price | ||||||||||||||||||
Tortoise is capable of "prompt-engineering" in that tone and prosody are affected by the emotions expressed in the words fed to the program. For example, prompting the model with "[I am so angry,] I went to the park and threw a ball" will result in it outputting "I went to the park and threw a ball" in an angry tone.
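The bracket convention can be sketched in a few lines of Python. This is only an illustration of the prompt format described above, not code from the TorToiSe repository: the helper names `build_prompt` and `spoken_text` and the cue strings other than the angry one are hypothetical, and the regular expression merely mimics the effect of dropping the bracketed cue from the spoken output.

```python
import re

# Hypothetical emotional cues; only the "angry" cue comes from the example above.
EMOTION_CUES = {
    "angry": "I am so angry,",
    "sad": "I am so sad,",
    "happy": "I am so happy,",
}

# Matches any bracketed span, e.g. "[I am so angry,]".
_BRACKETED = re.compile(r"\[[^\]]*\]")


def build_prompt(cue: str, text: str) -> str:
    """Prefix the text with an emotional cue wrapped in square brackets."""
    return f"[{cue.strip()}] {text.strip()}"


def spoken_text(prompt: str) -> str:
    """Return the part of the prompt that is expected to be spoken aloud.

    Bracketed cues are removed and leftover whitespace is collapsed.
    """
    without_cues = _BRACKETED.sub("", prompt)
    return " ".join(without_cues.split())


if __name__ == "__main__":
    sentence = "I went to the park and threw a ball"
    for emotion, cue in EMOTION_CUES.items():
        prompt = build_prompt(cue, sentence)
        print(f"{emotion}: {prompt!r} -> {spoken_text(prompt)!r}")
```

The string returned by `build_prompt` is what would be passed to the program as its text input; `spoken_text` only shows which words should remain audible in the result.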
Following are a few examples of different prompts. The effect is subtle but definitely there. Many voices are less affected by this.
Angry: