Introduction 🐢

TorToiSe is a text-to-speech program built in April 2022 by jbetker@. TorToiSe is open source, with trained model weights available at https://github.com/neonbjb/tortoise-tts

This page demonstrates some of the results of TorToiSe.

Handpicked results 🐢

Following are several particularly good results generated by the model.

Short-form

Long-form

Comparisons (with the LJSpeech voice): 🐢

LJSpeech is a popular dataset used to train small-scale TTS models. TorToiSe is a multi-voice model; the following shows how it renders the LJSpeech voice with and without fine-tuning, compared with results for the same text from the popular Tacotron2 model paired with the Waveglow vocoder.

Tacotron2+Waveglow	TorToiSe	TorToiSe Finetuned

NaturalVoice is a SOTA TTS engine developed by Microsoft Research Asia in May 2022. It features realistic prosody and end-to-end generation with no need for a vocoder. While not much has actually been released about this model other than five samples, those samples are quite good and I would consider this the most competitive TTS engine out there right now.

NaturalVoice	TorToiSe Finetuned

It is important to note that comparing any of these models is not really fair: TorToiSe is a multi-voice probabilistic model, trained on millions of hours of speech, with an exceptionally slow inference time, while Tacotron and NaturalVoice are efficient, fast, single-voice models trained on 24 hours of speech. Unfortunately, there isn't much research that is directly comparable to TorToiSe.

All Results 🐢

Following are all the results from which the handpicked results were drawn. Also included is the reference audio that the program is trying to mimic. This will give you a better sense of how TorToiSe really performs.

Short-form

text	angie	daniel	deniro	emma	freeman	geralt	halle	jlaw	lj	myself	pat	snakes	tom	train_atkins	train_dotrice	train_kennard	weaver	william
reference clip
autoregressive_ml
bengio_it_needs_to_know_what_is_bad
dickinson_stop_for_death
espn_basketball
frost_oar_to_oar
frost_road_not_taken
gatsby_and_so_we_beat_on
harrypotter_differences_of_habit_and_language
i_am_a_language_model
melodie_kao
nyt_covid
real_courage_is_when_you_know_your_licked
rolling_stone_review
spacecraft_interview
tacotron2_sample1
tacotron2_sample2
tacotron2_sample3
tacotron2_sample4
watts_this_is_the_real_secret_of_life
wilde_nowadays_people_know_the_price

Long-form

Angelina:

Craig:

Deniro:

Emma:

Freeman:

Geralt:

Halle:

Jlaw:

LJ:

Myself:

Pat:

Snakes:

Tom:

Weaver:

William:

Prompt Engineering 🐢

TorToiSe is capable of "prompt engineering", in that tone and prosody are affected by the emotion expressed in the words fed to the program. For example, prompting the model with "[I am so angry,] I went to the park and threw a ball" will result in it outputting "I went to the park and threw a ball" with an angry tone.
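The bracket convention in the example above can be sketched in a few lines of Python. This is only an illustration of how such prompts are laid out: the helper names `build_prompt` and `spoken_text` are hypothetical and are not part of the TorToiSe codebase, and the sketch assumes the emotional cue is always wrapped in square brackets and followed by a comma, as in the example. It does not call the model or reproduce how the program itself handles the bracketed portion.

```python
import re

# Hypothetical helpers for illustration only; not part of TorToiSe.
# Assumption: the emotional cue is the text inside square brackets.
BRACKETED = re.compile(r"\[[^\]]*\]")


def build_prompt(emotion_cue: str, text: str) -> str:
    """Prefix the text to be spoken with a bracketed emotional cue."""
    return f"[{emotion_cue},] {text}"


def spoken_text(prompt: str) -> str:
    """Return the part of the prompt that is expected to be spoken,
    i.e. the prompt with every bracketed span removed."""
    stripped = BRACKETED.sub("", prompt)
    # Collapse the whitespace left behind by the removed span.
    return " ".join(stripped.split())


prompt = build_prompt("I am so angry", "I went to the park and threw a ball")
print(prompt)
print(spoken_text(prompt))
```

The same pattern works for the other emotions by swapping the cue, for example `build_prompt("I am so sad", ...)`.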

Following are a few examples of different prompts. The effect is subtle, but it is definitely there. Many voices are less affected by this.

Angry:

Sad:

Happy:

Scared: