Experimental training runs of the SpeechT5 model were carried out with the aim of adapting the base model for text-to-speech conversion.
As the original SpeechT5 model was trained exclusively on English tasks (the LibriTTS dataset), a new model had to be trained on the available data from a group of regional languages (Montenegrin, Serbian, Bosnian, Croatian). One of the popular open datasets for this purpose is VoxPopuli, which contains recordings of the European Parliament from 2009 to 2020. Since data is not available to the required extent in all of the regional languages, the Croatian data, as the best represented, was taken from the VoxPopuli dataset. In the next stages of the project, data will be collected in Montenegrin, Serbian, and Bosnian in order to improve training quality and model accuracy.
The final dataset thus consists of 43 hours of transcribed speech from 83 different speakers, totaling roughly 337 thousand transcribed tokens (1 token ≈ 3/4 of a word).
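As an illustration, the Croatian portion of VoxPopuli can be loaded directly from the Hugging Face Hub. The sketch below assumes the public `facebook/voxpopuli` release with the `hr` (Croatian) config; the project's exact data pipeline may differ.

```python
from datasets import load_dataset, Audio

# Croatian ("hr") subset of the public VoxPopuli release on the Hub
# (assumed source; adjust to the data actually used).
dataset = load_dataset("facebook/voxpopuli", "hr", split="train")

# SpeechT5 works with 16 kHz audio, so resample the audio column.
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))

print(len(dataset), dataset[0]["normalized_text"])
```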
In the first phase of the technical implementation, the dataset went through several processing stages to adapt and standardize it for training the SpeechT5 model. These are standard methods for manipulating linguistic data in natural language processing: vocabulary construction, tokenization, removal or conversion of unsupported characters, text/speech cleaning, and text normalization.
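A minimal sketch of the character-level cleanup, continuing from the loading snippet above. It assumes the English `microsoft/speecht5_tts` tokenizer; the mapping of Croatian diacritics to ASCII shown here is illustrative, not the project's exact replacement table.

```python
from transformers import SpeechT5Processor

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")

# Illustrative mapping of characters the English tokenizer does not cover;
# the actual replacement table used in training is an assumption here.
replacements = [("č", "c"), ("ć", "c"), ("đ", "d"), ("š", "s"), ("ž", "z")]

def cleanup_text(example):
    for src, dst in replacements:
        example["normalized_text"] = example["normalized_text"].replace(src, dst)
    return example

dataset = dataset.map(cleanup_text)
```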
In the next phase, per-speaker statistics of the VoxPopuli dataset were analyzed, and speakers with satisfactory text/speech quality and a sufficient number of samples were selected for model training. The dataset was also balanced in this phase so that male and female speakers with high-quality text/speech samples were equally represented in training.
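The speaker selection can be sketched as a simple frequency filter over the `speaker_id` column; the thresholds below are illustrative, not the project's exact values.

```python
from collections import defaultdict

# Count utterances per speaker.
speaker_counts = defaultdict(int)
for speaker_id in dataset["speaker_id"]:
    speaker_counts[speaker_id] += 1

# Keep speakers with enough samples to learn from, but cap the count so
# that no single speaker dominates the set (thresholds are illustrative).
def select_speaker(speaker_id):
    return 100 <= speaker_counts[speaker_id] <= 400

dataset = dataset.filter(select_speaker, input_columns=["speaker_id"])
```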
After data preparation, the hyperparameters of the SpeechT5 model were adjusted and optimized so that training could be performed quickly and efficiently with satisfactory accuracy. Several experimental training runs were carried out to find the optimal hyperparameters, which were then used in the model evaluation phase.
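A sketch of such a training configuration, using `Seq2SeqTrainingArguments` from `transformers`. The values shown are plausible starting points for SpeechT5 fine-tuning, not the tuned hyperparameters found in these experiments, and the output directory name is hypothetical.

```python
from transformers import Seq2SeqTrainingArguments

# Plausible starting values; the final hyperparameters were determined
# experimentally and may differ.
training_args = Seq2SeqTrainingArguments(
    output_dir="speecht5_finetuned_voxpopuli_hr",  # hypothetical name
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=4000,
    fp16=True,
    save_steps=1000,
)
```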
Evaluation of the model on the portion of the dataset reserved for testing showed promising results. The model learned from the prepared dataset, but it also exhibited certain limitations. The main limitation concerns the length of the input text sequence: the model was unable to generate speech for long input sequences (over 20 words). This was overcome by splitting the input sequence into smaller units, which were then passed to the model for processing, as sketched below. The main reason for this limitation is the lack of a large amount of fine-tuning data, which is needed to obtain the best possible results.
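The workaround for long inputs can be sketched as naive chunking by word count. The fine-tuned checkpoint name below is hypothetical, and the speaker embedding is assumed to be a (1, 512) x-vector, e.g. extracted with `speechbrain/spkrec-xvect-voxceleb` as in the SpeechT5 examples.

```python
import numpy as np
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("speecht5_finetuned_voxpopuli_hr")  # hypothetical
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

def synthesize_long_text(text, speaker_embedding, max_words=20):
    """Split the input into chunks of at most max_words and concatenate the audio."""
    words = text.split()
    chunks = [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]
    waveforms = []
    for chunk in chunks:
        inputs = processor(text=chunk, return_tensors="pt")
        # speaker_embedding: a (1, 512) x-vector tensor for the target voice.
        speech = model.generate_speech(inputs["input_ids"], speaker_embedding, vocoder=vocoder)
        waveforms.append(speech.numpy())
    return np.concatenate(waveforms)
```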