Cretan XLS-R model

Cretan is a variety of Modern Greek used predominantly by speakers who live on the island of Crete or belong to the Cretan diaspora. This includes communities of Cretan origin that were relocated to the village of Hamidieh in Syria and to Western Asia Minor, following the population exchange between Greece and Turkey in 1923. The historical and geographical factors that have shaped the development and preservation of the dialect include the long-term isolation of Crete from the mainland and the successive domination of the island by foreign powers, such as the Arabs, the Venetians, and the Turks, over a period of seven centuries. On the basis of its phonological, phonetic, morphological, and lexical characteristics, Cretan has been divided into two major dialect groups: the western and the eastern. The boundary between these groups coincides with the administrative division of the island into the prefectures of Rethymno and Heraklion. Kontosopoulos (2008) argues that the eastern dialect group is more homogeneous than the western one, which shows more variation across all levels of linguistic analysis. Unlike other Modern Greek dialects, Cretan does not face the threat of extinction, as it remains the sole means of communication for a large number of speakers in various parts of the island.

This is the first automatic speech recognition (ASR) model for Cretan. To train the model, we fine-tuned a Greek XLS-R model (jonatasgrosman/wav2vec2-large-xlsr-53-greek) on a dataset of recorded Cretan speech (1h 21m 12s in total, including the held-out test split; see Resources).

Resources

For the compilation of the Cretan corpus, we gathered 32 tapes containing material from radio broadcasts in digital format, with permission from the Audiovisual Department of the Vikelaia Municipal Library of Heraklion, Crete. These broadcasts were recorded and aired by Radio Mires, in the Messara region of Heraklion, during the period 1998-2001, and total 958 minutes and 47 seconds. The recordings consist primarily of narratives by a single speaker, Ioannis Anagnostakis, who also composed them. In terms of genre, the broadcasts are folklore narratives told in the local linguistic variety. Of the material collected, we used nine tapes, selected both to maximize the digital clarity of the speech and to sample representatively across the entire three-year period of radio recordings. To obtain an initial transcription, we used Whisper Large-v2, which was the largest Whisper model at the time. The transcripts were then manually corrected in collaboration with the local community. The transcription follows the Greek alphabet and orthography, and the annotation was done in Praat.

To prepare the dataset, the texts were normalized (see greek_dialects_asr/ for scripts), and all audio files were converted into a 16 kHz mono format.
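As a rough illustration of what "16 kHz mono" involves, the sketch below downmixes multi-channel frames by averaging and resamples with plain linear interpolation. It is not the project's conversion code (that lives with the scripts mentioned above); the function names `to_mono` and `resample_linear` are hypothetical, and a real pipeline would normally rely on a dedicated audio tool that applies a proper low-pass filter before downsampling.

```python
def to_mono(frames):
    """Downmix by averaging the channels of each frame: [(l, r), ...] -> [m, ...]."""
    return [sum(frame) / len(frame) for frame in frames]


def resample_linear(samples, src_rate, dst_rate=16_000):
    """Resample by linear interpolation.

    Illustrative only: there is no anti-aliasing filter, so this is not
    suitable for producing training data.
    """
    if not samples:
        return []
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate          # position in the source signal
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)     # clamp at the last sample
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out


# One second of silent stereo audio at 48 kHz becomes 16,000 mono samples.
stereo = [(0.0, 0.0)] * 48_000
mono_16k = resample_linear(to_mono(stereo), src_rate=48_000)
```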

We split the Praat annotations into audio-transcription segments, which resulted in a dataset of a total duration of 1h 21m 12s. Note that the removal of music, long pauses, and non-transcribed segments leads to a reduction of the total audio duration (compared to the initial 2h recordings of the 9 tapes).
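The segmentation step can be pictured with the minimal sketch below. It assumes the Praat intervals have already been parsed into (start, end, text) records and that intervals with an empty label correspond to music, pauses, or untranscribed speech; both are assumptions for illustration, and `Segment`, `cut_segments`, and `total_duration` are hypothetical names rather than part of the project's scripts.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds
    end: float    # seconds
    text: str     # transcription; empty if nothing was transcribed


def cut_segments(samples, sample_rate, intervals):
    """Return (audio_slice, text) pairs for intervals that carry a transcription."""
    pairs = []
    for seg in intervals:
        text = seg.text.strip()
        if not text:
            continue  # skip music, long pauses, and non-transcribed stretches
        lo = int(round(seg.start * sample_rate))
        hi = int(round(seg.end * sample_rate))
        pairs.append((samples[lo:hi], text))
    return pairs


def total_duration(pairs, sample_rate):
    """Total duration in seconds of the kept audio slices."""
    return sum(len(audio) for audio, _ in pairs) / sample_rate


# Ten seconds of audio at 16 kHz, of which only 3.5 seconds are transcribed.
audio = [0.0] * (16_000 * 10)
intervals = [
    Segment(0.0, 2.0, "καλημέρα"),
    Segment(2.0, 5.0, ""),
    Segment(5.0, 6.5, "γεια σου"),
]
pairs = cut_segments(audio, 16_000, intervals)
```

Dropping the unlabeled intervals is what makes the dataset shorter than the source recordings.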

Metrics

We evaluated the model on the test set split, which consists of 10% of the dataset recordings.
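How the 10% was selected is not specified here. One common approach, shown purely as an assumption, is a seeded random hold-out; `train_test_split` below is a hypothetical helper, not the project's code.

```python
import random


def train_test_split(items, test_fraction=0.10, seed=0):
    """Shuffle deterministically and hold out a fraction of the items for testing."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# With 100 segments, 10 are held out and 90 remain for training.
train_items, test_items = train_test_split(range(100))
```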

Model	WER	CER
pre-trained	104.83%	91.73%
fine-tuned	28.27%	7.88%
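For readers unfamiliar with the metrics, word error rate (WER) and character error rate (CER) are edit distances normalized by the length of the reference. The sketch below is a minimal per-utterance implementation using the standard Levenshtein recurrence; it is not necessarily the tooling used to produce the table, and corpus-level scores are usually computed by summing edits over all utterances before dividing.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Two reference words, three wrong hypothesis words: 2 substitutions + 1 insertion.
over_100 = wer("a b", "x y z")
```

Because insertions count as errors while the denominator is the reference length, the rate can exceed 100%, which is how the pre-trained model reaches a WER of 104.83%.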

Training hyperparameters

We fine-tuned the baseline model (wav2vec2-large-xlsr-53-greek) on an NVIDIA GeForce RTX 3090, using the following hyperparameters:

arg	value
`per_device_train_batch_size`	8
`gradient_accumulation_steps`	2
`num_train_epochs`	35
`learning_rate`	3e-4
`warmup_steps`	500
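The argument names in the table match fields of the Hugging Face `TrainingArguments` class, although the training script itself is not shown here. As a small sketch, the values can be kept in a plain dictionary, from which the effective batch size per optimizer step follows; `effective_batch_size` is a hypothetical helper, and the single-device default reflects the one GPU mentioned above.

```python
HPARAMS = {
    "per_device_train_batch_size": 8,
    "gradient_accumulation_steps": 2,
    "num_train_epochs": 35,
    "learning_rate": 3e-4,
    "warmup_steps": 500,
}


def effective_batch_size(hparams, num_devices=1):
    """Examples contributing to each optimizer step (num_devices=1 is an assumption)."""
    return (hparams["per_device_train_batch_size"]
            * hparams["gradient_accumulation_steps"]
            * num_devices)


print(effective_batch_size(HPARAMS))  # 16
```

With gradient accumulation over 2 steps, each weight update therefore sees 16 examples even though only 8 are held in GPU memory at a time.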

Citation

To cite this work or read more about the training pipeline, see:

S. Vakirtzian, C. Tsoukala, S. Bompolas, K. Mouzou, V. Stamou, G. Paraskevopoulos, A. Dimakis, S. Markantonatou, A. Ralli, A. Anastasopoulos, Speech Recognition for Greek Dialects: A Challenging Benchmark, Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2024.