SpeechT5 for Arabic (TTS task)
This model starts from the pretrained ArTST weights and is fine-tuned, using the Hugging Face implementation of SpeechT5, on the Classical Arabic dataset ClArTTS for speech synthesis (text-to-speech).
ArTST and its pretrained weights were first released in the original ArTST repository.
You can run ArTST TTS locally with the 🤗 Transformers library.
pip install --upgrade pip
pip install --upgrade transformers sentencepiece "datasets[audio]" torch
You can access the Arabic SpeechT5 model via the Text-to-Speech (TTS) pipeline in just a few lines of code!
from transformers import pipeline
from datasets import load_dataset
import soundfile as sf
import torch
synthesiser = pipeline("text-to-speech", "MBZUAI/speecht5_tts_clartts_ar")
embeddings_dataset = load_dataset("herwoww/arabic_xvector_embeddings", split="validation")
speaker_embedding = torch.tensor(embeddings_dataset[105]["speaker_embeddings"]).unsqueeze(0)
# You can replace this embedding with your own as well.
speech = synthesiser("لأنه لا يرى أنه على السفه ثم من بعد ذلك حديث منتشر", forward_params={"speaker_embeddings": speaker_embedding})
# ArTST is trained without diacritics.
sf.write("speech.wav", speech["audio"], samplerate=speech["sampling_rate"])
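Since the comment above notes that ArTST is trained without diacritics, it may help to strip them from input text before synthesis. The following is a minimal sketch, not part of ArTST or Transformers: `strip_diacritics` is a hypothetical helper, and it assumes that removing the Unicode range U+064B to U+0652, the superscript alef (U+0670), and the tatweel (U+0640) is sufficient for your text.

```python
import re

# Assumption: these code points cover the diacritics in your input.
# U+064B-U+0652: tanween, fatha, damma, kasra, shadda, sukun
# U+0670: superscript alef; U+0640: tatweel (elongation character)
_DIACRITICS = re.compile("[\u064B-\u0652\u0670\u0640]")


def strip_diacritics(text: str) -> str:
    """Return the text with Arabic diacritics and tatweel removed."""
    return _DIACRITICS.sub("", text)


if __name__ == "__main__":
    # Diacritized input; the base letters are left untouched.
    print(strip_diacritics("كَتَبَ"))
```

The cleaned string can then be passed to `synthesiser(...)` in place of the raw text.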
You can also run inference directly with the SpeechT5 model classes:
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
from datasets import load_dataset
import torch
import soundfile as sf
processor = SpeechT5Processor.from_pretrained("MBZUAI/speecht5_tts_clartts_ar")
model = SpeechT5ForTextToSpeech.from_pretrained("MBZUAI/speecht5_tts_clartts_ar")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
inputs = processor(text="لأنه لا يرى أنه على السفه ثم من بعد ذلك حديث منتشر", return_tensors="pt")
# load xvector containing speaker's voice characteristics from a dataset
embeddings_dataset = load_dataset("herwoww/arabic_xvector_embeddings", split="validation")
speaker_embedding = torch.tensor(embeddings_dataset[105]["speaker_embeddings"]).unsqueeze(0)
speech = model.generate_speech(inputs["input_ids"], speaker_embedding, vocoder=vocoder)
sf.write("speech.wav", speech.numpy(), samplerate=16000)
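The example above synthesises a single short sentence. For longer passages, one possible approach is to split the text into smaller chunks and synthesise each chunk separately. The sketch below is an illustration only: `split_text` is a hypothetical helper, the `max_chars` default of 200 is an arbitrary assumption rather than a documented model limit, and the sentence boundaries are guessed from Latin and Arabic punctuation.

```python
import re
from typing import List


def split_text(text: str, max_chars: int = 200) -> List[str]:
    """Split text at sentence punctuation, then pack sentences into
    chunks of at most max_chars characters where possible.

    A single sentence longer than max_chars is kept whole.
    """
    # Split on whitespace that follows ., !, ?, the Arabic question
    # mark (U+061F) or the Arabic semicolon (U+061B).
    parts = re.split(r"(?<=[.!?\u061F\u061B])\s+", text)
    sentences = [p.strip() for p in parts if p.strip()]

    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then be run through `processor` and `model.generate_speech` as above, and the resulting waveforms joined (for example with `torch.cat`) before writing the file.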
BibTeX:
@inproceedings{toyin-etal-2023-artst,
title = "{A}r{TST}: {A}rabic Text and Speech Transformer",
author = "Toyin, Hawau and
Djanibekov, Amirbek and
Kulkarni, Ajinkya and
Aldarmaki, Hanan",
editor = "Sawaf, Hassan and
El-Beltagy, Samhaa and
Zaghouani, Wajdi and
Magdy, Walid and
Abdelali, Ahmed and
Tomeh, Nadi and
Abu Farha, Ibrahim and
Habash, Nizar and
Khalifa, Salam and
Keleg, Amr and
Haddad, Hatem and
Zitouni, Imed and
Mrini, Khalil and
Almatham, Rawan",
booktitle = "Proceedings of ArabicNLP 2023",
month = dec,
year = "2023",
address = "Singapore (Hybrid)",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.arabicnlp-1.5",
pages = "41--51"
}
@inproceedings{ao-etal-2022-speecht5,
title = {{S}peech{T}5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing},
author = {Ao, Junyi and Wang, Rui and Zhou, Long and Wang, Chengyi and Ren, Shuo and Wu, Yu and Liu, Shujie and Ko, Tom and Li, Qing and Zhang, Yu and Wei, Zhihua and Qian, Yao and Li, Jinyu and Wei, Furu},
booktitle = {Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
month = {May},
year = {2022},
pages={5723--5738},
}