metadata
license: mit
language:
- bn
metrics:
- wer
- cer
tags:
- seq2seq
- ipa
- bengali
- byt5
Regional Bengali text to IPA transcription - ByT5-small
A word of caution: the model is updated frequently, so you may see jumps in performance between versions.
This is a fine-tuned version of google/byt5-small for generating IPA transcriptions from regional Bengali text. Fine-tuning was done on the dataset of the Bengali.AI competition “ভাষামূল: মুখের ভাষার খোঁজে”.
Best scores achieved in the leaderboards:
- Public score: 0.01995
- Private score: 0.02072
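The metadata above lists WER and CER as metrics; the card does not say which one the leaderboard scores refer to. As a rough sketch of how both are usually computed (edit distance divided by reference length), the following pure-Python helpers may be useful for local evaluation. The function names `edit_distance`, `wer`, and `cer` are illustrative and not part of this model or of the competition's scoring code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate over Unicode code points (not grapheme clusters)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Both functions take a reference transcription and a model output as plain strings. Because `cer` counts code points, IPA symbols written with combining diacritics count as more than one character.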
Supported district tokens:
- Kishoreganj
- Narail
- Narsingdi
- Chittagong
- Rangpur
- Tangail
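Since every input must start with a district token, a small helper can catch unsupported districts before they reach the model. This is a minimal sketch: `build_input` and `SUPPORTED_DISTRICTS` are hypothetical names, and the default template (district name wrapped in angle brackets, cased as in the list above) is an assumption. The card does not spell out the exact token form, so check it against the tokenizer or training data and pass a different `template` if needed.

```python
# District names copied from the list above.
SUPPORTED_DISTRICTS = (
    "Kishoreganj",
    "Narail",
    "Narsingdi",
    "Chittagong",
    "Rangpur",
    "Tangail",
)


def build_input(district, text, template="<{district}> {text}"):
    """Prefix `text` with a district token.

    ASSUMPTION: the default template is a guess at the token format;
    the model card only states "<district> <bengali_text>".
    """
    if district not in SUPPORTED_DISTRICTS:
        raise ValueError(
            f"Unsupported district {district!r}; "
            f"expected one of {', '.join(SUPPORTED_DISTRICTS)}"
        )
    return template.format(district=district, text=text.strip())
```

The returned string can then be passed to the tokenizer or pipeline in place of a hand-written prompt.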
Loading & using the model
# Load model directly
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("smji/ben2ipa-byt5small")
model = AutoModelForSeq2SeqLM.from_pretrained("smji/ben2ipa-byt5small")
"""
The format of the input text MUST BE: <district> <bengali_text>
"""
text = "<district> bengali_text_here"
text_ids = tokenizer(text, return_tensors='pt').input_ids
model(text_ids)
Using the pipeline
# Use a pipeline as a high-level helper
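import torch
from transformers import pipeline
device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = pipeline("text2text-generation", model="smji/ben2ipa-byt5small", device=device)
# Each text must be in the format: <district> <contents>
texts = ["<district> bengali_text_here"]
batch_size = 8  # illustrative value; tune to your hardware
outputs = pipe(texts, max_length=1024, batch_size=batch_size)
# Each output is a dict; the transcription is under the "generated_text" key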
Credits
Developed by S M Jishanul Islam, Sadia Ahmmed, and Sahid Hossain Mustakim.