---
license: mit
language:
- bn
metrics:
- wer
- cer
tags:
- seq2seq
- ipa
- bengali
- byt5
---

# Regional Bengali text to IPA transcription - byT5-small

This is a fine-tuned version of [byt5-small](https://huggingface.co/google/byt5-small) for generating IPA transcriptions from regional Bengali text.
It was trained on the dataset of the competition [“ভাষামূল: মুখের ভাষার খোঁজে”](https://www.kaggle.com/competitions/regipa/overview) by Bengali.AI.

Best scores achieved on the competition leaderboards:

- **Public score**: 0.01995
- **Private score**: 0.02072
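
The card's metadata lists `wer` and `cer` as metrics. For readers unfamiliar with them, the following is a minimal sketch of how word and character error rates are commonly computed from edit distance. It is not the competition's official scorer, and the function names `edit_distance`, `wer`, and `cer` are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Lower values are better for both metrics; a value of 0 means the hypothesis matches the reference exactly.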

## Loading & using the model

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("smji/ben2ipa-byt5small")
model = AutoModelForSeq2SeqLM.from_pretrained("smji/ben2ipa-byt5small")

# The format of the input text must be: <district> <bengali_text>
text = "<Chittagong> bengali_text_here"
text_ids = tokenizer(text, return_tensors="pt").input_ids

# Seq2seq inference goes through generate(); a bare forward pass needs decoder inputs.
# max_new_tokens=256 is an assumed cap; adjust it to the length of your inputs.
output_ids = model.generate(text_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```
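
Since the input must follow the `<district> <bengali_text>` format, a small helper can build it consistently. This is a minimal sketch: `format_input` is a hypothetical name, and it assumes the district tag is wrapped in angle brackets and followed by a single space, as in the example above.

```python
def format_input(district: str, bengali_text: str) -> str:
    """Build the '<district> <bengali_text>' string described in this card."""
    # Accept the district with or without angle brackets.
    district = district.strip().strip("<>")
    bengali_text = bengali_text.strip()
    if not district or not bengali_text:
        raise ValueError("district and bengali_text must both be non-empty")
    return f"<{district}> {bengali_text}"


print(format_input("Chittagong", "bengali_text_here"))
```

The returned string can be passed to the tokenizer in place of a hand-written prompt.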

## Using the pipeline

```python
# Use a pipeline as a high-level helper
import torch
from transformers import pipeline

# Use the first GPU if available, otherwise the CPU (-1)
device = 0 if torch.cuda.is_available() else -1
pipe = pipeline("text2text-generation", model="smji/ben2ipa-byt5small", device=device)
```
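
For a single input string, a `text2text-generation` pipeline returns a list of dicts with a `generated_text` key. The sketch below wraps that in a hypothetical `transcribe` helper that takes already formatted prompts and returns plain strings. The helper name and the `max_new_tokens` value in the usage comment are assumptions, not part of the original card.

```python
from typing import Callable, List


def transcribe(pipe: Callable, prompts: List[str], **generate_kwargs) -> List[str]:
    """Run each '<district> <bengali_text>' prompt through the pipeline."""
    results = []
    for prompt in prompts:
        # Each call returns a list with one dict per generated sequence.
        outputs = pipe(prompt, **generate_kwargs)
        results.append(outputs[0]["generated_text"])
    return results


# Usage with a pipeline object created as shown in this section:
# ipa = transcribe(pipe, ["<Chittagong> bengali_text_here"], max_new_tokens=256)
```

Extra keyword arguments are forwarded to the pipeline call, so generation settings can be changed without editing the helper.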

## Credits

Done by [S M Jishanul Islam](https://github.com/S-M-J-I), [Sadia Ahmmed](https://github.com/sadia-ahmmed), and [Sahid Hossain Mustakim](https://github.com/sratul35).