---
license: apache-2.0
language:
- bn
metrics:
- wer
- cer
tags:
- seq2seq
- ipa
- bengali
- byt5
widget:
- text: <Narail> আমি সে বাবুর মামু বাড়ি গিছিলাম।
  example_title: Narail Text
- text: <Rangpur> এখন এই কুলো তার শেষ অই কুলো তার শেষ।
  example_title: Rangpur Text
- text: <Chittagong> খয়দে সিআরের এইল্লা কি অবস্থা!
  example_title: Chittagong Text
- text: <Kishoreganj> আটাইশ করছিলাম দের কানি ক্ষেত, ইবার মাইর কাইছি।
  example_title: Kishoreganj Text
- text: <Narsingdi> তারা তো ওই খারাপ খেইলাই আসে না।
  example_title: Narsingdi Text
- text: <Tangail> আর সব থেকে ফানি কথা হইতেছে দেখ?
  example_title: Tangail Text
---


# Regional Bengali text to IPA transcription - ByT5-small


This is a fine-tuned version of [google/byt5-small](https://huggingface.co/google/byt5-small) for generating IPA transcriptions from regional Bengali text.
It was fine-tuned on the dataset from the [“ভাষামূল: মুখের ভাষার খোঁজে“](https://www.kaggle.com/competitions/regipa/overview) competition by Bengali.AI.

Model performance:
- **Word error rate (WER)**: 0.0124279344454407
- **Character error rate (CER)**: 0.00427635805681347
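
For readers unfamiliar with these metrics, the sketch below shows the standard definitions: edit distance between reference and prediction, divided by the reference length, counted in words for WER and in characters for CER. This card does not say which tool or data split produced the numbers above, so the function names `edit_distance`, `wer`, and `cer` are illustrative assumptions, not the authors' evaluation code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert/delete/substitute cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Lower is better for both metrics; a value of 0 means the prediction matches the reference exactly.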


Supported district tokens:
- Kishoreganj
- Narail
- Narsingdi
- Chittagong
- Rangpur
- Tangail
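
Since the model expects every input to start with one of these district tokens, a small helper can build the input string and catch unsupported districts early. This is a minimal sketch: `SUPPORTED_DISTRICTS` and `format_input` are hypothetical names introduced here for illustration and are not part of the model or the `transformers` library.

```python
# District names copied from the list on this model card.
SUPPORTED_DISTRICTS = {
    "Kishoreganj", "Narail", "Narsingdi", "Chittagong", "Rangpur", "Tangail",
}


def format_input(district: str, text: str) -> str:
    """Hypothetical helper: return '<district> text', rejecting unknown districts."""
    if district not in SUPPORTED_DISTRICTS:
        raise ValueError(
            f"Unsupported district {district!r}; "
            f"expected one of {sorted(SUPPORTED_DISTRICTS)}"
        )
    # The district token is wrapped in angle brackets and followed by one space.
    return f"<{district}> {text.strip()}"
```

For example, `format_input("Rangpur", "এখন এই কুলো তার শেষ অই কুলো তার শেষ।")` produces the same string as the Rangpur widget example in this card.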

---

## Loading & using the model
```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("teamapocalypseml/ben2ipa-byt5small")
model = AutoModelForSeq2SeqLM.from_pretrained("teamapocalypseml/ben2ipa-byt5small")

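# The input text MUST be in the format: <district> <bengali_text>
text = "<district> bengali_text_here"
text_ids = tokenizer(text, return_tensors="pt").input_ids

# Use generate() for seq2seq inference; a bare forward pass would also need decoder inputs
output_ids = model.generate(text_ids, max_length=1024)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))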
```


## Using the pipeline
```python
# Use a pipeline as a high-level helper
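import torch
from transformers import pipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

pipe = pipeline("text2text-generation", model="teamapocalypseml/ben2ipa-byt5small", device=device)

# Each entry in `texts` must be in the format: <district> <contents>
texts = ["<Narail> আমি সে বাবুর মামু বাড়ি গিছিলাম।"]  # example taken from the widget list
batch_size = 8  # illustrative value; tune for your hardware

outputs = pipe(texts, max_length=1024, batch_size=batch_size)
# Each output is a dict with a "generated_text" key
print(outputs[0]["generated_text"])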
```

## Credits
Developed by [S M Jishanul Islam](https://github.com/S-M-J-I), [Sadia Ahmmed](https://huggingface.co/sadiaahmmed), and [Sahid Hossain Mustakim](https://huggingface.co/rhsm15).