---
language:
- en
- bn
library_name: transformers
license: apache-2.0
tags:
- transformers
- gemma2
- gemma
---

# rishiraj/gemma-2-9b-bn
This repository extends the `google/gemma-2-9b` tokenizer by training it on Bengali text. The original tokenizer splits many Bengali words into small subword pieces, which is inefficient and obscures word meaning. The extended tokenizer preserves word integrity far better, covering the same text with fewer splits and yielding a more meaningful representation.
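
The exact extension recipe is not published in this repository. As a rough illustration, the sketch below shows one plausible way to build such an extension with the `transformers` API; the corpus file `bn_corpus.txt`, the batch size, and the target vocabulary size are all hypothetical placeholders.

```python
# Hypothetical sketch only: the actual training setup for this repo may differ.
from transformers import AutoTokenizer

base = AutoTokenizer.from_pretrained("google/gemma-2-9b")

def bengali_batches(path="bn_corpus.txt", batch_size=1000):
    """Yield batches of lines from a (hypothetical) Bengali text corpus."""
    with open(path, encoding="utf-8") as f:
        batch = []
        for line in f:
            batch.append(line.strip())
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            yield batch

# Train a Bengali tokenizer that reuses the base tokenizer's algorithm.
bn_tokenizer = base.train_new_from_iterator(bengali_batches(), vocab_size=150_000)

# Extend the base vocabulary with the Bengali tokens it is missing.
base_vocab = set(base.get_vocab())
new_tokens = [t for t in bn_tokenizer.get_vocab() if t not in base_vocab]
base.add_tokens(new_tokens)
print(len(base))  # extended vocabulary size
```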

## Token Information

| Tokenizer                | Vocabulary Size |
|--------------------------|-----------------|
| `google/gemma-2-9b`      | 256,000         |
| `rishiraj/gemma-2-9b-bn` | 392,402         |
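
These counts can be checked directly (note that `google/gemma-2-9b` is a gated repository, so loading it requires accepting its license on the Hub):

```python
from transformers import AutoTokenizer

for name in ["google/gemma-2-9b", "rishiraj/gemma-2-9b-bn"]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    print(name, len(tokenizer))  # vocabulary size, including added tokens
```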

### Why Fewer New Tokens for Bengali?
The extension adds roughly 136,000 Bengali tokens to the original 256,000-token vocabulary. While Bengali is very expressive and flexible, it has not absorbed loanwords from as many languages as English has, so a comparatively small set of new tokens covers most Bengali word forms.

## Tokenizer Comparison

**Text:**
```text
আমি একজন ভালো ছেলে এবং আমি ফুটবল খেলতে পছন্দ করি
```

| Tokenizer                  | Output                                                                                                               |
|----------------------------|----------------------------------------------------------------------------------------------------------------------|
| `google/gemma-2-9b`         | ['আ', 'মি', '▁এক', 'জন', '▁ভ', 'াল', 'ো', '▁', 'ছে', 'লে', '▁এবং', '▁আম', 'ি', '▁ফ', 'ু', 'ট', 'ব', 'ল', '▁খ', 'েল', 'তে', '▁প', 'ছ', 'ন্দ', '▁কর', 'ি'] |
| `rishiraj/gemma-2-9b-bn`    | ['আমি', '▁একজন', '▁ভালো', '▁ছেলে', '▁এবং', '▁আমি', '▁ফুটবল', '▁খেলতে', '▁পছন্দ', '▁করি']                                                      |
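
The base tokenizer spends 26 tokens on this sentence, while the extended tokenizer needs only 10, and every token in the extended output is a whole word. To reproduce the comparison (again assuming access to the gated `google/gemma-2-9b` repository):

```python
from transformers import AutoTokenizer

text = "আমি একজন ভালো ছেলে এবং আমি ফুটবল খেলতে পছন্দ করি"

for name in ["google/gemma-2-9b", "rishiraj/gemma-2-9b-bn"]:
    tokens = AutoTokenizer.from_pretrained(name).tokenize(text)
    print(f"{name}: {len(tokens)} tokens -> {tokens}")
```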

## Usage

1. Install dependencies:
   ```bash
   pip install transformers
   ```

2. Load and use the tokenizer:
   ```python
   from transformers import AutoTokenizer

   # Load the extended Bengali tokenizer from the Hugging Face Hub
   tokenizer = AutoTokenizer.from_pretrained("rishiraj/gemma-2-9b-bn")

   # "I am a good boy and I like to play football"
   tokens = tokenizer.tokenize("আমি একজন ভালো ছেলে এবং আমি ফুটবল খেলতে পছন্দ করি")
   print(tokens)
   ```
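
3. (Optional) Pair the tokenizer with the base model. This repository extends the tokenizer, not the model weights, so the `google/gemma-2-9b` embedding matrix must be resized to the extended vocabulary; the newly added rows are randomly initialized and would need further training to be useful. A minimal sketch:
   ```python
   from transformers import AutoModelForCausalLM, AutoTokenizer

   tokenizer = AutoTokenizer.from_pretrained("rishiraj/gemma-2-9b-bn")
   model = AutoModelForCausalLM.from_pretrained("google/gemma-2-9b")

   # Grow the input/output embeddings to cover all 392,402 tokens.
   model.resize_token_embeddings(len(tokenizer))
   ```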