---
tags:
- greek
- tokenization
- bpe
license: mit
language:
- el
---

# Greek Tokenizer

A tokenizer trained from scratch with the BPE algorithm on a Greek corpus.

### Usage:

To use this tokenizer, load it from the Hugging Face Hub:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gsar78/Greek_Tokenizer")
```

### Example:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gsar78/Greek_Tokenizer")

# Tokenize input text
input_text = "Αυτό είναι ένα παράδειγμα."
inputs = tokenizer(input_text, return_tensors="pt")

# Print the tokenized input (IDs and tokens)
print("Token IDs:", inputs["input_ids"].tolist())

# Convert token IDs to tokens
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print("Tokens:", tokens)

# Manually join tokens to form the tokenized string
tokenized_string = ' '.join(tokens)
print("Tokenized String:", tokenized_string)
```

It can also be used as a head start for pretraining a GPT-2 base model on the Greek language; a minimal initialization sketch is included at the end of this card.

## Training Details

Vocabulary size: 52,000

Special tokens: [PAD], [UNK], [CLS], [SEP], [MASK]

## Benefits and Why to use:

Many generic tokenizers split words into multiple tokens. This tokenizer is efficient and only splits words that it was not trained on. In the example above, its output is only five tokens, while other tokenizers (e.g. *Llama-3*) produce nine or more. This can have an impact on inference costs and downstream applications.

## Update July 2024:

A new version is imminent that further decreases fertility for the Greek and English languages.
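
### Token-count comparison:

To check the token-count benefit described in the Benefits section, a quick comparison can be run locally. This is only a sketch: the `gpt2` tokenizer is used as a stand-in for a generic, non-Greek-focused tokenizer (the Llama-3 tokenizer mentioned above is gated and is not assumed to be available).

```python
from transformers import AutoTokenizer

text = "Αυτό είναι ένα παράδειγμα."

greek_tok = AutoTokenizer.from_pretrained("gsar78/Greek_Tokenizer")
# "gpt2" is only a stand-in for a generic tokenizer; swap in any other
# tokenizer you have access to for a different comparison.
generic_tok = AutoTokenizer.from_pretrained("gpt2")

print("Greek tokenizer:  ", len(greek_tok.tokenize(text)), "tokens")
print("Generic tokenizer:", len(generic_tok.tokenize(text)), "tokens")
```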
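
### Reproducing the training setup:

For reference, a tokenizer with the training details listed above (BPE, 52,000-token vocabulary, the same special tokens) could be trained roughly as follows with the Hugging Face `tokenizers` library. This is only a sketch: the corpus file name, the whitespace pre-tokenizer, and the absence of normalization are assumptions, since the actual training script and data are not published here.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Hypothetical corpus file; the corpus used for this tokenizer is not published.
files = ["greek_corpus.txt"]

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()  # assumed pre-tokenization

# Vocabulary size and special tokens taken from the Training Details above.
trainer = trainers.BpeTrainer(
    vocab_size=52000,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.train(files, trainer)
tokenizer.save("greek_tokenizer.json")
```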
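
### Pretraining a GPT-2 model with this tokenizer:

As mentioned above, the tokenizer can serve as a head start for pretraining a Greek GPT-2 base model. The sketch below only sizes a fresh GPT-2 configuration to this tokenizer's vocabulary; the layer sizes and the mapping of [CLS]/[SEP]/[PAD] to BOS/EOS/PAD are illustrative assumptions, not settings published with this repository.

```python
from transformers import AutoTokenizer, GPT2Config, GPT2LMHeadModel

tokenizer = AutoTokenizer.from_pretrained("gsar78/Greek_Tokenizer")

# Illustrative GPT-2-small hyperparameters; only vocab_size comes from this
# tokenizer. Mapping [CLS]/[SEP]/[PAD] to BOS/EOS/PAD is an assumption.
config = GPT2Config(
    vocab_size=len(tokenizer),
    n_positions=1024,
    n_embd=768,
    n_layer=12,
    n_head=12,
    bos_token_id=tokenizer.cls_token_id,
    eos_token_id=tokenizer.sep_token_id,
    pad_token_id=tokenizer.pad_token_id,
)

# Randomly initialized model, ready for pretraining on a Greek corpus.
model = GPT2LMHeadModel(config)
print(f"Parameters: {model.num_parameters():,}")
```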