---
tags:
- greek
- tokenization
- bpe
license: mit
language:
- el
---

# Greek Tokenizer

A tokenizer trained from scratch with the BPE algorithm on a Greek corpus.

### Usage:

To use this tokenizer, load it from the Hugging Face Hub:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gsar78/Greek_Tokenizer")
```

### Example:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gsar78/Greek_Tokenizer")

# Tokenize input text
input_text = "Αυτό είναι ένα παράδειγμα."
inputs = tokenizer(input_text, return_tensors="pt")

# Print the tokenized input (IDs and tokens)
print("Token IDs:", inputs["input_ids"].tolist())

# Convert token IDs to tokens
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print("Tokens:", tokens)

# Manually join tokens to form the tokenized string
tokenized_string = ' '.join(tokens)
print("Tokenized String:", tokenized_string)
```

It can also be used as a head start for pretraining a GPT-2 base model on the Greek language; a minimal initialization sketch is included at the end of this card.

## Training Details

Vocabulary size: 52,000

Special tokens: [PAD], [UNK], [CLS], [SEP], [MASK]

## Benefits and Why to use:

Many generic tokenizers split words into multiple tokens. This tokenizer is efficient and only splits words that it was not trained on. In the example above, its output is only five tokens, while other tokenizers (e.g. *Llama-3*) produce nine or more. This can have an impact on inference costs and downstream applications.

## Update July 2024:

A new version is imminent that further decreases fertility for the Greek and English languages.
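
### Token-count comparison:

To check the token-count benefit described in the Benefits section, a quick comparison can be run locally. This is only a sketch: the `gpt2` tokenizer is used as a stand-in for a generic, non-Greek-focused tokenizer (the Llama-3 tokenizer mentioned above is gated and is not assumed to be available).

```python
from transformers import AutoTokenizer

text = "Αυτό είναι ένα παράδειγμα."

greek_tok = AutoTokenizer.from_pretrained("gsar78/Greek_Tokenizer")
# "gpt2" is only a stand-in for a generic tokenizer; swap in any other
# tokenizer you have access to for a different comparison.
generic_tok = AutoTokenizer.from_pretrained("gpt2")

print("Greek tokenizer:  ", len(greek_tok.tokenize(text)), "tokens")
print("Generic tokenizer:", len(generic_tok.tokenize(text)), "tokens")
```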
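
### Reproducing the training setup:

For reference, a tokenizer with the training details listed above (BPE, 52,000-token vocabulary, the same special tokens) could be trained roughly as follows with the Hugging Face `tokenizers` library. This is only a sketch: the corpus file name, the whitespace pre-tokenizer, and the absence of normalization are assumptions, since the actual training script and data are not published here.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Hypothetical corpus file; the corpus used for this tokenizer is not published.
files = ["greek_corpus.txt"]

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()  # assumed pre-tokenization

# Vocabulary size and special tokens taken from the Training Details above.
trainer = trainers.BpeTrainer(
    vocab_size=52000,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.train(files, trainer)
tokenizer.save("greek_tokenizer.json")
```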
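
### Pretraining a GPT-2 model with this tokenizer:

As mentioned above, the tokenizer can serve as a head start for pretraining a Greek GPT-2 base model. The sketch below only sizes a fresh GPT-2 configuration to this tokenizer's vocabulary; the layer sizes and the mapping of [CLS]/[SEP]/[PAD] to BOS/EOS/PAD are illustrative assumptions, not settings published with this repository.

```python
from transformers import AutoTokenizer, GPT2Config, GPT2LMHeadModel

tokenizer = AutoTokenizer.from_pretrained("gsar78/Greek_Tokenizer")

# Illustrative GPT-2-small hyperparameters; only vocab_size comes from this
# tokenizer. Mapping [CLS]/[SEP]/[PAD] to BOS/EOS/PAD is an assumption.
config = GPT2Config(
    vocab_size=len(tokenizer),
    n_positions=1024,
    n_embd=768,
    n_layer=12,
    n_head=12,
    bos_token_id=tokenizer.cls_token_id,
    eos_token_id=tokenizer.sep_token_id,
    pad_token_id=tokenizer.pad_token_id,
)

# Randomly initialized model, ready for pretraining on a Greek corpus.
model = GPT2LMHeadModel(config)
print(f"Parameters: {model.num_parameters():,}")
```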