Greek Tokenizer
A tokenizer trained from scratch with the BPE algorithm on a Greek corpus.
Usage:
To use this tokenizer, you can load it from the Hugging Face Hub:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gsar78/Greek_Tokenizer")
Example:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gsar78/Greek_Tokenizer")
# Tokenize input text
input_text = "Αυτό είναι ένα παράδειγμα."
inputs = tokenizer(input_text, return_tensors="pt")
# Print the tokenized input (IDs and tokens)
print("Token IDs:", inputs["input_ids"].tolist())
# Convert token IDs to tokens
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print("Tokens:", tokens)
# Manually join tokens to form the tokenized string
tokenized_string = ' '.join(tokens)
print("Tokenized String:", tokenized_string)
It can also be used as a head start for pretraining a GPT-2 base model on the Greek language.
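As a minimal sketch of that use case, the snippet below builds a randomly initialized GPT-2 model whose embedding table is sized to this tokenizer's vocabulary. The mapping of [CLS]/[SEP] to BOS/EOS is an assumption for illustration; adjust it to your own pretraining setup.

from transformers import AutoTokenizer, GPT2Config, GPT2LMHeadModel

tokenizer = AutoTokenizer.from_pretrained("gsar78/Greek_Tokenizer")

# Fresh GPT-2 config aligned with the tokenizer's vocabulary size.
config = GPT2Config(
    vocab_size=len(tokenizer),
    bos_token_id=tokenizer.cls_token_id,  # assumed mapping; change if needed
    eos_token_id=tokenizer.sep_token_id,  # assumed mapping; change if needed
    pad_token_id=tokenizer.pad_token_id,
)

# Randomly initialized model, ready for pretraining on a Greek corpus.
model = GPT2LMHeadModel(config)
print("Parameters:", model.num_parameters())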
Training Details
Vocabulary Size: 52000
Special Tokens: [PAD], [UNK], [CLS], [SEP], [MASK]
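These details can be verified directly from the loaded tokenizer using the standard transformers attributes, for example:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gsar78/Greek_Tokenizer")

# Check the vocabulary size and the registered special tokens.
print("Vocabulary size:", tokenizer.vocab_size)
print("Special tokens:", tokenizer.special_tokens_map)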
Benefits and Why to use:
Many generic tokenizers split words into multiple tokens.
This tokenizer is efficient and only splits words it was not trained on.
In the example above, the output of this tokenizer is only five tokens, while another tokenizer, e.g. Llama-3, produces 9 or more tokens.
This can have an impact on inference costs and downstream applications.
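A quick way to check this on your own text is to count tokens side by side with another tokenizer. The sketch below uses xlm-roberta-base purely as an illustrative multilingual baseline (any other tokenizer you have access to works the same way):

from transformers import AutoTokenizer

text = "Αυτό είναι ένα παράδειγμα."

greek_tok = AutoTokenizer.from_pretrained("gsar78/Greek_Tokenizer")
baseline_tok = AutoTokenizer.from_pretrained("xlm-roberta-base")  # illustrative baseline only

# Compare how many tokens each tokenizer produces for the same sentence.
print("Greek tokenizer:   ", len(greek_tok.tokenize(text)), "tokens")
print("Baseline tokenizer:", len(baseline_tok.tokenize(text)), "tokens")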
Update
July 2024:
A new version is imminent that further decreases token fertility for Greek and English.