---
language:
- el
tags:
- pytorch
- causal-lm
widget:
- text: "Το αγαπημένο μου μέρος είναι"
license: apache-2.0
---

# Greek (el) GPT2 model - small

### By the Hellenic Army Academy (SSE) and the Technical University of Crete (TUC)

* language: el
* license: apache-2.0
* dataset: ~5GB of Greek corpora
* model: GPT2 (12-layer, 768-hidden, 12-heads, 117M parameters): the OpenAI GPT-2 English model, finetuned for the Greek language
* pre-processing: tokenization + BPE segmentation (an illustrative tokenizer check appears at the end of this card)

### Model description

A text-generation (autoregressive) model built with Hugging Face Transformers and fastai, based on the English GPT-2 (small) and finetuned with gradual layer unfreezing (an illustrative sketch of this approach is given at the end of this card). This is a more efficient and sustainable alternative to training from scratch, especially for low-resource languages.

Based on the work of Thomas Dehaene (ML6) for the creation of a Dutch GPT2: https://colab.research.google.com/drive/1Y31tjMkB8TqKKFlZ5OJ9fcMp3p8suvs4?usp=sharing

### How to use

```
from transformers import pipeline

model = "lighteternal/gpt2-finetuned-greek-small"

# device=0 places the pipeline on the first GPU; use device=-1 to run on CPU
generator = pipeline(
    'text-generation',
    device=0,
    model=model,
    tokenizer=model)

text = "Μια φορά κι έναν καιρό"  # "Once upon a time"

# Sample 5 continuations of roughly 15 words beyond the prompt
print("\n".join([x.get("generated_text") for x in generator(
    text,
    max_length=len(text.split(" ")) + 15,
    do_sample=True,
    top_k=50,
    repetition_penalty=1.2,
    add_special_tokens=False,
    num_return_sequences=5,
    temperature=0.95,
    top_p=0.95)]))
```

## Training data

We used a small (~5GB) sample from a consolidated Greek corpus drawn from CC100, WikiMatrix, Tatoeba, Books, SETIMES and GlobalVoices.

### BibTeX entry and citation info

Based on the work of Thomas Dehaene (ML6): https://blog.ml6.eu/dutch-gpt2-autoregressive-language-modelling-on-a-budget-cff3942dd020
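To see the BPE segmentation mentioned under pre-processing, the model's tokenizer can be inspected directly. This is a minimal illustration; the example sentence is arbitrary and not part of the training setup.

```
# Inspect the BPE sub-word segmentation used by the model's tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("lighteternal/gpt2-finetuned-greek-small")
tokens = tokenizer.tokenize("Μια φορά κι έναν καιρό")   # "Once upon a time"
print(tokens)                                   # BPE sub-word pieces
print(tokenizer.convert_tokens_to_ids(tokens))  # corresponding vocabulary ids
```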
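The gradual layer unfreezing mentioned in the model description can be illustrated with a minimal sketch. The model itself was trained with Hugging Face Transformers and fastai; the plain-PyTorch structure below, the unfreezing schedule, the optimizer and the learning rate are assumptions for illustration only, not the authors' training code.

```
# Illustrative sketch of gradual layer unfreezing (NOT the original training code).
# The unfreezing schedule, optimizer and learning rate are assumptions.
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")  # English GPT-2 small: 12 transformer blocks

def unfreeze_last_n_blocks(model, n):
    """Freeze all weights, then unfreeze the last n transformer blocks,
    the final layer norm and the (tied) token embeddings / LM head."""
    for p in model.parameters():
        p.requires_grad = False
    for block in model.transformer.h[-n:]:
        for p in block.parameters():
            p.requires_grad = True
    for p in model.transformer.ln_f.parameters():
        p.requires_grad = True
    for p in model.transformer.wte.parameters():  # lm_head shares these weights
        p.requires_grad = True

# Train progressively deeper slices of the network (hypothetical schedule).
for n_unfrozen in (1, 4, 8, 12):
    unfreeze_last_n_blocks(model, n_unfrozen)
    optimizer = torch.optim.AdamW(
        (p for p in model.parameters() if p.requires_grad), lr=5e-5)
    # ... run one or more epochs over the Greek corpus with this optimizer ...
```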