---
language:
- el
tags:
- pytorch
- causal-lm
widget:
- text: "Το αγαπημένο μου μέρος είναι"
license: apache-2.0
---
# Greek (el) GPT2 model - small
<img src="https://huggingface.co/lighteternal/gpt2-finetuned-greek-small/raw/main/GPT2el.png" width="600"/>
#### A new version (recommended) trained on 5x more data is available at: https://huggingface.co/lighteternal/gpt2-finetuned-greek
### By the Hellenic Army Academy (SSE) and the Technical University of Crete (TUC)
* language: el
* license: apache-2.0
* dataset: ~5GB of Greek corpora
* model: GPT-2 small (12 layers, 768 hidden units, 12 attention heads, 117M parameters); the OpenAI English GPT-2 model, fine-tuned for the Greek language
* pre-processing: tokenization + BPE segmentation
### Model description
An autoregressive text generation model based on the English GPT-2 (small), built with Hugging Face Transformers and fastai.

The model was fine-tuned with gradual layer unfreezing: training starts with most of the network frozen and progressively unfreezes more layers. This is a more efficient and sustainable alternative to training from scratch, especially for low-resource languages.
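
To make the idea of gradual layer unfreezing concrete, here is a minimal, framework-agnostic sketch. It is not the training code used for this model: the split into three layer groups, the helper names `set_trainable` and `gradual_unfreeze`, and the stand-in parameter objects are all assumptions made for illustration. With a real PyTorch model, each group would be a list of parameters exposing the same `requires_grad` attribute.

```python
from types import SimpleNamespace


def set_trainable(params, flag):
    """Mark every parameter in `params` as trainable (True) or frozen (False)."""
    for p in params:
        p.requires_grad = flag


def gradual_unfreeze(layer_groups):
    """Yield once per stage, unfreezing one more group each time, last group first.

    `layer_groups` is ordered from input (embeddings) to output (head).
    """
    for group in layer_groups:
        set_trainable(group, False)  # start with everything frozen
    for stage in range(1, len(layer_groups) + 1):
        set_trainable(layer_groups[-stage], True)
        yield stage  # the caller would train for a few epochs here


# Hypothetical stand-in parameters so the sketch runs without PyTorch installed.
groups = [[SimpleNamespace(requires_grad=True) for _ in range(2)] for _ in range(3)]

history = []
for stage in gradual_unfreeze(groups):
    # Record which groups are fully trainable at this stage.
    history.append([all(p.requires_grad for p in g) for g in groups])

for stage, flags in enumerate(history, start=1):
    print(f"stage {stage}: trainable groups = {flags}")
```

Each stage trains a slightly larger part of the network, so the early stages are cheap and the pretrained lower layers are only adjusted once the upper layers have adapted to Greek.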

Based on the work of Thomas Dehaene (ML6) for the creation of a Dutch GPT2: https://colab.research.google.com/drive/1Y31tjMkB8TqKKFlZ5OJ9fcMp3p8suvs4?usp=sharing
### How to use
```
from transformers import pipeline

model = "lighteternal/gpt2-finetuned-greek-small"

# device=0 selects the first GPU; use device=-1 to run on CPU
generator = pipeline(
    "text-generation",
    device=0,
    model=model,
    tokenizer=model,
)

text = "Μια φορά κι έναν καιρό"

# Sample five continuations; max_length is the prompt's word count plus 15
outputs = generator(
    text,
    max_length=len(text.split(" ")) + 15,
    do_sample=True,
    top_k=50,
    repetition_penalty=1.2,
    add_special_tokens=False,
    num_return_sequences=5,
    temperature=0.95,
    top_p=0.95,
)
print("\n".join(x.get("generated_text") for x in outputs))
```
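
The `temperature`, `top_k` and `top_p` arguments above control how the next token is sampled. The following is a simplified pure-Python sketch of that filtering, assuming the usual order (temperature, then top-k, then nucleus/top-p); it is not the Transformers implementation, and the function name `filter_distribution` and the toy logits are hypothetical.

```python
import math
import random


def filter_distribution(logits, temperature=0.95, top_k=50, top_p=0.95):
    """Return {token_id: probability} after temperature, top-k and top-p filtering."""
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep only the k most likely tokens and renormalise over them.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    k_mass = sum(probs[i] for i in ranked)

    # Top-p (nucleus): keep the smallest prefix whose cumulative mass reaches top_p.
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i] / k_mass
        if mass >= top_p:
            break

    # Renormalise the surviving tokens so they sum to 1.
    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}


# Hypothetical scores for a 5-token vocabulary.
toy_logits = [4.0, 3.0, 2.0, 0.0, -1.0]
dist = filter_distribution(toy_logits, top_k=3, top_p=0.9)

# Draw the next token from the filtered distribution.
next_token = random.choices(list(dist), weights=list(dist.values()))[0]
print(dist, next_token)
```

Lower `temperature` sharpens the distribution, while smaller `top_k` or `top_p` values cut off more of the unlikely tail, trading diversity for coherence.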
### Training data
We used a small (~5GB) sample from a consolidated Greek corpus based on CC100, WikiMatrix, Tatoeba, Books, SETIMES and GlobalVoices. A bigger corpus is expected to provide better results (TODO).
### Acknowledgement
The research work was supported by the Hellenic Foundation for Research and Innovation (HFRI) under the HFRI PhD Fellowship grant (Fellowship Number: 50, 2nd call).

Based on the work of Thomas Dehaene (ML6): https://blog.ml6.eu/dutch-gpt2-autoregressive-language-modelling-on-a-budget-cff3942dd020