---
language:
- el
tags:
- pytorch
- causal-lm
widget:
- text: "Το αγαπημένο μου μέρος είναι"
license: apache-2.0


---
# Greek (el) GPT2 model - small


<img src="https://huggingface.co/lighteternal/gpt2-finetuned-greek-small/raw/main/GPT2el.png" width="600"/>


#### A new version (recommended) trained on 5x more data is available at: https://huggingface.co/lighteternal/gpt2-finetuned-greek 

### By the Hellenic Army Academy (SSE) and the Technical University of Crete (TUC)

* language: el
* license: apache-2.0
* dataset: ~5GB of Greek corpora 
* model: GPT2 (12-layer, 768-hidden, 12-heads, 117M parameters. OpenAI GPT-2 English model, finetuned for the Greek language)
* pre-processing: tokenization + BPE segmentation
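
The BPE step can be illustrated with a toy, framework-free sketch: BPE builds its subword vocabulary by repeatedly merging the most frequent adjacent symbol pair in the corpus. (The actual model reuses GPT-2's byte-level BPE tokenizer; the miniature Greek corpus below is purely illustrative.)

```python
from collections import Counter

def most_frequent_pair(corpus):
    """Count adjacent symbol pairs across all words and return the most common one."""
    pairs = Counter()
    for word in corpus:
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = []
    for word in corpus:
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged.append(out)
    return merged

# toy corpus: words split into characters
corpus = [list("χαρα"), list("χαρτι"), list("χαμος")]
for _ in range(2):  # two merge steps: first "χ"+"α" -> "χα", then "χα"+"ρ" -> "χαρ"
    corpus = merge_pair(corpus, most_frequent_pair(corpus))
```

After the two merges, the first word is segmented as `["χαρ", "α"]`, showing how frequent character sequences collapse into reusable subword units.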

### Model description

A text generation (autoregressive) model, built with Hugging Face Transformers and fastai, based on the English GPT-2 (small).

Finetuned with gradual layer unfreezing, a more efficient and sustainable alternative to training from scratch, especially for low-resource languages.
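
Gradual unfreezing starts by training only the top layers and progressively makes deeper layers trainable as finetuning proceeds. A minimal, framework-free sketch of such a schedule (the stage count and top-down split are illustrative assumptions, not the exact recipe used for this model):

```python
NUM_BLOCKS = 12  # GPT-2 small has 12 transformer blocks

def unfreeze_schedule(num_blocks, stages):
    """Yield, for each training stage, the indices of blocks that are trainable.
    Early stages train only the top of the stack; the final stage trains all blocks."""
    for stage in range(stages):
        # the number of unfrozen blocks grows with the stage index
        n = min(num_blocks, (stage + 1) * num_blocks // stages)
        yield list(range(num_blocks - n, num_blocks))

schedule = list(unfreeze_schedule(NUM_BLOCKS, 3))
# schedule[0] covers only the top third of the stack; schedule[-1] covers all 12 blocks
```

In a real run, each stage would set `requires_grad = True` on the parameters of the listed blocks and continue finetuning from the previous stage's weights.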

Based on the work of Thomas Dehaene (ML6) for the creation of a Dutch GPT2: https://colab.research.google.com/drive/1Y31tjMkB8TqKKFlZ5OJ9fcMp3p8suvs4?usp=sharing


### How to use

```python
from transformers import pipeline

model = "lighteternal/gpt2-finetuned-greek-small"

generator = pipeline(
    'text-generation',
    device=0,  # set device=-1 to run on CPU
    model=model,
    tokenizer=model)

text = "Μια φορά κι έναν καιρό"

print("\n".join([x.get("generated_text") for x in generator(
    text,
    max_length=len(text.split(" ")) + 15,
    do_sample=True,
    top_k=50,
    repetition_penalty=1.2,
    add_special_tokens=False,
    num_return_sequences=5,
    temperature=0.95,
    top_p=0.95)]))
```


## Training data

We used a small (~5GB) sample from a consolidated Greek corpus based on CC100, Wikimatrix, Tatoeba, Books, SETIMES and GlobalVoices. A bigger corpus is expected to provide better results (TODO).



### Acknowledgement 

The research work was supported by the Hellenic Foundation for Research and Innovation (HFRI) under the HFRI PhD Fellowship grant (Fellowship Number: 50, 2nd call).

Based on the work of Thomas Dehaene (ML6): https://blog.ml6.eu/dutch-gpt2-autoregressive-language-modelling-on-a-budget-cff3942dd020