---
language:
- hu
- en
- zh
tags:
- text-generation
- puli
license: cc-by-nc-4.0
widget:
- text: Elmesélek egy történetet a nyelvtechnológiáról.
---

# PULI GPTrio (7.67 billion parameters)

For further details, read [our paper](http://real.mtak.hu/173960/1/TSD_2023_GPT.pdf). To test our instruct model, see [our demo site](https://juniper.nytud.hu/demo/gptrio).

  - Hungarian-English-Chinese trilingual GPT-NeoX model (7.67 billion parameters)
  - Trained with EleutherAI's [GPT-NeoX](https://github.com/EleutherAI/gpt-neox)
  - Checkpoint: 410 000 steps

## Dataset

- Hungarian: 41.5 billion words (314 GB)
- English: 61.9 billion words (391 GB)
- Github: 6 million documents (33 GB)
- Chinese: 98.7 billion Chinese characters (340 GB)
  - (plus 12 billion non-Chinese tokens)
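To get a feel for how the training data is balanced, the sizes listed above can be summed and converted to shares. This is only a back-of-the-envelope sketch: it uses the GB figures from the list, and the dictionary and variable names are illustrative, not part of any released tooling.

```python
# Corpus sizes in GB, copied from the dataset list above.
corpus_gb = {"Hungarian": 314, "English": 391, "Github": 33, "Chinese": 340}

total_gb = sum(corpus_gb.values())
# Share of each subcorpus in the total, by size on disk (not by token count).
shares = {name: size / total_gb for name, size in corpus_gb.items()}

print(f"total: {total_gb} GB")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

Note that size on disk is only a rough proxy: the three languages tokenize very differently, so the shares by token count may differ.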

## Limitations

- max_seq_length = 2048
- float16
- vocab size: 150 016
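
The limits above have two practical consequences: the float16 weights alone need a fixed amount of memory, and the prompt plus the generated tokens must fit into 2048 positions. The following is a minimal sketch of both calculations. It takes the parameter count and sequence length from this card; `truncate_prompt` is a hypothetical helper, not part of `transformers`, and the memory figure ignores activations and the attention cache.

```python
# Figures taken from the model card; everything else is illustrative.
N_PARAMS = 7.67e9       # parameter count
BYTES_PER_PARAM = 2     # float16 stores each weight in 2 bytes
MAX_SEQ_LENGTH = 2048   # context window shared by prompt and generated tokens


def weight_memory_gib(n_params=N_PARAMS, bytes_per_param=BYTES_PER_PARAM):
    """Approximate memory for the weights alone, in GiB."""
    return n_params * bytes_per_param / 2**30


def truncate_prompt(token_ids, max_new_tokens, max_seq_length=MAX_SEQ_LENGTH):
    """Hypothetical helper: keep the most recent prompt tokens so that
    the prompt plus max_new_tokens fits into the context window."""
    budget = max_seq_length - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens must be smaller than max_seq_length")
    return token_ids[-budget:]


print(f"float16 weights: ~{weight_memory_gib():.1f} GiB")
# A 3000-token prompt with room for 100 new tokens is cut to 1948 tokens.
print(len(truncate_prompt(list(range(3000)), max_new_tokens=100)))
```

Dropping the oldest tokens is only one possible policy; for some prompts it may be better to keep the beginning (for example, an instruction) and shorten the middle.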


## Citation
If you use this model, please cite the following paper:

```
@inproceedings{yang-puli-gptrio,
    title = {Mono- and multilingual GPT-3 models for Hungarian},
    booktitle = {Text, Speech, and Dialogue},
    year = {2023},
    publisher = {Springer Nature Switzerland},
    series = {Lecture Notes in Computer Science},
    address = {Plzeň, Czech Republic},
    author = {Yang, Zijian Győző and Laki, László János and Váradi, Tamás and Prószéky, Gábor},
    pages = {94--104},
    isbn = {978-3-031-40498-6}
}
```

## Usage

```python
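from transformers import GPTNeoXForCausalLM, AutoTokenizer

# Load the model and its tokenizer from the Hugging Face Hub.
model = GPTNeoXForCausalLM.from_pretrained("NYTK/PULI-GPTrio")
tokenizer = AutoTokenizer.from_pretrained("NYTK/PULI-GPTrio")

# Hungarian prompt: "I will tell a story about language technology."
prompt = "Elmesélek egy történetet a nyelvtechnológiáról."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Sample a continuation; max_length counts the prompt tokens as well.
gen_tokens = model.generate(
    input_ids,
    do_sample=True,
    temperature=0.9,
    max_length=100,
)

# Decode the first (and only) generated sequence back to text.
gen_text = tokenizer.batch_decode(gen_tokens)[0]
print(gen_text)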
```

## Usage with pipeline

```python
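from transformers import pipeline, GPTNeoXForCausalLM, AutoTokenizer

# Load the model and its tokenizer from the Hugging Face Hub.
model = GPTNeoXForCausalLM.from_pretrained("NYTK/PULI-GPTrio")
tokenizer = AutoTokenizer.from_pretrained("NYTK/PULI-GPTrio")

# Hungarian prompt: "I will tell a story about language technology."
prompt = "Elmesélek egy történetet a nyelvtechnológiáról."
generator = pipeline(task="text-generation", model=model, tokenizer=tokenizer)

# The pipeline returns a list of dicts, one per generated sequence.
print(generator(prompt)[0]["generated_text"])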
```

# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)

Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_NYTK__PULI-GPTrio).

| Metric              | Value |
|---------------------|-------|
| Avg.                | 30.07 |
| ARC (25-shot)       | 30.72 |
| HellaSwag (10-shot) | 53.49 |
| MMLU (5-shot)       | 24.73 |
| TruthfulQA (0-shot) | 39.03 |
| Winogrande (5-shot) | 57.77 |
| GSM8K (5-shot)      | 0.76  |
| DROP (3-shot)       | 4.03  |
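
As a quick sanity check on the table, the seven benchmark scores can be averaged directly. This sketch assumes that "Avg." is the unweighted mean of the seven task scores; the card itself does not say how the leaderboard computes it, and the variable names are illustrative.

```python
# Per-task scores copied from the table above.
scores = {
    "ARC (25-shot)": 30.72,
    "HellaSwag (10-shot)": 53.49,
    "MMLU (5-shot)": 24.73,
    "TruthfulQA (0-shot)": 39.03,
    "Winogrande (5-shot)": 57.77,
    "GSM8K (5-shot)": 0.76,
    "DROP (3-shot)": 4.03,
}

# Assumption: the reported average is the plain mean over all seven tasks.
average = sum(scores.values()) / len(scores)
print(f"unweighted mean: {average:.3f}")
```

Under that assumption the mean agrees with the reported 30.07 to within 0.01, so any remaining difference would come from rounding in the last digit.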