---
language:
- hu
- en
- zh
tags:
- text-generation
- puli
license: cc-by-nc-4.0
widget:
- text: Elmesélek egy történetet a nyelvtechnológiáról.
---
# PULI GPTrio (7.67 billion parameters)
For further details, read [our paper](http://real.mtak.hu/173960/1/TSD_2023_GPT.pdf). To test our instruct model, see [our demo site](https://juniper.nytud.hu/demo/gptrio).
- Hungarian-English-Chinese trilingual GPT-NeoX model (7.67 billion parameters)
- Trained with EleutherAI's GPT-NeoX [github](https://github.com/EleutherAI/gpt-neox)
- Checkpoint: 410 000 steps
## Dataset
- Hungarian: 41.5 billion words (314 GB)
- English: 61.9 billion words (391 GB)
- Github: 6 million documents (33 GB)
- Chinese: 98.7 billion Chinese characters (340 GB)
- (12 billion non-Chinese tokens)
## Limitations
- max_seq_length = 2048
- float16
- vocab size: 150 016
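
The limits above translate into two practical numbers: roughly how much memory the float16 weights need, and how many tokens remain for generation once the prompt is counted. The following is a minimal sketch of that arithmetic. It covers the weights only (activations and cache are not included), and `max_new_tokens_for` is a hypothetical helper name used for illustration, not part of any library.

```python
# Figures taken from this model card.
NUM_PARAMETERS = 7.67e9      # 7.67 billion parameters
BYTES_PER_PARAM_FP16 = 2     # float16 stores each weight in 2 bytes
MAX_SEQ_LENGTH = 2048        # max_seq_length from the Limitations list


def fp16_weight_gigabytes(num_parameters: float = NUM_PARAMETERS) -> float:
    """Approximate size of the weights alone, in GB (10**9 bytes)."""
    return num_parameters * BYTES_PER_PARAM_FP16 / 1e9


def max_new_tokens_for(prompt_tokens: int,
                       max_seq_length: int = MAX_SEQ_LENGTH) -> int:
    """Hypothetical helper: tokens left for generation after the prompt."""
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    return max(0, max_seq_length - prompt_tokens)


if __name__ == "__main__":
    print(f"float16 weights: about {fp16_weight_gigabytes():.2f} GB")
    print(f"budget after a 48-token prompt: {max_new_tokens_for(48)} tokens")
```

A prompt that already fills the 2048-token window leaves a budget of zero, so longer inputs need to be truncated before generation.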
## Citation
If you use this model, please cite the following paper:
```
@inproceedings {yang-puli-gptrio,
title = {Mono- and multilingual GPT-3 models for Hungarian},
booktitle = {Text, Speech, and Dialogue},
year = {2023},
publisher = {Springer Nature Switzerland},
series = {Lecture Notes in Computer Science},
address = {Plzeň, Czech Republic},
author = {Yang, Zijian Győző and Laki, László János and Váradi, Tamás and Prószéky, Gábor},
pages = {94--104},
isbn = {978-3-031-40498-6}
}
```
## Usage
```python
from transformers import GPTNeoXForCausalLM, AutoTokenizer

# Load the model weights and the matching tokenizer from the Hugging Face Hub.
model = GPTNeoXForCausalLM.from_pretrained("NYTK/PULI-GPTrio")
tokenizer = AutoTokenizer.from_pretrained("NYTK/PULI-GPTrio")

# Hungarian prompt: "I will tell a story about language technology."
prompt = "Elmesélek egy történetet a nyelvtechnológiáról."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Sample a continuation; max_length counts the prompt tokens as well.
gen_tokens = model.generate(
    input_ids,
    do_sample=True,
    temperature=0.9,
    max_length=100,
)
gen_text = tokenizer.batch_decode(gen_tokens)[0]
print(gen_text)
```
## Usage with pipeline
```python
from transformers import pipeline, GPTNeoXForCausalLM, AutoTokenizer

model = GPTNeoXForCausalLM.from_pretrained("NYTK/PULI-GPTrio")
tokenizer = AutoTokenizer.from_pretrained("NYTK/PULI-GPTrio")

# Hungarian prompt: "I will tell a story about language technology."
prompt = "Elmesélek egy történetet a nyelvtechnológiáról."

# The pipeline wraps tokenization, generation, and decoding in one call.
generator = pipeline(task="text-generation", model=model, tokenizer=tokenizer)
print(generator(prompt)[0]["generated_text"])
```
# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_NYTK__PULI-GPTrio).
| Metric | Value |
|-----------------------|---------------------------|
| Avg. | 30.07 |
| ARC (25-shot) | 30.72 |
| HellaSwag (10-shot) | 53.49 |
| MMLU (5-shot) | 24.73 |
| TruthfulQA (0-shot) | 39.03 |
| Winogrande (5-shot) | 57.77 |
| GSM8K (5-shot) | 0.76 |
| DROP (3-shot) | 4.03 |
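
As a quick consistency check, the `Avg.` row can be compared with the mean of the seven benchmark scores. The sketch below assumes that `Avg.` is the unweighted arithmetic mean of those rows; the table itself does not state how it was computed. Under that assumption the recomputed mean agrees with the reported 30.07 to within 0.01, a gap small enough to be explained by rounding of the individual scores.

```python
# Scores copied from the table above (the Avg. row is excluded).
scores = {
    "ARC (25-shot)": 30.72,
    "HellaSwag (10-shot)": 53.49,
    "MMLU (5-shot)": 24.73,
    "TruthfulQA (0-shot)": 39.03,
    "Winogrande (5-shot)": 57.77,
    "GSM8K (5-shot)": 0.76,
    "DROP (3-shot)": 4.03,
}
REPORTED_AVG = 30.07  # value of the Avg. row

# Assumption: Avg. is the plain unweighted mean of the seven scores.
mean_score = sum(scores.values()) / len(scores)
difference = abs(mean_score - REPORTED_AVG)

print(f"recomputed mean: {mean_score:.4f}")
print(f"difference from reported Avg.: {difference:.4f}")
```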