---
license: llama2
language:
- hu
- en
tags:
- puli
---

# PULI LlumiX 32K (6.74 billion parameters)

For further details, or to try our instruct model, see [our demo site](https://puli.nytud.hu/puli-llumix-instruct).

- Trained with OpenChatKit [github](https://github.com/togethercomputer/OpenChatKit)
- The [LLaMA-2-7B-32K](https://huggingface.co/togethercomputer/LLaMA-2-7B-32K) model was continually pretrained on a Hungarian dataset
- The model has been extended to a context length of 32K with position interpolation
- Checkpoint: 100 000 steps

## Dataset for continued pretraining

- Hungarian: 7.9 billion words; 763K documents that each exceed 5000 words in length
- English: Long Context QA (2 billion words), BookSum (78 million words)

## Limitations

- max_seq_length = 32 768
- float16
- vocab size: 32 000

## Usage with pipeline

```python
from transformers import pipeline, LlamaForCausalLM, LlamaTokenizer

# Load the model and tokenizer from the Hugging Face Hub
model = LlamaForCausalLM.from_pretrained("NYTK/PULI-LlumiX-32K")
tokenizer = LlamaTokenizer.from_pretrained("NYTK/PULI-LlumiX-32K")

# Hungarian prompt: "I will tell a story about language technology."
prompt = "Elmesélek egy történetet a nyelvtechnológiáról."
generator = pipeline(task="text-generation", model=model, tokenizer=tokenizer)

print(generator(prompt, max_new_tokens=30)[0]["generated_text"])
```
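Because the model was trained in float16 with a 32 768-token context window, longer inputs can be handled in a single pass. The sketch below is a minimal example, not part of the original card: it assumes the same `NYTK/PULI-LlumiX-32K` checkpoint plus the `torch` and `accelerate` packages (the latter for `device_map="auto"`), loads the weights in float16, and checks that the tokenized prompt plus the requested generation fits inside the 32K window before calling `generate`.

```python
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

MODEL_ID = "NYTK/PULI-LlumiX-32K"
MAX_SEQ_LENGTH = 32_768  # context window stated above
MAX_NEW_TOKENS = 200

tokenizer = LlamaTokenizer.from_pretrained(MODEL_ID)
model = LlamaForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,  # checkpoint precision noted above; halves memory vs. float32
    device_map="auto",          # requires accelerate; places weights on available GPU(s)
)

# Placeholder: substitute any long Hungarian document you want the model to continue.
long_document = "Elmesélek egy történetet a nyelvtechnológiáról."

inputs = tokenizer(long_document, return_tensors="pt").to(model.device)
n_tokens = inputs["input_ids"].shape[1]
assert n_tokens + MAX_NEW_TOKENS <= MAX_SEQ_LENGTH, "prompt plus generation must fit in the 32K window"

output_ids = model.generate(**inputs, max_new_tokens=MAX_NEW_TOKENS, do_sample=True, top_p=0.9)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```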