|
--- |
|
license: apache-2.0 |
|
datasets: |
|
- allenai/MADLAD-400 |
|
- eryk-mazus/polka-pretrain-en-pl-v1 |
|
language: |
|
- pl |
|
- en |
|
pipeline_tag: text-generation |
|
--- |
|
|
|
![image/png](https://cdn-uploads.huggingface.co/production/uploads/61bf0e11c88f3fd22f654059/EMSrPEzAFkjY9nvbaJoC3.png) |
|
|
|
# Polka-1.1b |
|
|
|
|
|
`polka-1.1b` takes the [TinyLlama-1.1B](https://huggingface.co/TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T) model and continues its pretraining on an additional **5.7 billion Polish tokens**, sourced primarily from the [MADLAD-400](https://arxiv.org/abs/2309.04662) dataset. The tokens were sampled at a 10:1 ratio between the Polish and English shards using [DSIR](https://github.com/p-lambda/dsir). Polka also extends the TinyLlama tokenizer's vocabulary to 43,882 tokens, improving its efficiency for generating Polish text.
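
As a rough illustration of what the vocabulary extension buys, the sketch below compares how the base TinyLlama tokenizer and the extended Polka tokenizer encode the same Polish sentence; the `eryk-mazus/polka-1.1b` repository id and the example sentence are assumptions made for illustration.

```python
# Sketch comparing vocabulary sizes and token counts on a Polish sentence.
# The "eryk-mazus/polka-1.1b" repository id is an assumption for illustration.
from transformers import AutoTokenizer

base = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T")
polka = AutoTokenizer.from_pretrained("eryk-mazus/polka-1.1b")

print(len(base), len(polka))  # 32000 vs. 43882 entries after the extension

text = "Wczoraj wieczorem spotkaliśmy się z przyjaciółmi w kawiarni."
print(len(base(text).input_ids), len(polka(text).input_ids))
# The extended vocabulary should encode Polish text with noticeably fewer tokens.
```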
|
|
|
The training took 425 RTX 4090 GPU hours on a single 8 x RTX 4090 machine with DeepSpeed ZeRO-2. |
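
For context, a DeepSpeed ZeRO stage-2 setup of this kind can be expressed as a small config dict passed to the Hugging Face `Trainer`; the values below (precision, batch sizes, clipping) are illustrative assumptions, not the exact settings used for this run.

```python
# Illustrative DeepSpeed ZeRO-2 configuration; the concrete values are
# assumptions, not the exact settings used to train polka-1.1b.
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                 # shard optimizer states and gradients across GPUs
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "gradient_clipping": 1.0,
}

# With the Hugging Face Trainer, such a dict can be passed via
# TrainingArguments(deepspeed=ds_config, ...).
```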
|
|
|
## Notes |
|
|
|
... |
|
|
|
## Sample code |
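
The snippet below is a minimal text-generation sketch using the 🤗 Transformers pipeline API; the `eryk-mazus/polka-1.1b` repository id, the prompt, and the sampling settings are illustrative assumptions rather than fixed recommendations.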
|
|
|
```python
# A minimal generation sketch; the repository id, prompt, and sampling
# settings below are illustrative assumptions.
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="eryk-mazus/polka-1.1b",  # assumed repository id
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

out = pipe("Przepis na tradycyjny bigos:", max_new_tokens=64, do_sample=True, temperature=0.7)
print(out[0]["generated_text"])
```