eryk-mazus committed
Commit: a937b2a · Parent(s): 428a689
Update README.md

README.md CHANGED
---
license: apache-2.0
datasets:
- allenai/MADLAD-400
- eryk-mazus/polka-pretrain-en-pl-v1
language:
- pl
- en
pipeline_tag: text-generation
---

![image/png](https://cdn-uploads.huggingface.co/production/uploads/61bf0e11c88f3fd22f654059/EMSrPEzAFkjY9nvbaJoC3.png)

# Polka-1.1b

`polka-1.1b` takes the [TinyLlama-1.1B](https://huggingface.co/TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T) model and enhances it through continued pretraining on an additional **5.7 billion Polish tokens**, sourced primarily from the [MADLAD-400](https://arxiv.org/abs/2309.04662) dataset. The tokens were sampled at a 10:1 ratio of Polish to English shards using [DSIR](https://github.com/p-lambda/dsir). Polka also extends the TinyLlama tokenizer's vocabulary to 43,882 tokens, making it more efficient at generating Polish text.
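
The README doesn't show how the vocabulary was extended, but as a rough sketch of the standard approach with `transformers` (the new-token list below is a made-up placeholder, not Polka's actual added vocabulary), it looks like this:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Placeholder Polish subwords; the real extension brought the
# vocabulary up to 43,882 entries.
new_tokens = ["ą", "ę", "ł", "ż", "nie", "prze", "czy"]
num_added = tokenizer.add_tokens(new_tokens)  # skips tokens already present

# Grow the embedding (and LM head) matrices to the new vocab size;
# the added rows start randomly initialized and are learned during
# the continued pretraining on Polish text.
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens, vocab size: {len(tokenizer)}")
```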

The training took 425 GPU hours on a single machine with 8 x RTX 4090 GPUs, using DeepSpeed ZeRO-2.
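
ZeRO-2 shards optimizer states and gradients across the 8 GPUs while each GPU keeps a full copy of the parameters. A minimal config of the kind such a run might use (all values below are illustrative, not the settings used for polka-1.1b) could look like:

```python
import json

# Illustrative ZeRO stage 2 config; batch sizes and precision are
# placeholders, not the hyperparameters used for polka-1.1b.
ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 4,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                  # shard optimizer state and gradients
        "overlap_comm": True,        # overlap gradient reduction with backward
        "contiguous_gradients": True,
    },
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```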

## Notes

...

## Sample code

```python
...
```
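
The sample code is left as a placeholder in this commit. Purely as an assumed sketch, loading and prompting the model with `transformers` might look like the following (the repo id `eryk-mazus/polka-1.1b`, the prompt, and the generation settings are guesses, not taken from the README):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repo id; check the model card for the canonical one.
model_id = "eryk-mazus/polka-1.1b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Example Polish prompt: "The capital of Poland is"
prompt = "Stolicą Polski jest"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```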