---
license: apache-2.0
datasets:
- allenai/MADLAD-400
- eryk-mazus/polka-pretrain-en-pl-v1
language:
- pl
- en
pipeline_tag: text-generation
---

![image/png](https://cdn-uploads.huggingface.co/production/uploads/61bf0e11c88f3fd22f654059/EMSrPEzAFkjY9nvbaJoC3.png)

# Polka-1.1b

`polka-1.1b` takes the [TinyLlama-1.1B](https://huggingface.co/TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T) model and enhances it by continuing pretraining on an additional **5.7 billion Polish tokens**, primarily sourced from the [MADLAD-400](https://arxiv.org/abs/2309.04662) dataset. The tokens were sampled in a 10:1 ratio between Polish and English shards using [DSIR](https://github.com/p-lambda/dsir). Furthermore, Polka extends the TinyLlama tokenizer's vocabulary to 43,882 tokens, improving its efficiency for generating Polish text.
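
The effect of the extended vocabulary is easy to check by tokenizing the same Polish sentence with both tokenizers. A minimal sketch, assuming this model lives at `eryk-mazus/polka-1.1b` (the TinyLlama repo ID is the one linked above):

```python
from transformers import AutoTokenizer

# Compare token counts for the same Polish sentence under both vocabularies.
base = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T")
polka = AutoTokenizer.from_pretrained("eryk-mazus/polka-1.1b")  # assumed repo ID

text = "Wczoraj wieczorem spotkaliśmy się z przyjaciółmi na kolacji."
print(len(base.tokenize(text)), len(polka.tokenize(text)))  # polka should need fewer tokens
```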

Training took 425 RTX 4090 GPU-hours on a single machine with 8 x RTX 4090 GPUs, using DeepSpeed ZeRO-2.
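
For context, ZeRO-2 partitions optimizer states and gradients across the 8 GPUs. A sketch of the shape of such a DeepSpeed config, with illustrative values only (the actual batch sizes and precision used for Polka are not stated here):

```python
# Illustrative DeepSpeed ZeRO-2 config; all values are assumptions,
# not the settings actually used to train polka-1.1b.
ds_config = {
    "train_micro_batch_size_per_gpu": 8,   # assumed
    "gradient_accumulation_steps": 4,      # assumed
    "bf16": {"enabled": True},             # assumed precision
    "zero_optimization": {
        "stage": 2,                  # ZeRO-2: shard optimizer states and gradients
        "overlap_comm": True,        # overlap gradient reduction with the backward pass
        "contiguous_gradients": True,
    },
}
```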

## Notes

...

## Sample code

```python
...
```
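
A minimal generation sketch using the standard Hugging Face `transformers` API; the repo ID, prompt, and sampling settings below are assumptions, not an official example:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "eryk-mazus/polka-1.1b"  # assumed repo ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumed precision
    device_map="auto",
)

prompt = "Wczoraj wieczorem"  # any Polish prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,
    temperature=0.7,  # assumed sampling settings
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```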