mfajcik committed on
Commit f9cf8ee
1 Parent(s): f0f46c6

Update README.md

Files changed (1)
  1. README.md +29 -3
README.md CHANGED
@@ -2,7 +2,7 @@
 license: apache-2.0
 ---
 # Introduction
- CSMPT7b is a large Czech language model continously pretrained from English [MPT7b](https://huggingface.co/mosaicml/mpt-7b) model. Model trained on ~67b token [Large Czech Collection](https://huggingface.co/datasets/BUT-FIT/but_lcc) has Czech tokenizer, obtained using our vocabulary swap method (see below).
+ CSMPT7b is a large Czech language model continuously pretrained for 272b training tokens from the English [MPT7b](https://huggingface.co/mosaicml/mpt-7b) model. The model was pretrained on the ~67b-token [Large Czech Collection](https://huggingface.co/datasets/BUT-FIT/but_lcc) using a Czech tokenizer, obtained with our vocabulary swap method (see below).
 
 # Eval
 Dev eval at CS-HellaSwag (automatically translated HellaSwag benchmark).
@@ -48,7 +48,33 @@ Figure 3: Test loss closeup, testing performed on split of internal-corpus #1. S
 
 
 ## Training Method
- tbd.
+ ### Vocabulary Swap
+ The vocabulary swap was done in the same way as for our [Czech-GPT-2](https://huggingface.co/BUT-FIT/Czech-GPT-2-XL-133k) model (see that model card for a comprehensive description).
+ We managed to align 4,177 English tokens with corresponding Czech tokens.
+
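To make the idea concrete, here is a minimal sketch of such a vocabulary swap, not the authors' actual procedure (that is described in the Czech-GPT-2 model card linked above). It assumes the aligned (English, Czech) token pairs are available as a hypothetical `aligned_pairs` list and that the 64k Czech tokenizer can be loaded from the linked Czech-GPT-2 repository; aligned tokens keep their English embeddings, the rest are mean-initialized before continued pretraining.

```python
# Illustrative sketch only: build an embedding matrix for the Czech vocabulary
# by re-using English MPT-7B embeddings for the aligned token pairs.
from transformers import AutoModelForCausalLM, AutoTokenizer

src_model = AutoModelForCausalLM.from_pretrained("mosaicml/mpt-7b", trust_remote_code=True)
src_tok = AutoTokenizer.from_pretrained("mosaicml/mpt-7b")
# Assumption: the 64k Czech tokenizer is taken from the linked Czech-GPT-2 repo.
tgt_tok = AutoTokenizer.from_pretrained("BUT-FIT/Czech-GPT-2-XL-133k")

src_emb = src_model.get_input_embeddings().weight.data   # (english_vocab_size, d_model)

# Hypothetical alignment table: (English token, Czech token) string pairs.
aligned_pairs: list[tuple[str, str]] = []

# Every Czech token starts from the mean English embedding ...
new_emb = src_emb.mean(dim=0, keepdim=True).repeat(len(tgt_tok), 1)

# ... and each aligned Czech token inherits the embedding of its English counterpart.
for en_token, cs_token in aligned_pairs:
    new_emb[tgt_tok.convert_tokens_to_ids(cs_token)] = src_emb[src_tok.convert_tokens_to_ids(en_token)]

# `new_emb` then replaces the model's embedding table before continued
# pretraining on the Czech corpus.
```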
+ ## Hyperparameters
+ Hyperparameters not mentioned here were kept the same as for MPT.
+ | **Name** | **Value** | **Note** |
+ |----------------------------|---------------|----------------------------------------------------------------------------------------------|
+ | training sw | llm-foundry | We've done some minor patching (e.g., to allow DDP sync over file) |
+ | dataset_type | Concat | Sequences at the model's input were concatenated up to `$max_seq_len`, separated by the EOS token. |
+ | tokenizer_size | 64k | Same as in [Czech-GPT-2](https://huggingface.co/BUT-FIT/Czech-GPT-2-XL-133k) |
+ | max_seq_len | 2048 | |
+ | batch_size | 1024 | |
+ | learning_rate | 1.0e-4 | |
+ | optimizer | LionW | |
+ | optimizer_betas | 0.9/0.95 | |
+ | optimizer_weight_decay | 0 | |
+ | optimizer_eps | 1.0e-08 | |
+ | gradient_clipping_max_norm | 1.0 | |
+ | attn_impl | flash2 | We used the Triton flash-attn 1 implementation for the initial ~60k steps. |
+ | positional_encoding | alibi | |
+ | fsdp | FULL_SHARD | (we had implementation issues with hybrid sharding in llm-foundry) |
+ | precision | bf16 | |
+ | scheduler | cosine | |
+ | scheduler_warmup | 100 steps | |
+ | scheduler_steps | 170,000 | |
+ | scheduler_alpha | 0.1 | So the LR at the last scheduled step is 0.1 × the peak LR |
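To make the scheduler rows concrete, here is a generic sketch of a warmup + cosine schedule with a final LR factor (`alpha`), plugged with the values from the table; it is an illustrative reimplementation, not llm-foundry's scheduler code.

```python
import math

def lr_at_step(step, peak_lr=1.0e-4, warmup_steps=100, total_steps=170_000, alpha=0.1):
    """Illustrative warmup + cosine schedule with a final LR factor (not llm-foundry code)."""
    if step < warmup_steps:
        # Linear warmup from 0 to the peak learning rate.
        return peak_lr * step / warmup_steps
    # Cosine decay from peak_lr down to alpha * peak_lr over the scheduled steps.
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (alpha + (1.0 - alpha) * cosine)

print(lr_at_step(100))      # ~1e-4: peak LR right after warmup
print(lr_at_step(170_000))  # ~1e-5: 0.1 * peak LR at the last scheduled step
```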
 
 
 # Usage
@@ -92,7 +118,7 @@ with torch.autocast('cuda', dtype=torch.bfloat16):
 
 ```
 # Training Data
- We release most (95.79%) of our training data corpus [BUT-Large Czech Collection](https://huggingface.co/datasets/BUT-FIT/but_lcc).
+ We release most (95.79%) of our training data corpus as [BUT-Large Czech Collection](https://huggingface.co/datasets/BUT-FIT/but_lcc).
 
 
 # Our Release Plan