Text Generation · Transformers · Safetensors · Czech · mpt · custom_code · text-generation-inference · Inference Endpoints
mfajcik committed · Commit a888a41 · verified · Parent(s): 538c5ba

Update README.md

Files changed (1): README.md (+3 -4)
README.md CHANGED
@@ -17,6 +17,7 @@ Training was done on [Karolina](https://www.it4i.cz/en) cluster.
 - [BUT-FIT/csmpt7b](https://huggingface.co/BUT-FIT/csmpt7b)
 
 # <span style="color:blue">Latest Updates</span>
+- 01/10/2024 We released [BenCzechMark](https://huggingface.co/spaces/CZLC/BenCzechMark), the first Czech evaluation suite for fair open-weights model comparison.
 - 18/04/2024 We released all our training checkpoints (in MosaicML format & packed using ZPAQ) at [czechllm.fit.vutbr.cz/csmpt7b/checkpoints/](https://czechllm.fit.vutbr.cz/csmpt7b/checkpoints/)
 - 06/05/2024 We released a small, manually annotated [dataset of adult content](https://huggingface.co/datasets/BUT-FIT/adult_content_classifier_dataset). We used a classifier trained on this dataset to filter our corpus.
 -
@@ -35,7 +36,6 @@ Dev eval at CS-HellaSwag (automatically translated HellaSwag benchmark).
 However, we ran validation on CS-HellaSwag over the course of training, and after 100k steps the improvements, if any, were very noisy.
 The improvement over mistral7b is not significant.
 
-We will release more evaluations together with our benchmark **BenCzechMark** soon (see release plan!).
 
 ## Loss
 We encountered loss spikes during training. As the model always recovered, and our budget for training the 7b model was very constrained, we kept on training. We observed such loss spikes before in our ablations. In these ablations (with GPT-2 small), we found these to be
@@ -146,8 +146,8 @@ We release most (95.79%) of our training data corpus as [BUT-Large Czech Collect
 | Stage | Description | Date |
 |---------------|----------------|----------------|
 | 1 | 'Best' model + training data | 13.03.2024
-| 2 | All checkpoints + training code | Checkpoints are released. Code won't be released; we used LLM Foundry with slight adjustments, but the version is outdated now.
-| 3 | __BenCzechMark__, a collection of Czech datasets for few-shot LLM evaluation. **Get in touch if you want to contribute!** | Soon
+| 2 | All checkpoints + training code | 10.04.2024 Checkpoints are released. Code won't be released; we used LLM Foundry with slight adjustments, but the version is outdated now.
+| 3 | __BenCzechMark__, a collection of Czech datasets for few-shot LLM evaluation. **Get in touch if you want to contribute!** | 01.10.2024
 | 4 | Preprint Publication |
 
 ## Getting in Touch
@@ -168,7 +168,6 @@ by the Ministry of Education, Youth and Sports of the Czech Republic through the
   title = {BenCzechMark: Machine Language Understanding Benchmark for Czech Language},
   journal = {arXiv preprint arXiv:insert-arxiv-number-here},
   year = {2024},
-  month = {March},
   eprint = {insert-arxiv-number-here},
   archivePrefix = {arXiv},
   primaryClass = {cs.CL},
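For context on using the model this commit documents: the tags above (Transformers, `custom_code`, Text Generation) imply loading through the standard transformers causal-LM API with remote code enabled. Below is a minimal sketch, not the authors' documented usage; the dtype choice and generation settings are illustrative assumptions.

```python
# Minimal loading sketch for BUT-FIT/csmpt7b. The `custom_code` tag means the
# repository ships its own MPT modeling code, so trust_remote_code=True is needed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "BUT-FIT/csmpt7b"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumption: bf16 so the 7b model fits one GPU
    trust_remote_code=True,
)

# Czech prompt: "The most famous Czech writer is"
inputs = tokenizer("Nejznámější český spisovatel je", return_tensors="pt")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```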
 
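The 18/04/2024 update in the first hunk points at training checkpoints packed with ZPAQ. A hypothetical fetch-and-extract sketch follows; the archive name is a placeholder (the real file names are listed on the index page), and a `zpaq` binary on PATH is assumed.

```python
# Hypothetical sketch: download one ZPAQ-packed checkpoint and extract it.
import subprocess
import urllib.request

BASE_URL = "https://czechllm.fit.vutbr.cz/csmpt7b/checkpoints/"
archive = "checkpoint.zpaq"  # placeholder name; check the index page for real files

urllib.request.urlretrieve(BASE_URL + archive, archive)
# `zpaq x <archive>` extracts the archive into the current directory.
subprocess.run(["zpaq", "x", archive], check=True)
```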