Training Data Request
Will the data that was used in training this model be open sourced?
I plan on trying to reproduce some of your experiments.
Hi.
The data used to train this model is described in the model card:
"Through continual pretraining on a combination of the ZelaiHandi dataset, containing approximately 1.5 billion high-quality Basque tokens, and a selected subset of the FineWeb dataset, around 300 million tokens"
Both of them are already publicly available:
ZelaiHandi (EU) https://huggingface.co/datasets/orai-nlp/ZelaiHandi
FineWeb (EN) https://huggingface.co/datasets/HuggingFaceFW/fineweb
I believe the subset used from FineWeb was randomly sampled from the official "sample-10BT" FineWeb subset, but if you need the exact subset employed for training for reproducibility, maybe @andercorral might be able to help you.
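In case it helps you get started, here is a rough sketch of how you could load both datasets with the `datasets` library. The seed, sample size, and split names below are placeholders, not the values used for training:

```python
from datasets import load_dataset

# ZelaiHandi: ~1.5B high-quality Basque tokens (split name assumed here).
zelaihandi = load_dataset("orai-nlp/ZelaiHandi", split="train")

# FineWeb: stream the official "sample-10BT" subset and draw a random sample.
# NOTE: seed and sample size are illustrative, not the ones used for training.
fineweb = load_dataset(
    "HuggingFaceFW/fineweb",
    name="sample-10BT",
    split="train",
    streaming=True,
)
fineweb_sample = fineweb.shuffle(seed=42, buffer_size=10_000).take(100_000)

# Inspect one document from the sampled stream.
for doc in fineweb_sample:
    print(doc["text"][:200])
    break
```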