Training Data Request
Will the data that was used in training this model be open sourced?
I plan on trying to reproduce some of your experiments.
Hi.
The data used to train this model is described in the model card:
"Through continual pretraining on a combination of the ZelaiHandi dataset, containing approximately 1.5 billion high-quality Basque tokens, and a selected subset of the FineWeb dataset, around 300 million tokens"
Both of them are already publicly available:
ZelaiHandi (EU) https://huggingface.co/datasets/orai-nlp/ZelaiHandi
FineWeb (EN) https://huggingface.co/datasets/HuggingFaceFW/fineweb
I believe the subset used from FineWeb was randomly sampled from the official "sample-10BT" FineWeb subset, but if you need the exact subset employed for training for reproducibility, maybe @andercorral might be able to help you.
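In case it helps you get started, here is a rough sketch of how you could load both datasets with the `datasets` library. The seed, sample size, and split names below are placeholders, not the values used for training:

```python
from datasets import load_dataset

# ZelaiHandi: ~1.5B high-quality Basque tokens (split name assumed here).
zelaihandi = load_dataset("orai-nlp/ZelaiHandi", split="train")

# FineWeb: stream the official "sample-10BT" subset and draw a random sample.
# NOTE: seed and sample size are illustrative, not the ones used for training.
fineweb = load_dataset(
    "HuggingFaceFW/fineweb",
    name="sample-10BT",
    split="train",
    streaming=True,
)
fineweb_sample = fineweb.shuffle(seed=42, buffer_size=10_000).take(100_000)

# Inspect one document from the sampled stream.
for doc in fineweb_sample:
    print(doc["text"][:200])
    break
```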