Training Data Request

#1 by Owos - opened

Will the data used to train this model be open-sourced?
I plan on trying to reproduce some of your experiments.

Orai NLP technologies org

Hi.

The data used to train this model is described in the model card:
"Through continual pretraining on a combination of the ZelaiHandi dataset, containing approximately 1.5 billion high-quality Basque tokens, and a selected subset of the FineWeb dataset, around 300 million tokens"

Both of them are already publicly available:

ZelaiHandi (EU) https://huggingface.co/datasets/orai-nlp/ZelaiHandi
FineWeb (EN) https://huggingface.co/datasets/HuggingFaceFW/fineweb
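
For reference, both corpora can be pulled straight from the Hub with the `datasets` library. A minimal sketch (the split names and the streaming choice are my assumptions, not details from the model card):

```python
from datasets import load_dataset

# Basque corpus used for continual pretraining (~1.5B tokens).
zelaihandi = load_dataset("orai-nlp/ZelaiHandi", split="train")

# Official FineWeb sample-10BT subset; streaming avoids downloading
# the full ~10B-token sample up front.
fineweb = load_dataset(
    "HuggingFaceFW/fineweb", name="sample-10BT",
    split="train", streaming=True,
)
```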

I believe the FineWeb subset was randomly sampled from the official "sample-10BT" subset, but if you need the exact subset used in training for reproducibility, @andercorral might be able to help you.
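
In the meantime, something like the following would approximate such a subset: randomly sample documents from sample-10BT until roughly 300M tokens are collected. This is only a sketch; the seed and shuffle buffer size are hypothetical, and it relies on the `token_count` field (GPT-2 tokens) that FineWeb rows carry, rather than re-tokenizing:

```python
from datasets import load_dataset

TARGET_TOKENS = 300_000_000  # ~300M tokens, per the model card
SEED = 42                    # hypothetical; the original sampling seed is not published

stream = load_dataset(
    "HuggingFaceFW/fineweb", name="sample-10BT",
    split="train", streaming=True,
)
# Approximate a random sample with a buffered shuffle of the stream.
stream = stream.shuffle(seed=SEED, buffer_size=10_000)

total, docs = 0, []
for example in stream:
    docs.append(example["text"])
    total += example["token_count"]  # precomputed GPT-2 token count per document
    if total >= TARGET_TOKENS:
        break

print(f"Collected {len(docs)} documents, ~{total:,} tokens")
```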
