Training data sources

#2 opened by BramVanroy

Hello

Boreas is without a doubt a very powerful model! After evaluating it on some benchmarks, I have questions about the training data. As an example, it achieves 94.6 F1 on the DBRD dataset in a zero-shot setting. That is on par with the SOTA, a finetuned encoder model, and far better than, for instance, GPT-3.5. So I am curious what exactly the training data was for pretraining and instruction tuning. Do you have a list of sources somewhere? I'd be very interested!
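For reference, a zero-shot DBRD evaluation of a chat model can be set up roughly like the sketch below. This is a simplified illustration only: the prompt wording, the label parsing, and the model path are assumptions, not the exact setup I used.

```python
# Simplified zero-shot DBRD evaluation sketch. The prompt, label mapping and
# model path below are illustrative assumptions, not the exact harness used.
from datasets import load_dataset
from sklearn.metrics import f1_score
from transformers import pipeline

MODEL_ID = "yhavinga/Boreas-7B-chat"  # placeholder: use the actual Hub path

# Assumption: DBRD is available on the Hub with "text" and "label" columns
# (label 1 = positive).
dbrd_test = load_dataset("dbrd", split="test")

generator = pipeline("text-generation", model=MODEL_ID, device_map="auto")

PROMPT = (
    "Is de volgende boekrecensie positief of negatief? "
    "Antwoord met één woord: positief of negatief.\n\n"
    "Recensie: {review}\n\nAntwoord:"
)

preds, labels = [], []
for example in dbrd_test:
    out = generator(
        PROMPT.format(review=example["text"][:2000]),
        max_new_tokens=5,
        do_sample=False,
    )[0]["generated_text"]
    # Take whatever follows the final "Antwoord:" as the model's answer.
    answer = out.split("Antwoord:")[-1].lower()
    preds.append(1 if "positief" in answer else 0)
    labels.append(example["label"])

print("F1 (positive class):", f1_score(labels, preds))
```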

Thanks!

Hi Bram,

Thanks for evaluating the model!

The datasets used are described in the model card. They consist of both public datasets and some private ones I've constructed myself.
As far as the private datasets go, I am pretty sure that they do not contain DBRD test data.
The public datasets used during pre-training and fine-tuning to create Boreas-7B-chat are:

  • yhavinga/mc4_nl_cleaned
  • euirim/goodwiki
  • philschmid/flanv2
  • teknium/OpenHermes-2.5

If there is DBRD test set contamination, my guess is that it comes from the inclusion of flanv2.
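If you want to check that yourself, a rough overlap scan could look something like the sketch below. The split name, the column handling, and the substring heuristic are assumptions on my part; a more careful check would use n-gram overlap.

```python
# Rough sketch of a DBRD test-set contamination check against flanv2.
# Field names, the split, and the matching heuristic are assumptions.
import re

from datasets import load_dataset

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so near-identical copies still match."""
    return re.sub(r"\s+", " ", text.lower()).strip()

dbrd_test = load_dataset("dbrd", split="test")
flan = load_dataset("philschmid/flanv2", split="train", streaming=True)

# Index short prefixes of the DBRD test reviews to keep the lookup cheap.
prefixes = {normalize(ex["text"])[:200] for ex in dbrd_test}

hits = 0
# Limit the scan for illustration; drop .take() to scan everything (slow).
for row in flan.take(100_000):
    content = normalize(" ".join(str(v) for v in row.values()))
    if any(p and p in content for p in prefixes):
        hits += 1

print(f"Potential DBRD test reviews found in flanv2 sample: {hits}")
```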

The Boreas-10.7B-chat and Boreas-Qwen2-7B-chat models were trained on these public datasets:

  • euirim/goodwiki
  • teknium/OpenHermes-2.5
  • yhavinga/Openhermes-2.5-dutch-46k
  • diabolic6045/flanv2_cot_alpeca

HTH!

Awesome, thanks for the overview!

BramVanroy changed discussion status to closed
