Training data sources

#2 opened by BramVanroy

Hello

Boreas is without a doubt a very powerful model! After evaluating it on some benchmarks, I have questions about the training data. As an example, it achieves 94.6 F1 on the DBRD dataset in a zero-shot setting. That is on par with the SOTA, a finetuned encoder model, and far better than, for instance, GPT-3.5. So I am curious what exactly the training data was for pretraining and instruction tuning. Do you have a list of sources somewhere? I'd be very interested!
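For reference, a zero-shot DBRD evaluation of a chat model can be set up roughly like the sketch below. This is a simplified illustration only: the prompt wording, the label parsing, and the model path are assumptions, not the exact setup I used.

```python
# Simplified zero-shot DBRD evaluation sketch. The prompt, label mapping and
# model path below are illustrative assumptions, not the exact harness used.
from datasets import load_dataset
from sklearn.metrics import f1_score
from transformers import pipeline

MODEL_ID = "yhavinga/Boreas-7B-chat"  # placeholder: use the actual Hub path

# Assumption: DBRD is available on the Hub with "text" and "label" columns
# (label 1 = positive).
dbrd_test = load_dataset("dbrd", split="test")

generator = pipeline("text-generation", model=MODEL_ID, device_map="auto")

PROMPT = (
    "Is de volgende boekrecensie positief of negatief? "
    "Antwoord met één woord: positief of negatief.\n\n"
    "Recensie: {review}\n\nAntwoord:"
)

preds, labels = [], []
for example in dbrd_test:
    out = generator(
        PROMPT.format(review=example["text"][:2000]),
        max_new_tokens=5,
        do_sample=False,
    )[0]["generated_text"]
    # Take whatever follows the final "Antwoord:" as the model's answer.
    answer = out.split("Antwoord:")[-1].lower()
    preds.append(1 if "positief" in answer else 0)
    labels.append(example["label"])

print("F1 (positive class):", f1_score(labels, preds))
```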

Thanks!

Hi Bram,

Thanks for evaluating the model!

The datasets used are described in the model card. They consist of both public datasets and some private ones I've constructed myself.
As far as the private datasets go, I am pretty sure that they do not contain DBRD test data.
The public datasets used during pre-training and fine-tuning to create Boreas-7B-chat are:

  • yhavinga/mc4_nl_cleaned
  • euirim/goodwiki
  • philschmid/flanv2
  • teknium/OpenHermes-2.5

If there is DBRD test set contamination, my guess is that it comes from the inclusion of flanv2.
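If you want to check that yourself, a rough overlap scan could look something like the sketch below. The split name, the column handling, and the substring heuristic are assumptions on my part; a more careful check would use n-gram overlap.

```python
# Rough sketch of a DBRD test-set contamination check against flanv2.
# Field names, the split, and the matching heuristic are assumptions.
import re

from datasets import load_dataset

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so near-identical copies still match."""
    return re.sub(r"\s+", " ", text.lower()).strip()

dbrd_test = load_dataset("dbrd", split="test")
flan = load_dataset("philschmid/flanv2", split="train", streaming=True)

# Index short prefixes of the DBRD test reviews to keep the lookup cheap.
prefixes = {normalize(ex["text"])[:200] for ex in dbrd_test}

hits = 0
# Limit the scan for illustration; drop .take() to scan everything (slow).
for row in flan.take(100_000):
    content = normalize(" ".join(str(v) for v in row.values()))
    if any(p and p in content for p in prefixes):
        hits += 1

print(f"Potential DBRD test reviews found in flanv2 sample: {hits}")
```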

The Boreas-10.7B-chat and Boreas-Qwen2-7B-chat models were trained on these public datasets:

  • euirim/goodwiki
  • teknium/OpenHermes-2.5
  • yhavinga/Openhermes-2.5-dutch-46k
  • diabolic6045/flanv2_cot_alpeca

HTH!

Awesome, thanks for the overview!

BramVanroy changed discussion status to closed
