LLM360/K2 · Clarify Table for Datasets and Mix

May 31, 2024

The table in the README shows only 1.3T tokens, which I assume means it covers only stage 1 training. Figure 1 in the paper gives an overview of the stage 1 and stage 2 training data, but beyond that figure there is little information about what data went into stage 2. In particular, the table on the K2 dataset page incorrectly shows the chat finetune table as the stage 2 data mix: according to the paper, stage 2 training used 69.3B tokens, not 2.9B, and the figure in the paper makes it clear that the two data mixes are different.
I also made a discussion post on the K2 dataset page regarding the incorrect table given there.

Lastly, I just want to say thank you! I don't think there are nearly enough fully open source models, so I really appreciate all the work the LLM360 team is doing!

hunterhector

LLM360 org Jun 1, 2024

Thanks for spotting these issues. I believe I have answered them in your other thread: https://huggingface.co/datasets/LLM360/K2Datasets/discussions/2.

Let's discuss there and I will close this thread to avoid confusion.

And thanks a lot for the kind words!

hunterhector changed discussion status to closed Jun 1, 2024