RAG dataset & method release?
#1 opened by pszemraj
Hi, great work on this model, and a really interesting research direction too!
"a new dataset of 45,088,768,000 tokens modeling common retrieval tasks."
I wanted to ask whether the new RAG dataset mentioned in the README (quoted above) will be released, and/or the methodology or code for creating it from a generic large corpus. I understand it's derived from the Common Corpus, but it would be great to know how.
I have a similar question: since this model went through an SFT stage, if not more (e.g. RLHF), where are the instruct datasets?