RAG dataset & method release?

#1
by pszemraj - opened

Hi, great work on this model, and a really interesting research direction too!

"a new dataset of 45,088,768,000 tokens modeling common retrieval tasks."

I wanted to ask whether the new RAG dataset mentioned in the README will be released, and/or the methodology or code used to create it from a generic large corpus. I understand it's derived from the Common Corpus, but it would be great to know how.

I also have a similar question: since this model went through an SFT stage, if not more (RLHF), where are the instruct datasets?
