Training Data Sources

#7
by II-Matto - opened

Thanks for the great work! The model card says training uses "600k instructions/answers", which is a tremendous amount of data. How was this SFT data obtained, and was it carefully checked to make sure it does not include samples similar to those in HumanEval?

CodeFuse AI org

Check out our datasets: https://huggingface.co/datasets/codefuse-ai/CodeExercise-Python-27k and https://huggingface.co/datasets/codefuse-ai/Evol-instruction-66k. We have done some decontamination work via similarity checking. You could still improve on this process.
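For anyone who wants to run their own check, here is a minimal sketch of one common way to do similarity-based decontamination: word n-gram Jaccard overlap between each training sample and each benchmark prompt. This is not necessarily the method used for the datasets above; the function names, the trigram size, and the 0.5 threshold are all assumptions chosen for illustration.

```python
import re


def ngrams(text, n=3):
    """Return the set of lowercase word n-grams in `text`."""
    tokens = re.findall(r"\w+", text.lower())
    if len(tokens) < n:
        # Short texts: fall back to a single tuple of all tokens.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def decontaminate(train_samples, benchmark_prompts, threshold=0.5, n=3):
    """Keep only training samples whose n-gram overlap with every
    benchmark prompt stays below `threshold` (a hypothetical cutoff)."""
    bench_grams = [ngrams(p, n) for p in benchmark_prompts]
    kept = []
    for sample in train_samples:
        grams = ngrams(sample, n)
        if all(jaccard(grams, b) < threshold for b in bench_grams):
            kept.append(sample)
    return kept


if __name__ == "__main__":
    # Toy data for illustration only, not taken from any real dataset.
    benchmark = ["Write a function that returns the sum of two integers a and b."]
    train = [
        "Write a function that returns the sum of two integers a and b.",
        "Implement binary search over a sorted list of strings.",
    ]
    print(decontaminate(train, benchmark))
```

Surface n-gram overlap only catches near-verbatim matches. Paraphrased or translated benchmark problems would need something stronger, such as embedding similarity, so treat this as a first-pass filter.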

chencyudel changed discussion status to closed