Training Data Sources
Thanks for the great work! The model card says the training takes "600k instructions/answers", which is a tremendous amount of data. I was wondering how this SFT data was obtained and whether it was carefully checked to exclude samples similar to those in HumanEval.
Check out our datasets: https://huggingface.co/datasets/codefuse-ai/CodeExercise-Python-27k and https://huggingface.co/datasets/codefuse-ai/Evol-instruction-66k. We have done some decontamination work via similarity checking, though this process could still be improved.
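For readers who want to run their own check, here is a minimal sketch of one common form of similarity-based decontamination: word-level n-gram Jaccard overlap against the benchmark prompts. This is not necessarily the method used for the datasets above. The function names, the trigram size, and the 0.5 threshold are all illustrative assumptions.

```python
import re


def ngrams(text, n=3):
    """Return the set of lowercase word n-grams in `text`."""
    tokens = re.findall(r"\w+", text.lower())
    if len(tokens) < n:
        # Very short texts: fall back to a single gram of all tokens.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def decontaminate(train_samples, benchmark_prompts, threshold=0.5, n=3):
    """Keep only training samples whose n-gram overlap with every
    benchmark prompt is below `threshold` (an assumed cutoff)."""
    bench = [ngrams(p, n) for p in benchmark_prompts]
    kept = []
    for sample in train_samples:
        grams = ngrams(sample, n)
        if all(jaccard(grams, b) < threshold for b in bench):
            kept.append(sample)
    return kept


if __name__ == "__main__":
    # Toy strings for illustration only, not real dataset entries.
    train = [
        "Write a function that checks if a number is prime.",
        "Write a function that returns the sum of a list of integers.",
    ]
    bench = ["Write a function that checks if a given number is prime."]
    print(decontaminate(train, bench))
```

A lower threshold removes more near-duplicates at the cost of discarding more legitimate samples. Paraphrased problems with little shared wording would pass this kind of check, so embedding-based similarity is a possible complement.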