codefuse-ai/CodeFuse-CodeLlama-34B · Training Data Sources

Oct 19, 2023

Thanks for the great work! The model card says the training takes "600k instructions/answers", which is a tremendous amount of data. I was wondering how these SFT data were obtained and whether they were carefully checked to exclude samples similar to those in HumanEval.

chencyudel

CodeFuse AI org Nov 29, 2023

Check out our datasets https://huggingface.co/datasets/codefuse-ai/CodeExercise-Python-27k and https://huggingface.co/datasets/codefuse-ai/Evol-instruction-66k. We have done some decontamination work via similarity checking. You could still improve this process.
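
For readers wondering what decontamination via similarity checking might look like, here is a minimal sketch. It is not the CodeFuse pipeline: the thread does not say which similarity measure or threshold was used, so the word 3-gram Jaccard similarity, the 0.5 threshold, and the function names `ngrams`, `jaccard`, and `decontaminate` are all assumptions made for illustration.

```python
import re


def ngrams(text, n=3):
    """Return the set of lowercase word n-grams in `text`."""
    tokens = re.findall(r"\w+", text.lower())
    if len(tokens) < n:
        # Very short texts are treated as a single (shorter) gram.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def decontaminate(train_samples, benchmark_prompts, threshold=0.5, n=3):
    """Keep only training samples whose similarity to every benchmark
    prompt is below `threshold` (an assumed, illustrative value)."""
    bench_grams = [ngrams(p, n) for p in benchmark_prompts]
    kept = []
    for sample in train_samples:
        grams = ngrams(sample, n)
        if all(jaccard(grams, b) < threshold for b in bench_grams):
            kept.append(sample)
    return kept


# Hypothetical data, made up for this example (not taken from HumanEval).
benchmark = ["Write a function that returns the sum of two integers a and b."]
train = [
    "Write a function that returns the sum of two integers a and b.",
    "Implement a binary search over a sorted list of strings.",
]
kept = decontaminate(train, benchmark)
```

Here the first training sample is identical to the benchmark prompt and is dropped, while the second shares no 3-grams with it and is kept. A check like this only catches near-verbatim overlap; paraphrased problems would need something stronger, such as embedding-based similarity.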

chencyudel changed discussion status to closed Nov 29, 2023