Train this model on coding datasets to make a BEAST coding model
I bet if you trained this model on one, several, or even all of these coding datasets, it would beat WizardCoder hands down. I'm not sure whether it would be better to train llama-2-13b on these datasets first and then on the Guanaco QLoRA, or the other way around (rough sketch after the list below):
https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1
https://huggingface.co/datasets/sahil2801/CodeAlpaca-20k
https://huggingface.co/datasets/sahil2801/code_instructions_120k
https://huggingface.co/datasets/codeparrot/github-code-clean
https://huggingface.co/datasets/razent/wizardlm-code-evol-32k
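For reference, a minimal, untested sketch of what one stage of that two-stage QLoRA run could look like with transformers + peft + bitsandbytes. The prompt format, the column names, and every hyperparameter are assumptions for illustration, not something anyone in this thread has actually run:

```python
# Rough two-stage QLoRA sketch, assuming transformers, peft, bitsandbytes,
# and datasets are installed. Column names, prompt format, and hyperparameters
# are guesses for illustration only.
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

BASE = "meta-llama/Llama-2-13b-hf"

bnb = BitsAndBytesConfig(load_in_4bit=True,
                         bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)

tokenizer = AutoTokenizer.from_pretrained(BASE)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(BASE, quantization_config=bnb,
                                             device_map="auto")
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, LoraConfig(r=64, lora_alpha=16, lora_dropout=0.05,
                                         task_type="CAUSAL_LM"))

def tokenize(batch):
    # Assumes "instruction"/"output" columns (what Evol-Instruct-Code-80k-v1 appears to use).
    text = [f"### Instruction:\n{i}\n\n### Response:\n{o}"
            for i, o in zip(batch["instruction"], batch["output"])]
    return tokenizer(text, truncation=True, max_length=2048)

# Stage 1: one of the coding datasets listed above.
code_ds = load_dataset("nickrosh/Evol-Instruct-Code-80k-v1", split="train")
code_ds = code_ds.map(tokenize, batched=True, remove_columns=code_ds.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="llama2-13b-code-qlora",
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=16,
                           num_train_epochs=1,
                           learning_rate=2e-4,
                           bf16=True,
                           logging_steps=10),
    train_dataset=code_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

# Stage 2: repeat the same loop on the Guanaco data (e.g. the
# openassistant-guanaco dataset), or swap the order of the two stages.
```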
Actually, forget those datasets; I made my own if you want to use it. It would be much easier. I tried training the model myself, but I kept getting errors and I'm not experienced enough to fix them.
Here's my dataset:
https://huggingface.co/datasets/rombodawg/MegaCodeTraining112k
Training on such a large dataset is outside my budget. To put things into perspective, the Guanaco dataset is only 20.9 MB versus your 433 MB dataset.
I can try to help you debug the training issue you are having though. What script are you using to train and what are the errors?
Why not the datasets listed above?
Because the one I made is much cleaner and more concise; it includes both the 80k and the 32k combined. It doesn't include code_instructions_120k, so you can train on that separately, since it's formatted differently (see the sketch below).
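If it helps, something like this would let you check the two formats side by side (the "train" split name and the idea of eyeballing column_names are assumptions, not anything from the dataset cards):

```python
# Rough sketch: load the combined set and keep code_instructions_120k separate,
# since it's formatted differently. The "train" split name is an assumption;
# check each dataset card for the actual schema before writing prompt templates.
from datasets import load_dataset

combined = load_dataset("rombodawg/MegaCodeTraining112k", split="train")
separate = load_dataset("sahil2801/code_instructions_120k", split="train")

print(combined.column_names)  # the 80k + 32k data merged into one set
print(separate.column_names)  # different format, so it needs its own prompt template / run
```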