Retraining CodeParrot on a custom dataset
I am trying to generate Verilog code using CodeParrot. My dataset consists of Verilog code examples. I want the model to autocomplete code after I write some sample code. The following is an example from the dataset:
module AND(A,B,O);
input A,B;
output O;
assign O = A & B;
endmodule
I trained the model on examples like this one. But when I use input names other than A/B, the model either gets confused or keeps using A/B. It seems the model is overfitting to the training data. I want the model to understand what the incoming inputs are and write code accordingly. It's basically a text-to-code problem. Please suggest an appropriate way to overcome this issue. I am using 8,000 training examples.
Hi, it's hard to say without more details. How are you doing the training (are you using the CodeParrot training script?), and which CodeParrot checkpoint are you fine-tuning from?
It also depends on how good and diverse your data is. What is the size of your dataset, in MB/GB or in number of tokens?
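To get a first answer to the size and diversity question, something like the sketch below might help. It assumes the examples are already loaded as a list of strings (the `examples` list and the `dataset_stats` helper are hypothetical names), uses whitespace splitting as a crude stand-in for the real tokenizer, and only handles simple `input`/`output` declarations like the one in the question.

```python
import re
from collections import Counter

def dataset_stats(examples):
    """Return size in MB, a rough token count, and port-name frequencies."""
    total_bytes = sum(len(e.encode("utf-8")) for e in examples)
    # Whitespace split is only a rough proxy for the model's tokenizer.
    rough_tokens = sum(len(e.split()) for e in examples)
    port_names = Counter()
    for e in examples:
        # Collect names declared on lines such as "input A,B;" or "output O;".
        for decl in re.findall(r"\b(?:input|output)\b([^;]*);", e):
            for item in decl.split(","):
                parts = item.split()
                if parts:
                    # Keep the last word so "[3:0] A" or "wire A" count as "A".
                    port_names[parts[-1]] += 1
    return {
        "size_mb": total_bytes / 1e6,
        "rough_tokens": rough_tokens,
        "port_names": port_names,
    }

# The single example from the question, used here only for illustration.
examples = [
    "module AND(A,B,O);\ninput A,B;\noutput O;\nassign O = A & B;\nendmodule"
]
stats = dataset_stats(examples)
print(stats["size_mb"], stats["rough_tokens"], stats["port_names"].most_common(5))
```

If a handful of names such as A, B and O dominate `port_names` across the whole dataset, that would be consistent with the model sticking to those names at generation time.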
You can also plot the validation loss during training and stop training before the model overfits.
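As a minimal sketch of that idea, the function below applies a simple patience rule to a list of validation losses recorded at each evaluation. The helper name `early_stop_step`, its parameters, and the loss values are all made up for illustration; they are not taken from the CodeParrot script.

```python
def early_stop_step(val_losses, patience=3, min_delta=0.0):
    """Return (best_index, stop_index) for a sequence of validation losses.

    best_index is the evaluation with the lowest loss seen so far;
    stop_index is the evaluation at which training would stop, or None
    if the patience budget is never used up.
    """
    best_loss = float("inf")
    best_index = None
    bad_evals = 0
    for i, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # Improvement: remember it and reset the patience counter.
            best_loss, best_index, bad_evals = loss, i, 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return best_index, i
    return best_index, None

# Illustrative numbers only: the loss falls, then starts rising again.
losses = [2.1, 1.6, 1.3, 1.25, 1.31, 1.40, 1.52]
best, stop = early_stop_step(losses, patience=3)
print(best, stop)
```

Here you would keep the checkpoint saved at evaluation `best` rather than the last one. If you train with the `Trainer` API instead of a custom loop, `transformers` also provides an `EarlyStoppingCallback` that plays a similar role.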
Note that we also have the forums for questions about training/transformers. You can find some threads related to overfitting there.