May I ask where the training code is, and whether the training data includes the ABAP language?
You can find the training codebase here: https://github.com/bigcode-project/Megatron-LM/tree/multi-query-attention There's also a repo for fine-tuning with PEFT or DeepSpeed: https://github.com/bigcode-project/starcoder
We didn't include ABAP; you can find the full list of languages included in training in Table 1 of the paper. But we do have ABAP in The Stack dataset if you want to try fine-tuning.
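In case it's useful for pulling that data, here is a minimal sketch of loading the ABAP subset of The Stack from the Hugging Face Hub with the datasets library; the exact data_dir name is an assumption, so check the dataset card.

from datasets import load_dataset

# Stream the ABAP subset of The Stack; data_dir="data/abap" is assumed from the
# dataset card's per-language layout, verify it before running.
ds = load_dataset("bigcode/the-stack", data_dir="data/abap", split="train", streaming=True)

for sample in ds.take(3):
    print(sample["content"][:200])  # each record holds the source file content plus metadata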
Thank you very much for your reply.
Do you have any suitable parameter suggestions for fine-tuning?
You can start from the default parameters in the repo and tune them if needed.
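For a rough idea of the PEFT route, here is a minimal LoRA fine-tuning sketch with commonly used starting values; the hyperparameters and target module names are illustrative assumptions, not the repo's exact defaults.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model

model_id = "bigcode/starcoderbase"
tokenizer = AutoTokenizer.from_pretrained(model_id)  # use this to tokenize your ABAP dataset
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# LoRA adapter config; r/alpha/dropout are common starting points to tune from.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["c_attn", "c_proj"],  # attention projections in the GPTBigCode blocks (assumed)
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Typical single-GPU settings: small micro-batch with gradient accumulation.
training_args = TrainingArguments(
    output_dir="./starcoder-abap-lora",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=5e-5,
    warmup_steps=100,
    max_steps=1000,
    bf16=True,
    logging_steps=10,
)
# Pass training_args and your tokenized ABAP dataset to transformers.Trainer as usual.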
If using ABAP data from the Stack dataset, how long is the estimated fine-tuning time? I have an A100 80G machine.
I would like to do a budget assessment first.
I saw a file for fine-tuning StarCoder. Is this file about fine-tuning StarCoderBase into StarCoder?
https://github.com/bigcode-project/Megatron-LM/blob/finetune-starcoder/examples/finetune_bigcode_model.slurm
Where can I get the files referenced by these paths?
STARCODER_PATH=/fsx/boomcode/starcoder/
CHECKPOINT_PATH=/fsx/boomcode/starcoderpy/$SLURM_JOB_ID
TOKENIZER_FILE=/fsx/boomcode/tokenizer-starcoder/tokenizer.json
WEIGHTS_TRAIN=/fsx/boomcode/datamix_python/train_data_paths.txt.tmp
WEIGHTS_VALID=/fsx/boomcode/datamix_python/valid_data_paths.txt.tmp
DATA_PATH=/fsx/boomcode/tokenized/python/
I have the same question: where do I get or generate these files?
WEIGHTS_TRAIN=/fsx/boomcode/datamix_python/train_data_paths.txt.tmp
WEIGHTS_VALID=/fsx/boomcode/datamix_python/valid_data_paths.txt.tmp
To generate the data weights you can use this repo: https://github.com/bigcode-project/bigcode-data-mix#2---substitute-the-data-path.
For short trainings or non-distributed setups (one A100 in your case), using PEFT as indicated here: https://github.com/bigcode-project/starcoder would be faster and easier to set up. Otherwise, full fine-tuning could be expensive: for reference, fine-tuning StarCoderBase on 35B Python tokens to get StarCoder took ~2 days on 512 GPUs. In your case ABAP has much less data than Python, so it would take much less time, but full fine-tuning could still be slow on one A100.
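For the budget assessment, here is a back-of-envelope scaling of those reference numbers (35B tokens, 512 GPUs, ~2 days) to a single A100, assuming roughly linear scaling and a placeholder ABAP token count:

# Rough time estimate by linearly scaling the StarCoder fine-tuning reference.
reference_tokens = 35e9   # Python tokens used to go from StarCoderBase to StarCoder
reference_gpus = 512
reference_days = 2
gpu_days_per_token = reference_gpus * reference_days / reference_tokens

abap_tokens = 0.3e9       # placeholder, substitute the real ABAP token count from The Stack
my_gpus = 1
estimated_days = abap_tokens * gpu_days_per_token / my_gpus
print(f"~{estimated_days:.0f} days on a single A100")  # ~9 days under these assumptions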
Thanks, but how do I generate the files in the "gpt2-preprocessed_content_with_meta_document" folder from the raw .parquet files?
You need to tokenize the data with Megatron-LM; see their README.
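For reference, one way to get there is to convert the .parquet files to JSON lines and then run Megatron-LM's preprocessing script; the folder names are placeholders and the exact tokenizer flags depend on the bigcode fork, so treat the command in the comments as indicative, not exact.

import glob
import pandas as pd

# Convert raw parquet shards (hypothetical local folder) into the JSON-lines
# format Megatron-LM's tools/preprocess_data.py expects: one {"text": ...} per line.
frames = []
for path in glob.glob("abap_parquet/*.parquet"):
    df = pd.read_parquet(path, columns=["content"])
    frames.append(df.rename(columns={"content": "text"}))  # Megatron reads the "text" key by default

pd.concat(frames).to_json("abap.jsonl", orient="records", lines=True, force_ascii=False)

# Then tokenize with Megatron-LM (flags indicative, check the fork's README):
# python tools/preprocess_data.py \
#     --input abap.jsonl \
#     --output-prefix abap_gpt2 \
#     --tokenizer-type TokenizerFromFile --tokenizer-file $TOKENIZER_FILE \
#     --append-eod --workers 8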
What directory does CHECKPOINT_PATH refer to? Where can I find the CHECKPOINT_PATH for StarCoder?
It's the path where the checkpoints are saved; for us it was specific to our cluster.