May I ask where the training code is, and does the training data include the ABAP language?

#11
by leelichen - opened

May I ask where the training code is, and does the training data include the ABAP language?

BigCode org

You can find the training codebase here: https://github.com/bigcode-project/Megatron-LM/tree/multi-query-attention
There's also a repo for fine-tuning with PEFT or DeepSpeed: https://github.com/bigcode-project/starcoder
We didn't include ABAP; you can find the full list of languages included in the training in Table 1 of the paper. But we do have ABAP in The Stack dataset if you want to try fine-tuning.
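
For reference, pulling the ABAP subset out of The Stack with the datasets library looks roughly like the sketch below. It is a minimal example under assumptions: the subset is assumed to live under data/abap, and streaming is used so the whole dataset isn't downloaded; check the bigcode/the-stack dataset card (and accept its terms) before running.

# Minimal sketch (not from the thread): load the ABAP subset of The Stack.
# Assumes the language directory is data/abap and that you are logged in
# to the Hugging Face Hub with access to bigcode/the-stack.
from datasets import load_dataset

abap = load_dataset(
    "bigcode/the-stack",
    data_dir="data/abap",  # assumed directory name for the ABAP subset
    split="train",
    streaming=True,        # stream instead of downloading everything
)

for example in abap.take(3):
    print(example["content"][:200])  # the source code lives in "content"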

Thank you very much for your reply.
Do you have any suitable parameter suggestions for fine-tuning?

BigCode org

You can start from the default parameters in the repo and tune them if needed.

If I use the ABAP data from The Stack dataset, how long would fine-tuning take? I have an A100 80GB machine.
I would like to do a budget assessment first.

I saw a file for fine-tuning StarCoder. Is this file for fine-tuning StarCoderBase into StarCoder?

https://github.com/bigcode-project/Megatron-LM/blob/finetune-starcoder/examples/finetune_bigcode_model.slurm

Where can I find (or generate) the files referenced by these paths in the script?

STARCODER_PATH=/fsx/boomcode/starcoder/
CHECKPOINT_PATH=/fsx/boomcode/starcoderpy/$SLURM_JOB_ID
TOKENIZER_FILE=/fsx/boomcode/tokenizer-starcoder/tokenizer.json
WEIGHTS_TRAIN=/fsx/boomcode/datamix_python/train_data_paths.txt.tmp
WEIGHTS_VALID=/fsx/boomcode/datamix_python/valid_data_paths.txt.tmp
DATA_PATH=/fsx/boomcode/tokenized/python/

I have the same question: where do I get or generate these files?

WEIGHTS_TRAIN=/fsx/boomcode/datamix_python/train_data_paths.txt.tmp
WEIGHTS_VALID=/fsx/boomcode/datamix_python/valid_data_paths.txt.tmp

BigCode org

To generate the data weights you can use this repo: https://github.com/bigcode-project/bigcode-data-mix#2---substitute-the-data-path.

For short trainings, or a non-distributed setup (one A100 in your case), using PEFT as shown here: https://github.com/bigcode-project/starcoder would be faster and easier to set up. Otherwise, full fine-tuning could be expensive: for reference, fine-tuning StarCoderBase on 35B Python tokens to get StarCoder took ~2 days on 512 GPUs. ABAP has much less data than Python, so it would take much less time, but full fine-tuning could still be slow on one A100.
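
If you go the PEFT route on a single A100, the rough shape is a LoRA adapter on top of the base model. Below is a minimal sketch, not the exact script from the starcoder repo: the checkpoint name, dtype, LoRA hyperparameters, and target module names are assumptions to adjust against that repo's fine-tuning script.

# Minimal LoRA sketch for one A100 80GB (assumptions: bigcode/starcoderbase,
# bf16 weights, generic LoRA hyperparameters; see the starcoder repo for the
# settings actually used there).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "bigcode/starcoderbase"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # ~31 GB of weights, fits on an 80GB card
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["c_attn", "c_proj"],  # attention projections in GPT-BigCode (names assumed)
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable

# From here, train with a standard Trainer or a custom loop over the
# tokenized ABAP text from The Stack.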

Thanks, but how do I generate the files in the "gpt2-preprocessed_content_with_meta_document" folder from the raw .parquet files?

BigCode org

You need to tokenize the data with Megatron-LM; see their README.
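
In case it helps, a rough sketch of that step: convert the .parquet files into the JSONL that Megatron-LM's tools/preprocess_data.py expects, then run the preprocessing script. The "content" column name and the exact preprocessing flags (especially the tokenizer arguments in the BigCode fork) are assumptions; check the fork's README for the real invocation.

# Sketch (assumptions noted above): build a JSONL file from raw .parquet
# shards that each have a "content" column, as in The Stack exports.
import glob
import json

import pandas as pd

with open("abap.jsonl", "w") as out:
    for path in sorted(glob.glob("abap_parquet/*.parquet")):
        df = pd.read_parquet(path, columns=["content"])
        for text in df["content"]:
            out.write(json.dumps({"content": text}) + "\n")

# Then tokenize with the Megatron-LM fork, roughly (verify the tokenizer
# flags against the fork's README before running):
#   python tools/preprocess_data.py \
#       --input abap.jsonl \
#       --output-prefix gpt2-preprocessed \
#       --tokenizer-type TokenizerFromFile \
#       --tokenizer-file /path/to/tokenizer.json \
#       --json-keys content \
#       --append-eod \
#       --workers 8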

What directory does CHECKPOINT_PATH refer to? Where can I find the CHECKPOINT_PATH for StarCoder?

BigCode org

It's the path where the checkpoints are saved; for us it was specific to our cluster.

christopher changed discussion status to closed
