May I ask where the training code is, and does the training data include the ABAP language?

#11
by leelichen - opened

May I ask where the training code is, and does the training data include the ABAP language?

BigCode org

You can find the training codebase here: https://github.com/bigcode-project/Megatron-LM/tree/multi-query-attention
There's also a repo for fine-tuning with PEFT or DeepSpeed: https://github.com/bigcode-project/starcoder
We didn't include ABAP; you can find the full list of languages included in the training in Table 1 of the paper. But we do have ABAP in The Stack dataset if you want to try fine-tuning.
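
For reference, pulling the ABAP subset out of The Stack with the datasets library looks roughly like the sketch below. It is a minimal example under assumptions: the subset is assumed to live under data/abap, and streaming is used so the whole dataset isn't downloaded; check the bigcode/the-stack dataset card (and accept its terms) before running.

# Minimal sketch (not from the thread): load the ABAP subset of The Stack.
# Assumes the language directory is data/abap and that you are logged in
# to the Hugging Face Hub with access to bigcode/the-stack.
from datasets import load_dataset

abap = load_dataset(
    "bigcode/the-stack",
    data_dir="data/abap",  # assumed directory name for the ABAP subset
    split="train",
    streaming=True,        # stream instead of downloading everything
)

for example in abap.take(3):
    print(example["content"][:200])  # the source code lives in "content"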

Thank you very much for your reply.
Do you have any suitable parameter suggestions for fine-tuning?

BigCode org

You can start from the default parameters in the repo and tune them if needed.

If I use the ABAP data from The Stack dataset, how long would fine-tuning take? I have an A100 80GB machine.
I would like to do a budget assessment first.

I saw a file for fine-tuning StarCoder. Is this file for fine-tuning StarCoderBase into StarCoder?

https://github.com/bigcode-project/Megatron-LM/blob/finetune-starcoder/examples/finetune_bigcode_model.slurm

Where can I find (or generate) the files referenced by these paths in the script?

STARCODER_PATH=/fsx/boomcode/starcoder/
CHECKPOINT_PATH=/fsx/boomcode/starcoderpy/$SLURM_JOB_ID
TOKENIZER_FILE=/fsx/boomcode/tokenizer-starcoder/tokenizer.json
WEIGHTS_TRAIN=/fsx/boomcode/datamix_python/train_data_paths.txt.tmp
WEIGHTS_VALID=/fsx/boomcode/datamix_python/valid_data_paths.txt.tmp
DATA_PATH=/fsx/boomcode/tokenized/python/

I have the same question: where do I get or generate these files?

WEIGHTS_TRAIN=/fsx/boomcode/datamix_python/train_data_paths.txt.tmp
WEIGHTS_VALID=/fsx/boomcode/datamix_python/valid_data_paths.txt.tmp

BigCode org

To generate the data weights you can use this repo: https://github.com/bigcode-project/bigcode-data-mix#2---substitute-the-data-path.

For short trainings, or a non-distributed setup (one A100 in your case), using PEFT as shown here: https://github.com/bigcode-project/starcoder would be faster and easier to set up. Otherwise, full fine-tuning could be expensive: for reference, fine-tuning StarCoderBase on 35B Python tokens to get StarCoder took ~2 days on 512 GPUs. ABAP has much less data than Python, so it would take much less time, but full fine-tuning could still be slow on one A100.
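
If you go the PEFT route on a single A100, the rough shape is a LoRA adapter on top of the base model. Below is a minimal sketch, not the exact script from the starcoder repo: the checkpoint name, dtype, LoRA hyperparameters, and target module names are assumptions to adjust against that repo's fine-tuning script.

# Minimal LoRA sketch for one A100 80GB (assumptions: bigcode/starcoderbase,
# bf16 weights, generic LoRA hyperparameters; see the starcoder repo for the
# settings actually used there).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "bigcode/starcoderbase"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # ~31 GB of weights, fits on an 80GB card
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["c_attn", "c_proj"],  # attention projections in GPT-BigCode (names assumed)
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable

# From here, train with a standard Trainer or a custom loop over the
# tokenized ABAP text from The Stack.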

Thanks, but how do I generate the files in the "gpt2-preprocessed_content_with_meta_document" folder from the raw .parquet files?

BigCode org

You need to tokenize the data with Megatron-LM; see their README.
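
In case it helps, a rough sketch of that step: convert the .parquet files into the JSONL that Megatron-LM's tools/preprocess_data.py expects, then run the preprocessing script. The "content" column name and the exact preprocessing flags (especially the tokenizer arguments in the BigCode fork) are assumptions; check the fork's README for the real invocation.

# Sketch (assumptions noted above): build a JSONL file from raw .parquet
# shards that each have a "content" column, as in The Stack exports.
import glob
import json

import pandas as pd

with open("abap.jsonl", "w") as out:
    for path in sorted(glob.glob("abap_parquet/*.parquet")):
        df = pd.read_parquet(path, columns=["content"])
        for text in df["content"]:
            out.write(json.dumps({"content": text}) + "\n")

# Then tokenize with the Megatron-LM fork, roughly (verify the tokenizer
# flags against the fork's README before running):
#   python tools/preprocess_data.py \
#       --input abap.jsonl \
#       --output-prefix gpt2-preprocessed \
#       --tokenizer-type TokenizerFromFile \
#       --tokenizer-file /path/to/tokenizer.json \
#       --json-keys content \
#       --append-eod \
#       --workers 8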

What directory does CHECKPOINT_PATH refer to? Where can I find the CHECKPOINT_PATH for StarCoder?

BigCode org

It's the path where the checkpoints are saved; for us it was specific to our cluster.

christopher changed discussion status to closed
