|
## Train a model |
|
|
|
MMSegmentation implements both distributed and non-distributed training, which use `MMDistributedDataParallel` and `MMDataParallel` respectively.
|
|
|
All outputs (log files and checkpoints) will be saved to the working directory, which is specified by `work_dir` in the config file.
|
|
|
By default we evaluate the model on the validation set every few thousand iterations during training. You can change the evaluation interval by adding the `interval` argument to the training config:
|
|
|
```python |
|
evaluation = dict(interval=4000)  # This evaluates the model every 4000 iterations.
|
``` |
|
|
|
**\*Important\***: The default learning rate in config files is for 4 GPUs and 2 img/gpu (batch size = 4x2 = 8).
Equivalently, you may also use 8 GPUs and 1 img/gpu, since all models use cross-GPU SyncBN.
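For reference, the per-GPU batch size is controlled by the dataset settings in the config. A minimal sketch, assuming the usual MMSegmentation config layout (field names may differ between versions):

```python
data = dict(
    samples_per_gpu=2,  # images per GPU; with 4 GPUs this gives a total batch size of 8
    workers_per_gpu=2)  # dataloader workers per GPU
```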
|
|
|
To trade training speed for GPU memory, you may pass `--options model.backbone.with_cp=True` to enable gradient checkpointing in the backbone.
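For example, using the PSPNet config that also appears in the Slurm example below:

```shell
# Enable checkpointing in the backbone to reduce GPU memory usage at the cost of some speed
python tools/train.py configs/pspnet/pspnet_r50-d8_512x1024_40k_cityscapes.py --options model.backbone.with_cp=True
```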
|
|
|
### Train with a single GPU |
|
|
|
```shell |
|
python tools/train.py ${CONFIG_FILE} [optional arguments] |
|
``` |
|
|
|
If you want to specify the working directory in the command, you can add an argument `--work-dir ${YOUR_WORK_DIR}`. |
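For example, a single-GPU run with a custom working directory (the config is the PSPNet one used in the Slurm example below; the directory name is arbitrary):

```shell
python tools/train.py configs/pspnet/pspnet_r50-d8_512x1024_40k_cityscapes.py --work-dir work_dirs/pspnet_r50_cityscapes
```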
|
|
|
### Train with multiple GPUs |
|
|
|
```shell |
|
./tools/dist_train.sh ${CONFIG_FILE} ${GPU_NUM} [optional arguments] |
|
``` |
|
|
|
Optional arguments are: |
|
|
|
- `--no-validate` (**not suggested**): By default, the codebase performs evaluation every k iterations during training. To disable this behavior, use `--no-validate`.
|
- `--work-dir ${WORK_DIR}`: Override the working directory specified in the config file. |
|
- `--resume-from ${CHECKPOINT_FILE}`: Resume from a previous checkpoint file (to continue the training process). |
|
- `--load-from ${CHECKPOINT_FILE}`: Load weights from a checkpoint file (to start finetuning for another task). |
|
|
|
Difference between `resume-from` and `load-from`: |
|
|
|
- `resume-from` loads both the model weights and the optimizer state, including the iteration number (see the example after this list).
|
- `load-from` loads only the model weights and starts training from iteration 0.
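For instance, a sketch of resuming an interrupted 4-GPU run from its latest checkpoint (the work directory and checkpoint filename depend on your setup):

```shell
./tools/dist_train.sh configs/pspnet/pspnet_r50-d8_512x1024_40k_cityscapes.py 4 --resume-from work_dirs/pspnet_r50_cityscapes/latest.pth
```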
|
|
|
### Train with multiple machines |
|
|
|
If you run MMSegmentation on a cluster managed with [slurm](https://slurm.schedmd.com/), you can use the script `slurm_train.sh`. (This script also supports single machine training.) |
|
|
|
```shell |
|
[GPUS=${GPUS}] ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} --work-dir ${WORK_DIR} |
|
``` |
|
|
|
Here is an example of using 16 GPUs to train PSPNet on the dev partition. |
|
|
|
```shell |
|
GPUS=16 ./tools/slurm_train.sh dev pspr50 configs/pspnet/pspnet_r50-d8_512x1024_40k_cityscapes.py /nfs/xxxx/psp_r50_512x1024_40ki_cityscapes |
|
``` |
|
|
|
You can check [slurm_train.sh](../tools/slurm_train.sh) for full arguments and environment variables. |
|
|
|
If you have multiple machines that are simply connected with Ethernet, you can refer to the
PyTorch [launch utility](https://pytorch.org/docs/stable/distributed_deprecated.html#launch-utility).
It is usually slow if you do not have high-speed networking like InfiniBand.
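Below is a minimal sketch for two machines with 4 GPUs each, using the launch utility directly. It assumes `tools/train.py` accepts the `--launcher pytorch` flag used by the distributed launch scripts; replace `<MACHINE_0_IP>` with the address of the first machine.

```shell
# On machine 0 (rank 0)
python -m torch.distributed.launch --nnodes=2 --node_rank=0 --nproc_per_node=4 \
    --master_addr=<MACHINE_0_IP> --master_port=29500 \
    tools/train.py ${CONFIG_FILE} --launcher pytorch

# On machine 1 (rank 1), only the node rank changes
python -m torch.distributed.launch --nnodes=2 --node_rank=1 --nproc_per_node=4 \
    --master_addr=<MACHINE_0_IP> --master_port=29500 \
    tools/train.py ${CONFIG_FILE} --launcher pytorch
```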
|
|
|
### Launch multiple jobs on a single machine |
|
|
|
If you launch multiple jobs on a single machine, e.g., 2 jobs of 4-GPU training on a machine with 8 GPUs,
you need to specify different ports (29500 by default) for each job to avoid communication conflicts. Otherwise, there will be an error message saying `RuntimeError: Address already in use`.
|
|
|
If you use `dist_train.sh` to launch training jobs, you can set the port in the commands with the environment variable `PORT`.
|
|
|
```shell |
|
CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=29500 ./tools/dist_train.sh ${CONFIG_FILE} 4 |
|
CUDA_VISIBLE_DEVICES=4,5,6,7 PORT=29501 ./tools/dist_train.sh ${CONFIG_FILE} 4 |
|
``` |
|
|
|
If you use `slurm_train.sh` to launch training jobs, you can set the port in the commands with the environment variable `MASTER_PORT`.
|
|
|
```shell |
|
MASTER_PORT=29500 ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} |
|
MASTER_PORT=29501 ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} |
|
``` |
|
|