## Train a model
MMSegmentation implements both distributed and non-distributed training,
which use `MMDistributedDataParallel` and `MMDataParallel` respectively.
All outputs (log files and checkpoints) are saved to the working directory,
which is specified by `work_dir` in the config file.
By default, the model is evaluated on the validation set at a fixed iteration interval. You can change this interval by setting the `interval` argument in the training config.
```python
evaluation = dict(interval=4000)  # Evaluate the model every 4000 iterations.
```
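To see what the interval means in practice, the following sketch lists the iterations at which evaluation would run. It assumes a 40k-iteration schedule and that evaluation triggers whenever the number of completed iterations is a multiple of `interval`; `evaluation_iterations` is a hypothetical helper for illustration, not part of MMSegmentation.

```python
def evaluation_iterations(max_iters, interval):
    """Return the completed-iteration counts at which evaluation would run.

    Assumption for this sketch: evaluation fires when the number of
    completed iterations is a multiple of `interval`.
    """
    return list(range(interval, max_iters + 1, interval))


# Assumed schedule length of 40000 iterations with the interval shown above.
schedule = evaluation_iterations(max_iters=40000, interval=4000)
print(len(schedule), schedule[0], schedule[-1])
```

A smaller interval gives more frequent feedback but spends more total time on evaluation.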
**\*Important\***: The default learning rate in config files is for 4 GPUs and 2 imgs/gpu (batch size = 4x2 = 8).
Equivalently, you may also use 8 GPUs and 1 img/gpu, since all models use cross-GPU SyncBN.
To trade speed for GPU memory, you may pass in `--options model.backbone.with_cp=True` to enable checkpointing in the backbone.
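
If your total batch size differs from 8, the learning rate usually needs adjusting. A common heuristic, which this document does not prescribe, is the linear scaling rule: scale the learning rate in proportion to the total batch size. The sketch below assumes that rule; `scale_lr` and the base learning rate of 0.01 are illustrative assumptions rather than values taken from a particular config.

```python
def scale_lr(base_lr, num_gpus, imgs_per_gpu, base_batch_size=8):
    """Scale a learning rate linearly with the total batch size.

    Assumption: the linear scaling rule applies. `base_batch_size=8`
    corresponds to the default of 4 GPUs x 2 imgs/gpu.
    """
    total_batch_size = num_gpus * imgs_per_gpu
    return base_lr * total_batch_size / base_batch_size


# 8 GPUs x 1 img/gpu keeps the batch size at 8, so the rate is unchanged.
print(scale_lr(0.01, num_gpus=8, imgs_per_gpu=1))
# 8 GPUs x 2 imgs/gpu doubles the batch size, so the rate doubles.
print(scale_lr(0.01, num_gpus=8, imgs_per_gpu=2))
```

Treat the result as a starting point and verify it on the validation set, since the heuristic does not hold for every optimizer or schedule.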
### Train with a single GPU
```shell
python tools/train.py ${CONFIG_FILE} [optional arguments]
```
If you want to specify the working directory in the command, you can add the argument `--work-dir ${YOUR_WORK_DIR}`.
### Train with multiple GPUs
```shell
./tools/dist_train.sh ${CONFIG_FILE} ${GPU_NUM} [optional arguments]
```
Optional arguments are:
- `--no-validate` (**not suggested**): By default, the codebase performs evaluation every k iterations during training. To disable this behavior, use `--no-validate`.
- `--work-dir ${WORK_DIR}`: Override the working directory specified in the config file.
- `--resume-from ${CHECKPOINT_FILE}`: Resume from a previous checkpoint file (to continue the training process).
- `--load-from ${CHECKPOINT_FILE}`: Load weights from a checkpoint file (to start finetuning for another task).
Difference between `resume-from` and `load-from`:
- `resume-from` loads both the model weights and the optimizer state, including the iteration number, so training continues where it stopped.
- `load-from` loads only the model weights and starts training from iteration 0.
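
The difference can be illustrated with a toy checkpoint stored as a plain dictionary. This is only a sketch of the semantics described above, not MMSegmentation's loading code: the functions `load_from` and `resume_from` and the checkpoint keys (`state_dict`, `optimizer`, `meta`) are assumptions made for the example.

```python
def load_from(checkpoint):
    """Mimic `--load-from`: take the weights only and start at iteration 0."""
    return {
        "weights": dict(checkpoint["state_dict"]),
        "optimizer": None,  # optimizer state is not restored
        "iter": 0,
    }


def resume_from(checkpoint):
    """Mimic `--resume-from`: restore weights, optimizer state, and iteration."""
    return {
        "weights": dict(checkpoint["state_dict"]),
        "optimizer": dict(checkpoint["optimizer"]),
        "iter": checkpoint["meta"]["iter"],
    }


# Toy checkpoint; the keys and values are assumptions for this sketch.
checkpoint = {
    "state_dict": {"backbone.conv1.weight": [0.1, 0.2]},
    "optimizer": {"momentum_buffer": [0.01, 0.02]},
    "meta": {"iter": 20000},
}

finetune_state = load_from(checkpoint)
resumed_state = resume_from(checkpoint)
print(finetune_state["iter"], resumed_state["iter"])
```

In both cases the weights are identical; only the resumed state carries the optimizer state and the iteration counter forward.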
### Train with multiple machines
If you run MMSegmentation on a cluster managed with [slurm](https://slurm.schedmd.com/), you can use the script `slurm_train.sh`. (This script also supports single-machine training.)
```shell
[GPUS=${GPUS}] ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} --work-dir ${WORK_DIR}
```
Here is an example of using 16 GPUs to train PSPNet on the dev partition.
```shell
GPUS=16 ./tools/slurm_train.sh dev pspr50 configs/pspnet/pspnet_r50-d8_512x1024_40k_cityscapes.py --work-dir /nfs/xxxx/psp_r50_512x1024_40ki_cityscapes
```
You can check [slurm_train.sh](../tools/slurm_train.sh) for full arguments and environment variables.
If you only have multiple machines connected by Ethernet, you can refer to the
PyTorch [launch utility](https://pytorch.org/docs/stable/distributed_deprecated.html#launch-utility).
Training is usually slow without high-speed networking such as InfiniBand.
### Launch multiple jobs on a single machine
If you launch multiple jobs on a single machine, e.g., 2 jobs of 4-GPU training on a machine with 8 GPUs,
you need to specify a different port (29500 by default) for each job to avoid communication conflicts. Otherwise, you will get an error message saying `RuntimeError: Address already in use`.
If you use `dist_train.sh` to launch training jobs, you can set the port in the command with the environment variable `PORT`.
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=29500 ./tools/dist_train.sh ${CONFIG_FILE} 4
CUDA_VISIBLE_DEVICES=4,5,6,7 PORT=29501 ./tools/dist_train.sh ${CONFIG_FILE} 4
```
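The same pattern extends to any number of jobs: give each job a disjoint block of GPU ids and its own port. The sketch below generates these settings; `job_env` is a hypothetical helper, and assigning consecutive ports upward from 29500 is an assumption chosen to match the commands above.

```python
def job_env(job_index, gpus_per_job, base_port=29500):
    """Return environment settings for one job on a shared machine.

    Each job gets a disjoint block of GPU ids and a unique port, so that
    concurrent jobs do not collide on the communication port.
    """
    first_gpu = job_index * gpus_per_job
    gpu_ids = range(first_gpu, first_gpu + gpus_per_job)
    return {
        "CUDA_VISIBLE_DEVICES": ",".join(str(i) for i in gpu_ids),
        "PORT": str(base_port + job_index),
    }


# Print the launch command for 2 jobs of 4-GPU training on an 8-GPU machine.
for job_index in range(2):
    env = job_env(job_index, gpus_per_job=4)
    prefix = " ".join(f"{key}={value}" for key, value in env.items())
    print(f"{prefix} ./tools/dist_train.sh ${{CONFIG_FILE}} 4")
```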
If you use `slurm_train.sh` to launch training jobs, you can set the port in the command with the environment variable `MASTER_PORT`.
```shell
MASTER_PORT=29500 ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE}
MASTER_PORT=29501 ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE}
```
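Before choosing a port, it can help to check that nothing on the machine is already bound to it. The following is a minimal sketch using Python's standard `socket` module; `is_port_free` is a hypothetical helper, and it only checks the local interface given by `host` (assumed here to be `127.0.0.1`).

```python
import socket


def is_port_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can currently be bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Binding failed, typically because the address is already in use.
            return False
        return True


# Check the default port and the next one before launching two jobs.
for port in (29500, 29501):
    print(port, "free" if is_port_free(port) else "in use")
```

The check is only a snapshot: another process may claim the port between the check and the launch, so a failure with `Address already in use` is still possible.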