Training

From the previous tutorials, you may now have a custom model and a data loader. To run training, users typically prefer one of the following two styles:

Custom Training Loop

With a model and a data loader ready, everything else needed to write a training loop can be found in PyTorch, and you are free to write the training loop yourself. This style allows researchers to manage the entire training logic more clearly and have full control. One such example is provided in tools/plain_train_net.py.

Any customization of the training logic is then fully in the user's hands.

Trainer Abstraction

We also provide a standardized "trainer" abstraction with a hook system that helps simplify the standard training behavior. It includes the following two instantiations:

SimpleTrainer provides a minimal training loop for single-cost, single-optimizer, single-data-source training, with nothing else. Other tasks (checkpointing, logging, etc.) can be implemented using the hook system.
DefaultTrainer is a SimpleTrainer initialized from a yacs config, used by tools/train_net.py and many scripts. It includes more standard default behaviors that one might want to opt into, including default configurations for the optimizer, learning rate schedule, logging, evaluation, checkpointing, etc.

To customize a DefaultTrainer:

For simple customizations (e.g. changing the optimizer, evaluator, LR scheduler, data loader, etc.), override its methods in a subclass, just like tools/train_net.py does.

For extra tasks during training, check the hook system to see if it's supported.

As an example, to print hello during training:

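from detectron2.engine import HookBase

class HelloHook(HookBase):
  def after_step(self):
    # self.trainer is set when the hook is registered with a trainer
    if self.trainer.iter % 100 == 0:
      print(f"Hello at iteration {self.trainer.iter}!")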
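To make the mechanics concrete, here is a minimal, dependency-free sketch of how a trainer might drive hooks. ToyHook and ToyTrainer are hypothetical stand-ins written for illustration only. They borrow the callback names (before_train, before_step, after_step, after_train) and the register_hooks method name, but they are not detectron2's implementation, and the training step itself is left empty.

```python
class ToyHook:
    """Hypothetical stand-in for HookBase: every callback is a no-op by default."""

    trainer = None  # filled in by ToyTrainer.register_hooks

    def before_train(self):
        pass

    def after_train(self):
        pass

    def before_step(self):
        pass

    def after_step(self):
        pass


class ToyTrainer:
    """Hypothetical stand-in for a trainer: owns the loop, calls hooks around each step."""

    def __init__(self):
        self.iter = 0
        self.hooks = []

    def register_hooks(self, hooks):
        for hook in hooks:
            hook.trainer = self  # lets a hook read self.trainer.iter
            self.hooks.append(hook)

    def run_step(self):
        pass  # a real trainer would do forward, backward, and optimizer.step() here

    def train(self, start_iter, max_iter):
        for hook in self.hooks:
            hook.before_train()
        for self.iter in range(start_iter, max_iter):
            for hook in self.hooks:
                hook.before_step()
            self.run_step()
            for hook in self.hooks:
                hook.after_step()
        for hook in self.hooks:
            hook.after_train()


class HelloHook(ToyHook):
    """Same logic as the hook above, but collects messages so they can be inspected."""

    def __init__(self):
        self.messages = []

    def after_step(self):
        if self.trainer.iter % 100 == 0:
            self.messages.append(f"Hello at iteration {self.trainer.iter}!")


trainer = ToyTrainer()
hello = HelloHook()
trainer.register_hooks([hello])
trainer.train(start_iter=0, max_iter=250)
print(hello.messages)
```

The point of the design is that the loop never needs to know what a hook does: extra behavior is added by registering another object rather than by editing the loop.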

Using a trainer+hook system means there will always be some non-standard behaviors that cannot be supported, especially in research. For this reason, we intentionally keep the trainer & hook system minimal, rather than powerful. If anything cannot be achieved by such a system, it's easier to start from tools/plain_train_net.py to implement custom training logic manually.

Logging of Metrics

During training, detectron2 models and the trainer put metrics into a centralized EventStorage. You can use the following code to access it and log metrics to it:

from detectron2.utils.events import get_event_storage

# inside the model:
if self.training:
  value = ...  # compute the value from inputs
  storage = get_event_storage()
  storage.put_scalar("some_accuracy", value)

Refer to the EventStorage documentation for more details.

Metrics are then written to various destinations with EventWriter. DefaultTrainer enables a few EventWriters with default configurations. See above for how to customize them.
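The split between a central storage and separate writers can be sketched without any dependencies. Everything below (ToyStorage, ToyWriter, get_toy_storage, active) is a hypothetical stand-in invented for this illustration, not detectron2's EventStorage or EventWriter. It only assumes the general pattern described above: model code looks up the active storage and records scalars, and a writer periodically reads the latest values and sends them somewhere.

```python
from contextlib import contextmanager


class ToyStorage:
    """Hypothetical metric store: keeps (iteration, value) pairs per metric name."""

    def __init__(self, start_iter=0):
        self.iter = start_iter
        self.history = {}

    def put_scalar(self, name, value):
        self.history.setdefault(name, []).append((self.iter, float(value)))

    def step(self):
        self.iter += 1  # called once per training iteration

    def latest(self):
        return {name: points[-1][1] for name, points in self.history.items()}


_ACTIVE = []  # stack of active storages, so code anywhere can find the current one


def get_toy_storage():
    assert _ACTIVE, "no active storage; call this inside 'with active(...)'"
    return _ACTIVE[-1]


@contextmanager
def active(storage):
    _ACTIVE.append(storage)
    try:
        yield storage
    finally:
        _ACTIVE.pop()


class ToyWriter:
    """Hypothetical writer: every `period` iterations, formats the latest metrics."""

    def __init__(self, period, sink):
        self.period = period
        self.sink = sink  # any callable taking a string, e.g. print

    def write(self):
        storage = get_toy_storage()
        if (storage.iter + 1) % self.period == 0:
            metrics = ", ".join(
                f"{name}={value:.4f}" for name, value in sorted(storage.latest().items())
            )
            self.sink(f"iter {storage.iter}: {metrics}")


def model_forward(accuracy):
    # "Inside the model": no storage is passed in, it is looked up instead.
    get_toy_storage().put_scalar("some_accuracy", accuracy)


lines = []
with active(ToyStorage()) as storage:
    writer = ToyWriter(period=2, sink=lines.append)
    for accuracy in [0.5, 0.6, 0.7, 0.8]:
        model_forward(accuracy)
        writer.write()
        storage.step()

print(lines)
```

Because the model only talks to the storage, where metrics end up (console, file, dashboard) can be changed by swapping writers without touching the model code.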