Improvements to the training UX #2775
Merged
Many minor improvements to the user experience when training models. Overview of changes (see below for a description of each change):

- Patch `pycocotools` printing during bounding box evaluation
- Same mAP scale for pose and object detection metrics
- Typing fixes
- Logging to `learning_stats.csv` (log detector and pose model stats to different files)
- Non-zero starting epoch
- Detector training: evaluation loss
- Printing metrics during training
- Saving the best snapshot
- Resuming training from a given snapshot (via the `pytorch_config.yaml` file)
- Fixes
  - Fix PAF predictor running on MPS
  - Multi-GPU training: fix saving the state dict
### Patch `pycocotools` printing during bounding box evaluation
Evaluating object detection performance with `pycocotools` led to useless/confusing lines being printed. The `print` functions inside of `pycocotools` have been patched so these lines are no longer printed.
### Same mAP scale for pose and object detection metrics

Object detection mAP was reported between 0 and 1 (the default through `pycocotools`), while pose mAP was reported between 0 and 100. Both bounding box mAP and pose mAP are now reported between 0 and 100.
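The change is a pure rescaling; a minimal sketch of the idea (illustrative only, not the actual patch):

```python
def to_percent(ap: float) -> float:
    """Rescale a pycocotools-style AP in [0, 1] to the 0-100 scale used for pose mAP."""
    return 100.0 * ap
```

With this, a bounding box AP of 0.725 is reported as 72.5, directly comparable to the pose mAP values.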
### Typing fixes
When calling `train_network`, both `snapshot_path` and `detector_path` can be strings, paths or `None`.
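Arguments like these are typically normalized with a small helper along the following lines (a sketch; the helper name is illustrative, not DeepLabCut's actual implementation):

```python
from pathlib import Path
from typing import Optional, Union


def to_path(snapshot: Union[str, Path, None]) -> Optional[Path]:
    """Normalize a snapshot argument: strings and paths become Path, None passes through."""
    return None if snapshot is None else Path(snapshot)
```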
### Logging to `learning_stats.csv`

Both detector and pose model stats were logged to `learning_stats.csv`, so one would overwrite the other. This is no longer the case: pose model stats are logged to `learning_stats.csv`, and detector stats are logged to `learning_stats_detector.csv`.
### Non-zero starting epoch

When continuing to train a model, if the number of epochs given was larger than the starting epoch (the number of epochs for which the given weights were already trained), the model was only trained for `epochs - starting_epoch` epochs. So in the example below, the 2nd call to `train_network` would only train for 5 extra epochs. This was changed so the model is always trained for the number of epochs passed as an argument (so that in the example below, the model would be trained for 10 extra epochs, and the last snapshots output would be `snapshot-015.pt` and `snapshot-detector-015.pt`).
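The arithmetic behind the change can be sketched as follows (illustrative, not the PR's actual example or code):

```python
# Old behavior: resuming at `starting_epoch` with `epochs=E` only trained the
# model for E - starting_epoch extra epochs. New behavior: E extra epochs.

def extra_epochs_old(epochs: int, starting_epoch: int) -> int:
    """Epochs actually trained before this PR when resuming from `starting_epoch`."""
    return max(epochs - starting_epoch, 0)


def extra_epochs_new(epochs: int, starting_epoch: int) -> int:
    """Epochs actually trained after this PR: always the requested number."""
    return epochs


# Resuming from snapshot-005.pt (starting_epoch=5) with epochs=10:
assert extra_epochs_old(10, 5) == 5    # only 5 extra epochs were trained
assert extra_epochs_new(10, 5) == 10   # the full 10 extra epochs are now trained,
                                       # so the last snapshot is snapshot-015.pt
```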
### Detector training: evaluation loss

Torchvision object detection models cannot return both losses and predictions: it's one or the other. When evaluating during training, the predictions are used to obtain mAP/mAR metrics, so the loss is `nan`. To avoid any confusion, the validation loss (which is `NaN`, as we don't have it) is no longer printed.
### Printing metrics during training

Some visual improvements were made when logging metrics to the console during training.
When training detectors:
When training pose estimation models:
### Saving the best snapshot
Addresses #2663 to save the best snapshot during training. The best snapshot will be saved as `snapshot-best-XYZ.pt`, where `XYZ` is the number of epochs for which it was trained.
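The bookkeeping for this amounts to tracking the best evaluation score seen so far; a minimal sketch (the metric choice and zero-padded filename pattern are assumptions, not DeepLabCut's actual code):

```python
from typing import Optional, Tuple


def update_best(
    best_score: Optional[float], score: float, epoch: int
) -> Tuple[float, Optional[str]]:
    """After an eval epoch, return (new best score, snapshot filename to save or None).

    A snapshot is only saved when the new score improves on the best seen so far.
    """
    if best_score is None or score > best_score:
        return score, f"snapshot-best-{epoch:03}.pt"
    return best_score, None
```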
### Resuming Training from a Given Snapshot

Adds an option to the `pytorch_config.yaml` to resume training from an existing snapshot.
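Such a config entry might look like the following (a hedged sketch: the key name and the placeholder path are assumptions, not confirmed by this description; check the PR diff for the actual option):

```yaml
# Hypothetical sketch: resume training from an existing snapshot
resume_training_from: /path/to/snapshot-050.pt
```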
### Fix PAF predictor running on MPS

The PAF predictor would fail when running on MPS (macOS GPU), as `torch.round(...)` is not yet implemented for that backend. An easy fix was to run scripts with `PYTORCH_ENABLE_MPS_FALLBACK=1` set as an environment variable. This PR changes the operation to run with numpy, so that workaround is no longer needed.
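For reference, the old workaround amounted to setting the fallback variable before importing torch, e.g.:

```python
# Previous workaround (no longer needed after this PR): tell PyTorch to fall
# back to CPU for ops not yet implemented on MPS, such as torch.round.
import os

# Must be set before `import torch` for the fallback to take effect.
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"
```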
### Multi-GPU training: fix saving the state dict

Addresses issue #2749.