Improvements to the training UX #2775
Merged
Many minor improvements to the user experience when training models. Overview of changes (see below for a description of each change):

- Patch `pycocotools` printing during bounding box evaluation
- Same mAP scale for pose and object detection metrics
- Typing fixes
- Logging to `learning_stats.csv` (log detector and pose model stats to different files)
- Non-zero starting epoch
- Detector training: evaluation loss
- Printing metrics during training
- Saving the best snapshot
- Resuming training from a given snapshot (via the `pytorch_config.yaml` file)
- Fixes
  - Fix PAF predictor running on MPS
  - Multi-GPU training: fix saving the state dict
### Patch `pycocotools` printing during bounding box evaluation
Evaluating object detection performance with `pycocotools` led to useless/confusing lines being printed. The `print` functions inside of `pycocotools` have been patched so these lines are no longer printed.
### Same mAP scale for pose and object detection metrics

Object detection mAP was reported between 0 and 1 (the default through `pycocotools`), while pose mAP was reported between 0 and 100. Both bounding box mAP and pose mAP are now reported between 0 and 100.
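The change is a pure rescaling; a minimal sketch of the idea (illustrative only, not the actual patch):

```python
def to_percent(ap: float) -> float:
    """Rescale a pycocotools-style AP in [0, 1] to the 0-100 scale used for pose mAP."""
    return 100.0 * ap
```

With this, a bounding box AP of 0.725 is reported as 72.5, directly comparable to the pose mAP values.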
### Typing fixes
When calling `train_network`, both `snapshot_path` and `detector_path` can be strings, paths or `None`.
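Arguments like these are typically normalized with a small helper along the following lines (a sketch; the helper name is illustrative, not DeepLabCut's actual implementation):

```python
from pathlib import Path
from typing import Optional, Union


def to_path(snapshot: Union[str, Path, None]) -> Optional[Path]:
    """Normalize a snapshot argument: strings and paths become Path, None passes through."""
    return None if snapshot is None else Path(snapshot)
```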
### Logging to `learning_stats.csv`

Both detector and pose model stats were logged to `learning_stats.csv`, so one would overwrite the other. This is no longer the case: pose model stats are logged to `learning_stats.csv`, and detector stats are logged to `learning_stats_detector.csv`.
### Non-zero starting epoch

When continuing to train a model, if the number of epochs given was larger than the starting epoch (the number of epochs for which the given weights were already trained), the model was only trained for `epochs - starting_epoch` epochs. So in the example below, the 2nd call to `train_network` would only train for 5 extra epochs. This was changed so the model is always trained for the number of epochs passed as an argument (so that in the example below, the model would be trained for 10 extra epochs, and the last snapshots output would be `snapshot-015.pt` and `snapshot-detector-015.pt`).
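The arithmetic behind the change can be sketched as follows (illustrative, not the PR's actual example or code):

```python
# Old behavior: resuming at `starting_epoch` with `epochs=E` only trained the
# model for E - starting_epoch extra epochs. New behavior: E extra epochs.

def extra_epochs_old(epochs: int, starting_epoch: int) -> int:
    """Epochs actually trained before this PR when resuming from `starting_epoch`."""
    return max(epochs - starting_epoch, 0)


def extra_epochs_new(epochs: int, starting_epoch: int) -> int:
    """Epochs actually trained after this PR: always the requested number."""
    return epochs


# Resuming from snapshot-005.pt (starting_epoch=5) with epochs=10:
assert extra_epochs_old(10, 5) == 5    # only 5 extra epochs were trained
assert extra_epochs_new(10, 5) == 10   # the full 10 extra epochs are now trained,
                                       # so the last snapshot is snapshot-015.pt
```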
### Detector training: evaluation loss

Torchvision object detection models cannot return both losses and predictions: it's one or the other. When evaluating during training, the predictions are used to obtain mAP/mAR metrics, so the loss is `nan`. To avoid any confusion, the validation loss (which is `NaN`, as we don't have it) is no longer printed.
### Printing metrics during training

Some visual improvements were made when logging metrics to the console during training.
When training detectors:
When training pose estimation models:
### Saving the best snapshot
Addresses #2663 to save the best snapshot during training. The best snapshot will be saved as `snapshot-best-XYZ.pt`, where `XYZ` is the number of epochs for which it was trained.
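The bookkeeping for this amounts to tracking the best evaluation score seen so far; a minimal sketch (the metric choice and zero-padded filename pattern are assumptions, not DeepLabCut's actual code):

```python
from typing import Optional, Tuple


def update_best(
    best_score: Optional[float], score: float, epoch: int
) -> Tuple[float, Optional[str]]:
    """After an eval epoch, return (new best score, snapshot filename to save or None).

    A snapshot is only saved when the new score improves on the best seen so far.
    """
    if best_score is None or score > best_score:
        return score, f"snapshot-best-{epoch:03}.pt"
    return best_score, None
```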
### Resuming Training from a Given Snapshot

Adds an option to the `pytorch_config.yaml` to resume training from an existing snapshot.
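Such a config entry might look like the following (a hedged sketch: the key name and the placeholder path are assumptions, not confirmed by this description; check the PR diff for the actual option):

```yaml
# Hypothetical sketch: resume training from an existing snapshot
resume_training_from: /path/to/snapshot-050.pt
```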
### Fix PAF predictor running on MPS

The PAF predictor would fail when running on MPS (macOS GPU), as `torch.round(...)` is not yet implemented for that backend. An easy fix was to run scripts with `PYTORCH_ENABLE_MPS_FALLBACK=1` set as an environment variable. This PR changes the operation to run with numpy, so that workaround is no longer needed.
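For reference, the old workaround amounted to setting the fallback variable before importing torch, e.g.:

```python
# Previous workaround (no longer needed after this PR): tell PyTorch to fall
# back to CPU for ops not yet implemented on MPS, such as torch.round.
import os

# Must be set before `import torch` for the fallback to take effect.
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"
```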
### Multi-GPU training: fix saving the state dict

Addresses issue #2749.