
@n-poulsen (Contributor) commented on Nov 7, 2024

Many minor improvements to the user experience when training models. Overview of the changes (each is described in detail below):

Fixes

Patch pycocotools printing during bounding box evaluation

Evaluating object detection performance with pycocotools led to useless/confusing lines being printed:

creating index...
index created!
Loading and preparing results...
Converting ndarray to lists...
(2198, 7)
...

The print functions inside pycocotools have been patched so these lines are no longer printed.
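
A minimal sketch of the idea, assuming the goal is simply to swallow pycocotools' console output (the actual patch in the PR may differ, e.g. it may replace the print functions directly):

import io
from contextlib import redirect_stdout

from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

def quiet_bbox_eval(gt: COCO, predictions: list[dict]) -> COCOeval:
    """Run bounding box evaluation without pycocotools printing to the console."""
    with redirect_stdout(io.StringIO()):
        dt = gt.loadRes(predictions)
        evaluation = COCOeval(gt, dt, iouType="bbox")
        evaluation.evaluate()
        evaluation.accumulate()
        evaluation.summarize()  # fills evaluation.stats; its printing is swallowed too
    return evaluation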

Same mAP scale for pose and object detection metrics

Object detection mAP was reported between 0 and 1 (the pycocotools default), while pose mAP was reported between 0 and 100. Both bounding box mAP and pose mAP are now reported between 0 and 100.
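
pycocotools stores the accumulated metrics in COCOeval.stats on a 0-1 scale, so the rescaling amounts to a multiplication. A sketch, reusing the hypothetical quiet_bbox_eval from above:

evaluation = quiet_bbox_eval(gt, predictions)
map_50_95 = 100 * evaluation.stats[0]  # mAP@50:95 on a 0-100 scale
map_50 = 100 * evaluation.stats[1]     # mAP@50 on a 0-100 scale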

Typing fixes

When calling train_network, both snapshot_path and detector_path can be strings, Paths, or None.
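
A sketch of the relevant part of the signature after the fix (hypothetical and heavily abridged; the real train_network takes many more parameters):

from pathlib import Path
from typing import Optional, Union

def train_network(
    config: Union[str, Path],
    shuffle: int = 1,
    snapshot_path: Optional[Union[str, Path]] = None,
    detector_path: Optional[Union[str, Path]] = None,
) -> None:
    ...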

Logging to learning_stats.csv

Both detector and pose model stats were logged to learning_stats.csv, so one would overwrite the other. This is no longer the case: pose model stats are logged to learning_stats.csv, and detector stats to learning_stats_detector.csv.
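
A minimal sketch of how the two log files can be kept apart (names assumed; not the PR's actual code):

import csv
from pathlib import Path

def stats_file(model_folder: Path, is_detector: bool) -> Path:
    """Detector stats now get their own file instead of overwriting the pose stats."""
    name = "learning_stats_detector.csv" if is_detector else "learning_stats.csv"
    return model_folder / name

def log_epoch_stats(model_folder: Path, is_detector: bool, row: dict) -> None:
    path = stats_file(model_folder, is_detector)
    write_header = not path.exists()
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(row))
        if write_header:
            writer.writeheader()
        writer.writerow(row)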

Non-zero starting epoch

When continuing to train a model, if the epochs argument was larger than the starting epoch (the number of epochs for which the given weights were already trained), the model was only trained for epochs - starting_epoch additional epochs. In the example below, the 2nd call to train_network would only train for 5 extra epochs. This was changed so the model is always trained for the number of epochs passed as an argument: in the example below, the model is now trained for 10 extra epochs, and the last snapshots output are snapshot-015.pt and snapshot-detector-015.pt.

import deeplabcut
from deeplabcut.pose_estimation_pytorch import DLCLoader

# Train the pose model and detector for 5 epochs; outputs snapshot-005.pt
# and snapshot-detector-005.pt.
deeplabcut.train_network("dlc-project/config.yaml", shuffle=0, epochs=5, detector_epochs=5)

# Resume from the epoch-5 snapshots. Previously this trained for only 5 more
# epochs (10 - 5); now it trains for the full 10 epochs passed as an argument.
loader = DLCLoader("dlc-project/config.yaml", shuffle=0)
deeplabcut.train_network(
    config="dlc-project/config.yaml",
    shuffle=0,
    epochs=10,
    detector_epochs=10,
    snapshot_path=loader.model_folder / "snapshot-005.pt",
    detector_path=loader.model_folder / "snapshot-detector-005.pt",
)
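
With the fix, the training loop counts from the snapshot's epoch and runs for the full number of requested epochs. A sketch with assumed names (train_one_epoch is a hypothetical helper):

def resume_training(model, train_loader, start_epoch: int, epochs: int) -> None:
    # Resuming from snapshot-005.pt means start_epoch == 5; with epochs == 10,
    # this trains epochs 6 through 15 and ends at snapshot-015.pt.
    for epoch in range(start_epoch + 1, start_epoch + epochs + 1):
        train_one_epoch(model, train_loader, epoch)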

Detector training - evaluation loss

Torchvision object detection models cannot return both losses and predictions: it's one or the other. When evaluating during training, the predictions are used to compute mAP/mAR metrics, so no validation loss is available. To avoid confusion, the validation loss (which would be NaN) is no longer printed.
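
This follows from how torchvision detection models behave: in training mode they return a loss dict, in evaluation mode a list of predictions, never both. For example:

import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(weights=None, num_classes=2)
images = [torch.rand(3, 256, 256)]
targets = [{"boxes": torch.tensor([[10.0, 10.0, 50.0, 50.0]]),
            "labels": torch.tensor([1])}]

model.train()
loss_dict = model(images, targets)  # dict of losses, no predictions

model.eval()
with torch.no_grad():
    predictions = model(images)  # list of prediction dicts, no losses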

Printing metrics during training

Some visual improvements were made when logging metrics to the console during training.

When training detectors:

# Previous
...
Training for epoch 20 done, starting evaluation
creating index...
index created!
...
Accumulating evaluation results...
DONE (t=0.00s).
Epoch 20 performance:
metrics/test.mAP@50:95:0.134
metrics/test.mAP@50:0.514
metrics/test.mAP@75:0.042
metrics/test.mAR@50:95:0.300
metrics/test.mAR@50:0.833
metrics/test.mAR@75:0.167
Epoch 20/200 (lr=0.0001), train loss 4.69560
...

# ** New **
...
Epoch 20/200 (lr=0.0001), train loss 4.69560
Model performance:
  metrics/test.mAP@50:95:  13.37
  metrics/test.mAP@50:     51.37
  metrics/test.mAP@75:      4.21
  metrics/test.mAR@50:95:  30.00
  metrics/test.mAR@50:     83.33
  metrics/test.mAR@75:     16.67
...

When training pose estimation models:

# Previous
Epoch 193 performance:
metrics/test.rmse:  5.50
metrics/test.rmse_pcutoff:5.27
metrics/test.mAP:   100.000
metrics/test.mAR:   100.000
metrics/test.rmse_detections:23.134
metrics/test.rmse_detections_pcutoff:14.039
Epoch 193/200 (lr=0.0001), train loss 0.00039, valid loss 0.00027

# ** New **
Epoch 193/200 (lr=0.0001), train loss 0.00039, valid loss 0.00027
Model performance:
  metrics/test.rmse:                      5.50
  metrics/test.rmse_pcutoff:              5.27
  metrics/test.mAP:                     100.00
  metrics/test.mAR:                     100.00
  metrics/test.rmse_detections:           5.50
  metrics/test.rmse_detections_pcutoff:   5.27
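
The aligned layout can be produced with width-based string formatting; a minimal sketch (not the PR's actual code):

def print_metrics(metrics: dict[str, float]) -> None:
    """Print metric names left-aligned and values right-aligned in one column."""
    width = max(len(name) for name in metrics) + 1
    print("Model performance:")
    for name, value in metrics.items():
        print(f"  {name + ':':<{width}}{value:8.2f}")

print_metrics({"metrics/test.rmse": 5.50, "metrics/test.mAP": 100.0})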

Saving the best snapshot

Addresses #2663 to save the best snapshot during training. The best snapshot will be saved as snapshot-best-XYZ.pt, where XYZ is the number of epochs for which it was trained.
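
A sketch of what the best-snapshot logic typically looks like (names assumed; see the PR for the actual implementation):

import torch

def save_if_best(model, model_folder, epoch: int, score: float, best_score: float) -> float:
    """Save snapshot-best-XYZ.pt when the validation score improves; return the new best."""
    if score > best_score:
        torch.save(model.state_dict(), model_folder / f"snapshot-best-{epoch:03}.pt")
        return score
    return best_score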

Resuming Training from a Given Snapshot

Adds an option to the pytorch_config.yaml to resume training from an existing snapshot, with:

...
detector:
    # weights from which to resume training the detector
    resume_training_from: /Users/john/.../train/snapshot-detector-250.pt 
...
# weights from which to resume training the pose model
resume_training_from: /Users/john/.../train/snapshot-100.pt

Fix PAF predictor running on MPS

The PAF predictor would fail when running on MPS (macOS GPU), as torch.round(...) is not yet implemented for the MPS backend. A workaround was to run scripts with the PYTORCH_ENABLE_MPS_FALLBACK=1 environment variable set. This change runs the operation with NumPy instead, so the workaround is no longer needed.
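
A minimal sketch of the idea, rounding through NumPy on the CPU so the operation works regardless of device (the PR's actual change may be structured differently):

import numpy as np
import torch

def round_tensor(t: torch.Tensor) -> torch.Tensor:
    """torch.round is not implemented for MPS, so round via NumPy and move back."""
    rounded = np.round(t.detach().cpu().numpy())
    return torch.from_numpy(rounded).to(t.device)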

Multi-GPU training: fix saving the state dict

Addresses issue #2749.
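
The issue text isn't quoted here, but a common pitfall in multi-GPU training is saving the state dict of the DataParallel/DistributedDataParallel wrapper, which prefixes every key with "module." and breaks loading later. A sketch of the usual fix for this class of bug (the PR's actual change may differ):

from torch import nn

def unwrapped_state_dict(model: nn.Module) -> dict:
    """Save the underlying module's weights so keys are not prefixed with 'module.'."""
    if isinstance(model, (nn.DataParallel, nn.parallel.DistributedDataParallel)):
        return model.module.state_dict()
    return model.state_dict()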

@n-poulsen requested a review from maximpavliv on November 7, 2024 10:15
@n-poulsen added the enhancement (New feature or request) label on Nov 7, 2024
@maximpavliv (Contributor) left a comment


Besides the variable names in the pytorch_config.md, this looks good to me!
