Description
Is there an existing issue for this?
- I have searched the existing issues
Bug description
Hi DLC team, I’m using the latest DLC version with the PyTorch engine. Multi-GPU training does not seem to work: the GPU IDs set in the config never reach the training runner.
Details below.
My pytorch_config.yaml contains:
…
runner:
  type: PoseTrainingRunner
  gpus:
  - 0
  - 1
  - 2
  - 3
  key_metric: test.mAP
  key_metric_asc: true
  eval_interval: 300
…
I can verify that when execution reaches line 73 of /lib/python3.10/site-packages/deeplabcut/pose_estimation_pytorch/apis/train.py, the run_config object contains the GPU ID list [0, 1, 2, 3], as seen below:
run_config
…
'model':{'backbone': {'type': 'ResNet', 'model_name': 'resnet50_gn', 'output_stride': 16, 'freeze_bn_stats': True, 'freeze_bn_weights': False}, 'backbone_output_channels': 2048, 'heads': {'bodypart': {...}}}
'net_type':'resnet_50'
'runner':{'type': 'PoseTrainingRunner', 'gpus': [0, 1, 2, 3], 'key_metric': 'test.mAP', 'key_metric_asc': True, 'eval_interval': 300, 'optimizer': {'type': 'AdamW', 'params': {...}}, 'scheduler': {'type': 'LRListScheduler', 'params': {...}}, 'snapshots': {'max_snapshots': 5, 'save_epochs': 25, 'save_optimizer_state': False}}
'train_settings':{'batch_size': 8, 'dataloader_workers': 0, 'dataloader_pin_memory': False, 'display_iters': 100, 'epochs': 200, 'seed': 42}
len():8
Execution eventually reaches line 554 of /lib/python3.10/site-packages/deeplabcut/pose_estimation_pytorch/runners/train.py, where the runner is built. At this point the “gpus” argument should hold the list of GPU IDs. However, “gpus” is still None, so the runner does not set up DataParallel().
def build_training_runner(
    runner_config: dict,
    model_folder: Path,
    task: Task,
    model: nn.Module,
    device: str,
    gpus: list[int] | None = None,
    snapshot_path: str | None = None,
    logger: BaseLogger | None = None,
) -> TrainingRunner:
    """
    Build a runner object according to a pytorch configuration file

    Args:
        runner_config: the configuration for the runner
        model_folder: the folder where models should be saved
        task: the task the runner will perform
        model: the model to run
        device: the device to use (e.g. {'cpu', 'cuda:0', 'mps'})
        gpus: the list of GPU indices to use for multi-GPU training
        snapshot_path: the snapshot from which to load the weights
        logger: the logger to use, if any

    Returns:
        the runner that was built
    """
As a workaround, I added this line at the very beginning of the function so that “gpus” takes its value from the runner config whenever GPU IDs are specified there:
gpus = runner_config["gpus"] if runner_config["gpus"] else gpus
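One caveat with that one-liner is that it raises a KeyError when the runner config has no "gpus" key at all. A slightly more defensive variant could look like the sketch below. `resolve_gpus` is a hypothetical helper name chosen for this example, not an existing DeepLabCut function, and it keeps the same precedence as the line above (the config wins over the argument).

```python
from __future__ import annotations


def resolve_gpus(runner_config: dict, gpus: list[int] | None = None) -> list[int] | None:
    """Hypothetical helper: pick the GPU list from the config, else the argument."""
    # .get() avoids a KeyError when "gpus" is missing from the config.
    configured = runner_config.get("gpus")
    if configured:
        # A non-empty list in the config takes precedence.
        return list(configured)
    # Missing, None, or empty in the config: fall back to the argument.
    return gpus
```

Whether the config or the explicit argument should take precedence is a design choice for the maintainers; the sketch only mirrors the behavior of my workaround.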
I hope someone can take a look. Thanks!
Operating System
NAME="Red Hat Enterprise Linux"
VERSION="9.4 (Plow)"
ID="rhel"
ID_LIKE="fedora"
DeepLabCut version
Loading DLC 3.0.0rc4...
DLC loaded in light mode; you cannot use any GUI (labeling, relabeling and standalone GUI)
DeepLabCut mode
single animal
Device type
gpu
Steps To Reproduce
Please see above
Relevant log output
No response
Anything else?
No response
Code of Conduct
- I agree to follow this project's Code of Conduct