The code doesn't seem to know when user specifies "gpus: [0, 1, 2, 3]" in the pytorch_config.yaml (PyTorch engine) #2744

@thuann2cats

Description

Is there an existing issue for this?

  • I have searched the existing issues

Bug description

Hi DLC team, I’m using DLC 3.0.0rc4 with the PyTorch engine, and multi-GPU training does not seem to work: the “gpus” list from pytorch_config.yaml never reaches the training runner.
Details below.

If my pytorch_config.yaml is like this:

…
runner:
  type: PoseTrainingRunner
  gpus:
  - 0
  - 1
  - 2
  - 3
  key_metric: test.mAP
  key_metric_asc: true
  eval_interval: 300
…

I can verify that at line 73 of /lib/python3.10/site-packages/deeplabcut/pose_estimation_pytorch/apis/train.py, the run_config object contains the [0, 1, 2, 3] list of GPU IDs, as shown below:

run_config

…
'model':{'backbone': {'type': 'ResNet', 'model_name': 'resnet50_gn', 'output_stride': 16, 'freeze_bn_stats': True, 'freeze_bn_weights': False}, 'backbone_output_channels': 2048, 'heads': {'bodypart': {...}}}
'net_type':'resnet_50'
'runner':{'type': 'PoseTrainingRunner', 'gpus': [0, 1, 2, 3], 'key_metric': 'test.mAP', 'key_metric_asc': True, 'eval_interval': 300, 'optimizer': {'type': 'AdamW', 'params': {...}}, 'scheduler': {'type': 'LRListScheduler', 'params': {...}}, 'snapshots': {'max_snapshots': 5, 'save_epochs': 25, 'save_optimizer_state': False}}
'train_settings':{'batch_size': 8, 'dataloader_workers': 0, 'dataloader_pin_memory': False, 'display_iters': 100, 'epochs': 200, 'seed': 42}
len():8

Execution then reaches line 554 of /lib/python3.10/site-packages/deeplabcut/pose_estimation_pytorch/runners/train.py, where the runner is built. At that point the “gpus” argument should hold the list of GPU IDs, but it is still None, so the runner never wraps the model in DataParallel().

def build_training_runner(
    runner_config: dict,
    model_folder: Path,
    task: Task,
    model: nn.Module,
    device: str,
    gpus: list[int] | None = None,
    snapshot_path: str | None = None,
    logger: BaseLogger | None = None,
) -> TrainingRunner:
    """
    Build a runner object according to a pytorch configuration file

    Args:
        runner_config: the configuration for the runner
        model_folder: the folder where models should be saved
        task: the task the runner will perform
        model: the model to run
        device: the device to use (e.g. {'cpu', 'cuda:0', 'mps'})
        gpus: the list of GPU indices to use for multi-GPU training
        snapshot_path: the snapshot from which to load the weights
        logger: the logger to use, if any

    Returns:
        the runner that was built
    """
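
The symptom can be illustrated without DeepLabCut or PyTorch. The sketch below uses a hypothetical stand-in, build_runner_stub, that mirrors only the runner_config and gpus parameters of the signature above; it is not the library's code, and the returned strings are placeholders for the real multi-GPU setup. It shows that a "gpus" entry inside runner_config has no effect unless something copies it into the separate gpus argument.

```python
from typing import Optional


def build_runner_stub(runner_config: dict, gpus: Optional[list[int]] = None) -> str:
    """Hypothetical stand-in for a runner builder (not DeepLabCut code).

    Like the signature quoted above, it only looks at the `gpus` argument
    and never reads runner_config["gpus"].
    """
    if gpus:
        # Real code would wrap the model for multi-GPU training here,
        # e.g. with torch.nn.DataParallel(model, device_ids=gpus).
        return f"multi-gpu on {gpus}"
    return "single-device"


# Same "gpus" entry as in the pytorch_config.yaml shown earlier.
runner_config = {"type": "PoseTrainingRunner", "gpus": [0, 1, 2, 3]}

# The caller does not forward the GPU list: the config entry is ignored.
without_forwarding = build_runner_stub(runner_config)

# The caller forwards the GPU list explicitly: multi-GPU branch is taken.
with_forwarding = build_runner_stub(runner_config, gpus=runner_config.get("gpus"))

print(without_forwarding)
print(with_forwarding)
```

If the real call site behaves like the first call, that would explain why “gpus” is still None inside the builder even though run_config holds the list.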

As a workaround, I added the following line at the very start of build_training_runner, so that the “gpus” argument takes its value from the runner config whenever GPU IDs are specified there:
gpus = runner_config["gpus"] if runner_config["gpus"] else gpus
I hope someone can take a look. Thanks!
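
One caveat with the one-line workaround: indexing runner_config["gpus"] raises a KeyError when the config has no “gpus” key at all. A slightly more defensive variant might look like the sketch below. The helper name resolve_gpus is hypothetical and not part of DeepLabCut; the precedence (config value first, then the function argument) simply follows the workaround above and is an assumption about the desired behavior.

```python
from typing import Optional


def resolve_gpus(
    runner_config: dict, gpus: Optional[list[int]] = None
) -> Optional[list[int]]:
    """Hypothetical helper: pick the GPU IDs to train on.

    Prefers a non-empty "gpus" list from the runner config and falls back
    to the explicitly passed `gpus` argument otherwise.
    """
    configured = runner_config.get("gpus")  # None if the key is missing
    if configured:  # skips both None and an empty list
        return list(configured)
    return gpus


# "gpus" set in the config: the config value is used.
print(resolve_gpus({"gpus": [0, 1, 2, 3]}))

# No "gpus" key in the config: the argument is used and no KeyError is raised.
print(resolve_gpus({"type": "PoseTrainingRunner"}, gpus=[0]))
```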

Operating System

NAME="Red Hat Enterprise Linux"
VERSION="9.4 (Plow)"
ID="rhel"
ID_LIKE="fedora"

DeepLabCut version

Loading DLC 3.0.0rc4...
DLC loaded in light mode; you cannot use any GUI (labeling, relabeling and standalone GUI)

DeepLabCut mode

single animal

Device type

gpu

Steps To Reproduce

Please see above

Relevant log output

No response

Anything else?

No response
