DLC3 MA, hrnet_w18, epoch predictions broke with unique_bodyparts #2667

@Zelin2001

Is there an existing issue for this?

  • I have searched the existing issues

Bug description

I installed DeepLabCut using the conda YAML file (pip install "git+https://github.com/DeepLabCut/DeepLabCut.git@pytorch_dlc#egg=deeplabcut[gui,modelzoo,wandb]") and started my first multi-animal project with the hrnet_w18 model type.

While training the network, I saw "Training for epoch 1 done, starting evaluation", followed by a failure with this error message:

File /home/username/anaconda3/envs/deeplabcut-2.3.10/lib/python3.10/site-packages/deeplabcut/pose_estimation_pytorch/apis/train.py:189, in train(loader, run_config, task, device, gpus, logger_config, snapshot_path, transform, inference_transform, max_snapshots_to_keep)
    186 else:
    187     logging.info("\nStarting pose model training...\n" + (50 * "-"))
--> 189 runner.fit(
    190     train_dataloader,
    191     valid_dataloader,
    192     epochs=run_config["train_settings"]["epochs"],
    193     display_iters=run_config["train_settings"]["display_iters"],
    194 )

File /home/username/anaconda3/envs/deeplabcut-2.3.10/lib/python3.10/site-packages/deeplabcut/pose_estimation_pytorch/runners/train.py:181, in TrainingRunner.fit(self, train_loader, valid_loader, epochs, display_iters)
    179 with torch.no_grad():
    180     logging.info(f"Training for epoch {e} done, starting evaluation")
--> 181     valid_loss = self._epoch(
    182         valid_loader, mode="eval", display_iters=display_iters
    183     )
    184     if self._print_valid_loss:
    185         msg += f", valid loss {float(valid_loss):.5f}"

File /home/username/anaconda3/envs/deeplabcut-2.3.10/lib/python3.10/site-packages/deeplabcut/pose_estimation_pytorch/runners/train.py:221, in TrainingRunner._epoch(self, loader, mode, display_iters)
    219 loss_metrics = defaultdict(list)
    220 for i, batch in enumerate(loader):
--> 221     losses_dict = self.step(batch, mode)
    222     epoch_loss.append(losses_dict["total_loss"])
    224     for key in losses_dict.keys():

File /home/username/anaconda3/envs/deeplabcut-2.3.10/lib/python3.10/site-packages/deeplabcut/pose_estimation_pytorch/runners/train.py:346, in PoseTrainingRunner.step(self, batch, mode)
    337     self._update_epoch_predictions(
    338         name="bodyparts",
    339         paths=batch["path"],
   (...)
    343         scales=batch["scales"],
    344     )
    345     if "unique_bodypart" in predictions:
--> 346         self._update_epoch_predictions(
    347             name="unique_bodyparts",
    348             paths=batch["path"],
    349             gt_keypoints=batch["annotations"]["keypoints_unique"],
    350             pred_keypoints=predictions["unique_bodypart"]["poses"],
    351             offsets=batch["offsets"],
    352             scales=batch["scales"],
    353         )
    355 return {k: v.detach().cpu().numpy() for k, v in losses_dict.items()}

File /home/username/anaconda3/envs/deeplabcut-2.3.10/lib/python3.10/site-packages/deeplabcut/pose_estimation_pytorch/runners/train.py:422, in PoseTrainingRunner._update_epoch_predictions(self, name, paths, gt_keypoints, pred_keypoints, scales, offsets)
    420 keypoints = ground_truth[batch_id]
    421 for kpts in keypoints:
--> 422     vis = kpts[-1]
    423     if vis < 0:
    424         kpts[-1] = 0

IndexError: invalid index to scalar variable.

The same error happened after I increased the number of unique points to 2.

I then re-ran training in a debugger and found that, around deeplabcut/pose_estimation_pytorch/runners/train.py:422, the dimensionality of keypoints differs between body parts and unique body parts. As a result, kpts[-1] cannot handle both cases.

Training ran normally once I bypassed this line.
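
The mismatch can be illustrated with a minimal NumPy sketch. The array shapes below are assumptions chosen for illustration, not the shapes the runner is confirmed to pass, and clamp_negative_visibility is a hypothetical helper that is not part of DeepLabCut. The first part shows how iterating over an array with one dimension fewer than expected yields scalars, so indexing with [-1] raises an IndexError. The second part shows one shape-agnostic way to zero out negative visibility flags, assuming the flag is the last entry along the last axis.

```python
import numpy as np

# Assumed layout: each keypoint is (x, y, visibility).
# If the array has one dimension fewer than the loop expects,
# iterating yields NumPy scalars, and scalar[-1] raises IndexError.
flat = np.array([10.0, 20.0, -1.0])
raised = False
try:
    for kpts in flat:
        _ = kpts[-1]  # kpts is a NumPy scalar here, not an array
except IndexError:
    raised = True


def clamp_negative_visibility(keypoints: np.ndarray) -> np.ndarray:
    """Hypothetical helper: return a copy of `keypoints` in which negative
    visibility flags (last entry of the last axis) are set to 0.

    Works for any array with at least one dimension, so the same call
    handles differently nested keypoint arrays.
    """
    out = np.array(keypoints, dtype=float, copy=True)
    if out.ndim == 0:
        raise ValueError("expected an array with at least one dimension")
    # Ellipsis indexing addresses the last axis regardless of nesting depth.
    out[..., -1] = np.maximum(out[..., -1], 0)
    return out


# Usage with two assumed shapes: (individuals, keypoints, 3) and (keypoints, 3).
nested = np.array([[[1.0, 2.0, -1.0], [3.0, 4.0, 2.0]]])
unique = np.array([[5.0, 6.0, -1.0]])
nested_clamped = clamp_negative_visibility(nested)
unique_clamped = clamp_negative_visibility(unique)
```

Indexing with the ellipsis avoids a per-keypoint Python loop, so the code does not depend on how deeply the keypoints are nested. Whether this matches the intended behaviour of _update_epoch_predictions would need to be checked against the actual ground-truth shapes.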

Operating System

  • operating system: Ubuntu 20.04 LTS

DeepLabCut version

  • dlc version: 3.0.0rc2

DeepLabCut mode

multi animal

Device type

  • gpu: Quadro RTX 6000

Steps To Reproduce

No response

Relevant log output

No response

Anything else?

No response
