Description
Is there an existing issue for this?
- I have searched the existing issues
Bug description
I installed DeepLabCut via the conda YAML file (pip install "git+https://github.com/DeepLabCut/DeepLabCut.git@pytorch_dlc#egg=deeplabcut[gui,modelzoo,wandb]") and started my first multi-animal project, using the hrnet_w18 model type.
While training the network, I saw "Training for epoch 1 done, starting evaluation", followed by a failure with the error message below:
```
File /home/username/anaconda3/envs/deeplabcut-2.3.10/lib/python3.10/site-packages/deeplabcut/pose_estimation_pytorch/apis/train.py:189, in train(loader, run_config, task, device, gpus, logger_config, snapshot_path, transform, inference_transform, max_snapshots_to_keep)
186 else:
187 logging.info("\nStarting pose model training...\n" + (50 * "-"))
--> 189 runner.fit(
190 train_dataloader,
191 valid_dataloader,
192 epochs=run_config["train_settings"]["epochs"],
193 display_iters=run_config["train_settings"]["display_iters"],
194 )
File /home/username/anaconda3/envs/deeplabcut-2.3.10/lib/python3.10/site-packages/deeplabcut/pose_estimation_pytorch/runners/train.py:181, in TrainingRunner.fit(self, train_loader, valid_loader, epochs, display_iters)
179 with torch.no_grad():
180 logging.info(f"Training for epoch {e} done, starting evaluation")
--> 181 valid_loss = self._epoch(
182 valid_loader, mode="eval", display_iters=display_iters
183 )
184 if self._print_valid_loss:
185 msg += f", valid loss {float(valid_loss):.5f}"
File /home/username/anaconda3/envs/deeplabcut-2.3.10/lib/python3.10/site-packages/deeplabcut/pose_estimation_pytorch/runners/train.py:221, in TrainingRunner._epoch(self, loader, mode, display_iters)
219 loss_metrics = defaultdict(list)
220 for i, batch in enumerate(loader):
--> 221 losses_dict = self.step(batch, mode)
222 epoch_loss.append(losses_dict["total_loss"])
224 for key in losses_dict.keys():
File /home/username/anaconda3/envs/deeplabcut-2.3.10/lib/python3.10/site-packages/deeplabcut/pose_estimation_pytorch/runners/train.py:346, in PoseTrainingRunner.step(self, batch, mode)
337 self._update_epoch_predictions(
338 name="bodyparts",
339 paths=batch["path"],
(...)
343 scales=batch["scales"],
344 )
345 if "unique_bodypart" in predictions:
--> 346 self._update_epoch_predictions(
347 name="unique_bodyparts",
348 paths=batch["path"],
349 gt_keypoints=batch["annotations"]["keypoints_unique"],
350 pred_keypoints=predictions["unique_bodypart"]["poses"],
351 offsets=batch["offsets"],
352 scales=batch["scales"],
353 )
355 return {k: v.detach().cpu().numpy() for k, v in losses_dict.items()}
File /home/username/anaconda3/envs/deeplabcut-2.3.10/lib/python3.10/site-packages/deeplabcut/pose_estimation_pytorch/runners/train.py:422, in PoseTrainingRunner._update_epoch_predictions(self, name, paths, gt_keypoints, pred_keypoints, scales, offsets)
420 keypoints = ground_truth[batch_id]
421 for kpts in keypoints:
--> 422 vis = kpts[-1]
423 if vis < 0:
424 kpts[-1] = 0
IndexError: invalid index to scalar variable.
```
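For context, this is the error NumPy raises when a 0-d scalar is indexed: if the ground-truth array has one dimension fewer than the loop expects, iterating over it yields scalars instead of (x, y, visibility) rows. A minimal sketch, independent of DeepLabCut, that reproduces the same message:

```python
import numpy as np

# 2-D ground truth (one (x, y, visibility) row per keypoint): iterating
# yields 1-D rows, so row[-1] is the visibility flag and works fine.
keypoints_2d = np.array([[10.0, 20.0, 1.0], [30.0, 40.0, -1.0]])
for kpts in keypoints_2d:
    vis = kpts[-1]  # OK: last element of a 1-D row

# 1-D ground truth (one dimension fewer): iterating yields 0-d NumPy
# scalars, and indexing a scalar raises the error from the traceback above.
keypoints_1d = np.array([10.0, 20.0, -1.0])
for kpts in keypoints_1d:
    vis = kpts[-1]  # IndexError: invalid index to scalar variable.
```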
The same error happened after I increased the number of unique points to 2.
After that, I tried again in the debugger and found, around deeplabcut/pose_estimation_pytorch/runners/train.py:422, that the dimensionality of keypoints differs between regular body parts and unique bodyparts. As a result, kpts[-1] cannot handle both cases.
Training recovered after I bypassed that line.
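For anyone hitting this before a proper fix, below is a minimal sketch of a more general guard, not the actual DeepLabCut code: clamp_negative_visibility is a hypothetical helper, and the shapes are my assumption from what I saw in the debugger. The idea is to address the visibility flag via the last axis instead of indexing into each element, so the rank mismatch no longer matters:

```python
import numpy as np

def clamp_negative_visibility(keypoints: np.ndarray) -> np.ndarray:
    """Hypothetical helper: set negative visibility flags to 0 for any rank.

    Assumes the last axis holds (x, y, visibility). Regular bodyparts seem to
    arrive as (num_individuals, num_keypoints, 3), while the unique-bodypart
    ground truth appears to come with one dimension fewer, which is what makes
    kpts[-1] fail in the existing per-keypoint loop.
    """
    kpts = np.asarray(keypoints, dtype=float)
    vis = kpts[..., -1]                          # visibility, regardless of rank
    kpts[..., -1] = np.where(vis < 0, 0.0, vis)  # clamp negative flags to 0
    return kpts
```

With a guard like this, the regular-bodypart and unique-bodypart arrays could go through the same code path in _update_epoch_predictions.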
Operating System
- operating system: Ubuntu 20.04 LTS
DeepLabCut version
- dlc version: 3.0.0rc2
DeepLabCut mode
multi animal
Device type
- gpu: Quadro RTX 6000
Steps To Reproduce
No response
Relevant log output
No response
Anything else?
No response
Code of Conduct
- I agree to follow this project's Code of Conduct