Closed
Description
Is there an existing issue for this?
- I have searched the existing issues
Bug description
Hello,
while using DeepLabCut with the PyTorch engine, I ran into a problem with the dlcrnet_stride16_ms5 model. During training, when the first evaluation starts (after epoch 25 in the log below), a ValueError is raised with the message "matrix contains invalid numeric entries". Reducing the learning rate or changing the training batch size does not resolve it.
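For reference, the same message can be reproduced outside DeepLabCut. The sketch below is not DeepLabCut code: `try_assignment` is a hypothetical helper, and the two small cost matrices are made up for illustration. It only shows that `scipy.optimize.linear_sum_assignment` rejects a cost matrix containing NaN, which is the call that fails in the traceback.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def try_assignment(cost):
    """Return the column assignment, or the error message if SciPy rejects the matrix."""
    try:
        _, col_ind = linear_sum_assignment(cost)
        return col_ind.tolist()
    except ValueError as err:
        return str(err)


# Illustrative cost matrices (hypothetical values, not taken from the project).
valid = np.array([[1.0, 4.0], [3.0, 2.0]])
with_nan = np.array([[1.0, np.nan], [3.0, 2.0]])

print(try_assignment(valid))     # assignment for the finite matrix
print(try_assignment(with_nan))  # error message raised for the NaN entry
```

This suggests the learning rate and batch size are probably not the direct cause: the assignment step fails as soon as a single entry of the distance matrix is NaN.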
Operating System
Ubuntu 18.04
DeepLabCut version
DLC 3.0.0rc1
DeepLabCut mode
multi animal
Device type
gpu
Steps To Reproduce
1. Create a training dataset.
2. Use the following pytorch_config.yaml:
data:
  colormode: RGB
  inference:
    normalize_images: true
  train:
    affine:
      p: 0.5
      rotation: 30
      scaling:
      - 1.0
      - 1.0
      translation: 0
    collate:
      type: ResizeFromDataSizeCollate
      min_scale: 0.4
      max_scale: 1.0
      min_short_side: 128
      max_short_side: 1152
      multiple_of: 32
      to_square: false
    covering: false
    gaussian_noise: 12.75
    hist_eq: false
    motion_blur: false
    normalize_images: true
device: auto
metadata:
  project_path: /mnt/Data16Tb/Data/boyang/pose/MMVISV3-BRIAN-2024-06-19
  pose_config_path:
    /mnt/Data16Tb/Data/boyang/pose/MMVISV3-BRIAN-2024-06-19/dlc-models-pytorch/iteration-3/MMVISV3Jun19-trainset70shuffle0/train/pose_cfg.yaml
  bodyparts:
  - Front
  - Right
  - Middle
  - Left
  - FL1
  - BL1
  - FR1
  - BR1
  - BL2
  - BR2
  - FL2
  - FR2
  - Body1
  - Body2
  - Body3
  unique_bodyparts: []
  individuals:
  - MARMOSET_1
  - MARMOSET_2
  with_identity: true
method: bu
model:
  backbone:
    type: DLCRNet
    model_name: resnet50
    pretrained: true
    output_stride: 16
  backbone_output_channels: 2304
  pose_model:
    stride: 8
  heads:
    bodypart:
      type: DLCRNetHead
      predictor:
        type: PartAffinityFieldPredictor
        num_animals: 2
        num_multibodyparts: 15
        num_uniquebodyparts: 0
        nms_radius: 5
        sigma: 1.0
        locref_stdev: 7.2801
        min_affinity: 0.05
        graph: &id001
        - - 0
          - 1
        - - 0
          - 3
        - - 2
          - 3
        - - 1
          - 2
        - - 0
          - 2
        - - 2
          - 12
        - - 12
          - 13
        - - 13
          - 14
        - - 7
          - 14
        - - 7
          - 9
        - - 5
          - 8
        - - 5
          - 14
        - - 6
          - 12
        - - 6
          - 11
        - - 4
          - 12
        - - 4
          - 10
        edges_to_keep:
        - 0
        - 1
        - 2
        - 3
        - 4
        - 5
        - 6
        - 7
        - 8
        - 9
        - 10
        - 11
        - 12
        - 13
        - 14
        - 15
      target_generator:
        type: SequentialGenerator
        generators:
        - type: HeatmapPlateauGenerator
          num_heatmaps: 15
          pos_dist_thresh: 17
          heatmap_mode: KEYPOINT
          generate_locref: true
          locref_std: 7.2801
        - type: PartAffinityFieldGenerator
          graph: *id001
          width: 20
      criterion:
        heatmap:
          type: WeightedBCECriterion
          weight: 1.0
        locref:
          type: WeightedHuberCriterion
          weight: 0.05
        paf:
          type: WeightedHuberCriterion
          weight: 0.1
      heatmap_config:
        channels:
        - 2304
        - 15
        kernel_size:
        - 3
        strides:
        - 2
      locref_config:
        channels:
        - 2304
        - 30
        kernel_size:
        - 3
        strides:
        - 2
      paf_config:
        channels:
        - 2304
        - 32
        kernel_size:
        - 3
        strides:
        - 2
      num_stages: 5
    identity:
      type: HeatmapHead
      predictor:
        type: HeatmapPredictor
        location_refinement: false
      target_generator:
        type: HeatmapPlateauGenerator
        num_heatmaps: 2
        pos_dist_thresh: 17
        heatmap_mode: INDIVIDUAL
        generate_locref: false
      criterion:
        heatmap:
          type: WeightedBCECriterion
          weight: 1.0
      heatmap_config:
        channels:
        - 2304
        - 2
        kernel_size:
        - 3
        strides:
        - 2
net_type: dlcrnet_stride16_ms5
runner:
  type: PoseTrainingRunner
  gpus:
  key_metric: test.mAP
  key_metric_asc: true
  eval_interval: 25
  optimizer:
    type: AdamW
    params:
      lr: 0.0001
  scheduler:
    type: LRListScheduler
    params:
      lr_list:
      - - 1e-05
      - - 1e-06
      milestones:
      - 90
      - 120
  snapshots:
    max_snapshots: 5
    save_epochs: 50
    save_optimizer_state: false
train_settings:
  batch_size: 16
  dataloader_workers: 0
  dataloader_pin_memory: true
  display_iters: 1000
  epochs: 200
  seed: 42
3. Run train_network.
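To see which learning rate is active when the failure occurs, the schedule in the configuration can be sketched as below. This is a rough illustration, not the LRListScheduler implementation: `lr_at_epoch` and `evaluation_epochs` are hypothetical helpers, and the boundary behaviour (the new rate applies once the epoch reaches a milestone, and an evaluation runs every `eval_interval` epochs) is an assumption.

```python
def lr_at_epoch(epoch, base_lr, milestones, lr_list):
    """Learning rate in effect at a 1-based epoch.

    Assumed semantics (not checked against the LRListScheduler source):
    lr_list[i][0] replaces the learning rate once epoch >= milestones[i].
    """
    lr = base_lr
    for milestone, lrs in zip(milestones, lr_list):
        if epoch >= milestone:
            lr = lrs[0]
    return lr


def evaluation_epochs(epochs, eval_interval):
    """Epochs after which an evaluation runs, assuming one every eval_interval epochs."""
    return [e for e in range(1, epochs + 1) if e % eval_interval == 0]


# Values copied from the pytorch_config.yaml in step 2.
BASE_LR = 0.0001
MILESTONES = [90, 120]
LR_LIST = [[1e-05], [1e-06]]

first_eval = evaluation_epochs(epochs=200, eval_interval=25)[0]
print(first_eval, lr_at_epoch(first_eval, BASE_LR, MILESTONES, LR_LIST))
```

Under these assumptions the first evaluation falls at epoch 25, well before the first milestone at epoch 90, so the run is still at the initial learning rate when the error appears.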
Relevant log output
Training with configuration:
data:
  colormode: RGB
  inference:
    normalize_images: True
  train:
    affine:
      p: 0.5
      rotation: 30
      scaling: [1.0, 1.0]
      translation: 0
    collate:
      type: ResizeFromDataSizeCollate
      min_scale: 0.4
      max_scale: 1.0
      min_short_side: 128
      max_short_side: 1152
      multiple_of: 32
      to_square: False
    covering: False
    gaussian_noise: 12.75
    hist_eq: False
    motion_blur: False
    normalize_images: True
device: auto
metadata:
  project_path: /mnt/Data16Tb/Data/boyang/pose/MMVISV3-BRIAN-2024-06-19
  pose_config_path: /mnt/Data16Tb/Data/boyang/pose/MMVISV3-BRIAN-2024-06-19/dlc-models-pytorch/iteration-3/MMVISV3Jun19-trainset70shuffle0/train/pose_cfg.yaml
  bodyparts: ['Front', 'Right', 'Middle', 'Left', 'FL1', 'BL1', 'FR1', 'BR1', 'BL2', 'BR2', 'FL2', 'FR2', 'Body1', 'Body2', 'Body3']
  unique_bodyparts: []
  individuals: ['MARMOSET_1', 'MARMOSET_2']
  with_identity: True
method: bu
model:
  backbone:
    type: DLCRNet
    model_name: resnet50
    pretrained: True
    output_stride: 16
  backbone_output_channels: 2304
  pose_model:
    stride: 8
  heads:
    bodypart:
      type: DLCRNetHead
      predictor:
        type: PartAffinityFieldPredictor
        num_animals: 2
        num_multibodyparts: 15
        num_uniquebodyparts: 0
        nms_radius: 5
        sigma: 1.0
        locref_stdev: 7.2801
        min_affinity: 0.05
        graph: [[0, 1], [0, 3], [2, 3], [1, 2], [0, 2], [2, 12], [12, 13], [13, 14], [7, 14], [7, 9], [5, 8], [5, 14], [6, 12], [6, 11], [4, 12], [4, 10]]
        edges_to_keep: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
      target_generator:
        type: SequentialGenerator
        generators: [{'type': 'HeatmapPlateauGenerator', 'num_heatmaps': 15, 'pos_dist_thresh': 17, 'heatmap_mode': 'KEYPOINT', 'generate_locref': True, 'locref_std': 7.2801}, {'type': 'PartAffinityFieldGenerator', 'graph': [[0, 1], [0, 3], [2, 3], [1, 2], [0, 2], [2, 12], [12, 13], [13, 14], [7, 14], [7, 9], [5, 8], [5, 14], [6, 12], [6, 11], [4, 12], [4, 10]], 'width': 20}]
      criterion:
        heatmap:
          type: WeightedBCECriterion
          weight: 1.0
        locref:
          type: WeightedHuberCriterion
          weight: 0.05
        paf:
          type: WeightedHuberCriterion
          weight: 0.1
      heatmap_config:
        channels: [2304, 15]
        kernel_size: [3]
        strides: [2]
      locref_config:
        channels: [2304, 30]
        kernel_size: [3]
        strides: [2]
      paf_config:
        channels: [2304, 32]
        kernel_size: [3]
        strides: [2]
      num_stages: 5
    identity:
      type: HeatmapHead
      predictor:
        type: HeatmapPredictor
        location_refinement: False
      target_generator:
        type: HeatmapPlateauGenerator
        num_heatmaps: 2
        pos_dist_thresh: 17
        heatmap_mode: INDIVIDUAL
        generate_locref: False
      criterion:
        heatmap:
          type: WeightedBCECriterion
          weight: 1.0
      heatmap_config:
        channels: [2304, 2]
        kernel_size: [3]
        strides: [2]
net_type: dlcrnet_stride16_ms5
runner:
  type: PoseTrainingRunner
  gpus: None
  key_metric: test.mAP
  key_metric_asc: True
  eval_interval: 25
  optimizer:
    type: AdamW
    params:
      lr: 0.0001
  scheduler:
    type: LRListScheduler
    params:
      lr_list: [[1e-05], [1e-06]]
      milestones: [90, 120]
  snapshots:
    max_snapshots: 5
    save_epochs: 50
    save_optimizer_state: False
train_settings:
  batch_size: 16
  dataloader_workers: 0
  dataloader_pin_memory: True
  display_iters: 1000
  epochs: 200
  seed: 42
Loading pretrained weights from Hugging Face hub (timm/resnet50.a1_in1k)
[timm/resnet50.a1_in1k] Safe alternative available for 'pytorch_model.bin' (as 'model.safetensors'). Loading weights using safetensors.
Data Transforms:
Training: Compose([
Affine(always_apply=False, p=0.5, interpolation=1, mask_interpolation=0, cval=0, mode=0, scale={'x': (1.0, 1.0), 'y': (1.0, 1.0)}, translate_percent=None, translate_px={'x': (0, 0), 'y': (0, 0)}, rotate=(-30, 30), fit_output=False, shear={'x': (0.0, 0.0), 'y': (0.0, 0.0)}, cval_mask=0, keep_ratio=True, rotate_method='largest_box'),
GaussNoise(always_apply=False, p=0.5, var_limit=(0, 162.5625), per_channel=True, mean=0),
Normalize(always_apply=False, p=1.0, mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225], max_pixel_value=255.0),
], p=1.0, bbox_params={'format': 'coco', 'label_fields': ['bbox_labels'], 'min_area': 0.0, 'min_visibility': 0.0, 'min_width': 0.0, 'min_height': 0.0, 'check_each_transform': True}, keypoint_params={'format': 'xy', 'label_fields': ['class_labels'], 'remove_invisible': False, 'angle_in_degrees': True, 'check_each_transform': True}, additional_targets={}, is_check_shapes=True)
Validation: Compose([
Normalize(always_apply=False, p=1.0, mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225], max_pixel_value=255.0),
], p=1.0, bbox_params={'format': 'coco', 'label_fields': ['bbox_labels'], 'min_area': 0.0, 'min_visibility': 0.0, 'min_width': 0.0, 'min_height': 0.0, 'check_each_transform': True}, keypoint_params={'format': 'xy', 'label_fields': ['class_labels'], 'remove_invisible': False, 'angle_in_degrees': True, 'check_each_transform': True}, additional_targets={}, is_check_shapes=True)
Using custom collate function: {'type': 'ResizeFromDataSizeCollate', 'min_scale': 0.4, 'max_scale': 1.0, 'min_short_side': 128, 'max_short_side': 1152, 'multiple_of': 32, 'to_square': False}
Using 102 images and 44 for testing
Starting pose model training...
--------------------------------------------------
/usr/local/anaconda3/envs/deeplabcut3/lib/python3.10/site-packages/torch/nn/modules/conv.py:456: UserWarning: Plan failed with a cudnnException: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_NOT_SUPPORTED (Triggered internally at ../aten/src/ATen/native/cudnn/Conv_v8.cpp:919.)
return F.conv2d(input, weight, bias, self.stride,
/usr/local/anaconda3/envs/deeplabcut3/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: Plan failed with a cudnnException: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_NOT_SUPPORTED (Triggered internally at ../aten/src/ATen/native/cudnn/Conv_v8.cpp:919.)
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
Epoch 1/200 (lr=0.0001), train loss 0.35200
Epoch 2/200 (lr=0.0001), train loss 0.16082
Epoch 3/200 (lr=0.0001), train loss 0.13178
Epoch 4/200 (lr=0.0001), train loss 0.08032
Epoch 5/200 (lr=0.0001), train loss 0.04940
Epoch 6/200 (lr=0.0001), train loss 0.04160
Epoch 7/200 (lr=0.0001), train loss 0.04610
Epoch 8/200 (lr=0.0001), train loss 0.05107
Epoch 9/200 (lr=0.0001), train loss 0.03826
Epoch 10/200 (lr=0.0001), train loss 0.03448
Epoch 11/200 (lr=0.0001), train loss 0.02770
Epoch 12/200 (lr=0.0001), train loss 0.02214
Epoch 13/200 (lr=0.0001), train loss 0.02729
Epoch 14/200 (lr=0.0001), train loss 0.03104
Epoch 15/200 (lr=0.0001), train loss 0.02087
Epoch 16/200 (lr=0.0001), train loss 0.03396
Epoch 17/200 (lr=0.0001), train loss 0.02404
Epoch 18/200 (lr=0.0001), train loss 0.02167
Epoch 19/200 (lr=0.0001), train loss 0.02466
Epoch 20/200 (lr=0.0001), train loss 0.02104
Epoch 21/200 (lr=0.0001), train loss 0.02200
Epoch 22/200 (lr=0.0001), train loss 0.02281
Epoch 23/200 (lr=0.0001), train loss 0.02097
Epoch 24/200 (lr=0.0001), train loss 0.01914
Training for epoch 25 done, starting evaluation
/usr/local/anaconda3/envs/deeplabcut3/lib/python3.10/site-packages/deeplabcut/pose_estimation_pytorch/post_processing/match_predictions_to_gt.py:74: RuntimeWarning: Mean of empty slice
distance_matrix[i, j] = np.nanmean(d)
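The "Mean of empty slice" warning above is a likely source of the NaN. The sketch below is not the DeepLabCut implementation: `mean_squared_distance` and `sanitize_cost_matrix` are hypothetical helpers, the keypoint mask is an assumption about how shared keypoints are selected, and the coordinates are made up. It shows how `np.nanmean` returns NaN when a ground-truth and a predicted individual share no valid keypoint, and one possible guard that replaces such entries with a finite penalty before the assignment step.

```python
import warnings

import numpy as np


def mean_squared_distance(gt_idv, pred_idv):
    """Mean squared distance over keypoints present in both poses.

    Both inputs are (num_keypoints, 2) arrays with NaN for missing keypoints.
    The mask is an assumption about how shared keypoints are selected.
    """
    mask = ~np.isnan(gt_idv).any(axis=1) & ~np.isnan(pred_idv).any(axis=1)
    d = (gt_idv[mask, :2] - pred_idv[mask, :2]) ** 2
    with warnings.catch_warnings():
        # Silence "Mean of empty slice" so the demo output stays readable.
        warnings.simplefilter("ignore", category=RuntimeWarning)
        return np.nanmean(d)  # NaN when no keypoint is shared


def sanitize_cost_matrix(cost, penalty=None):
    """Replace non-finite entries with a finite penalty above every real cost."""
    cost = np.asarray(cost, dtype=float)
    finite = np.isfinite(cost)
    if penalty is None:
        penalty = (cost[finite].max() if finite.any() else 0.0) + 1.0
    return np.where(finite, cost, penalty)


# Illustrative poses (hypothetical values): one ground-truth individual,
# one usable prediction and one prediction with no detected keypoints.
gt = np.array([[10.0, 10.0], [20.0, 20.0]])
pred_ok = np.array([[11.0, 10.0], [20.0, 22.0]])
pred_empty = np.full((2, 2), np.nan)

cost = np.array([[mean_squared_distance(gt, pred_ok),
                  mean_squared_distance(gt, pred_empty)]])
safe_cost = sanitize_cost_matrix(cost)
print(cost)
print(safe_cost)
```

Whether a penalty is the right choice, as opposed to dropping individuals without any valid keypoint before matching, is a design decision this sketch does not settle.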
{
"name": "ValueError",
"message": "matrix contains invalid numeric entries",
"stack": "---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[4], line 1
----> 1 deeplabcut.train_network(config_path, shuffle=0)
File /usr/local/anaconda3/envs/deeplabcut3/lib/python3.10/site-packages/deeplabcut/compat.py:245, in train_network(config, shuffle, trainingsetindex, max_snapshots_to_keep, displayiters, saveiters, maxiters, allow_growth, gputouse, autotune, keepdeconvweights, modelprefix, superanimal_name, superanimal_transfer_learning, engine, **torch_kwargs)
242 if \"display_iters\" not in torch_kwargs:
243 torch_kwargs[\"display_iters\"] = displayiters
--> 245 return train_network(
246 config,
247 shuffle=shuffle,
248 trainingsetindex=trainingsetindex,
249 modelprefix=modelprefix,
250 max_snapshots_to_keep=max_snapshots_to_keep,
251 **torch_kwargs,
252 )
254 raise NotImplementedError(f\"This function is not implemented for {engine}\")
File /usr/local/anaconda3/envs/deeplabcut3/lib/python3.10/site-packages/deeplabcut/pose_estimation_pytorch/apis/train.py:336, in train_network(config, shuffle, trainingsetindex, modelprefix, device, snapshot_path, detector_path, batch_size, epochs, save_epochs, detector_batch_size, detector_epochs, detector_save_epochs, display_iters, max_snapshots_to_keep, pose_threshold, **kwargs)
323 detector_run_config[\"train_settings\"][\"weight_init\"] = loader.model_cfg[
324 \"train_settings\"
325 ].get(\"weight_init\")
326 train(
327 loader=loader,
328 run_config=detector_run_config,
(...)
333 max_snapshots_to_keep=max_snapshots_to_keep,
334 )
--> 336 train(
337 loader=loader,
338 run_config=loader.model_cfg,
339 task=pose_task,
340 device=device,
341 logger_config=loader.model_cfg.get(\"logger\"),
342 snapshot_path=snapshot_path,
343 max_snapshots_to_keep=max_snapshots_to_keep,
344 )
346 destroy_file_logging()
File /usr/local/anaconda3/envs/deeplabcut3/lib/python3.10/site-packages/deeplabcut/pose_estimation_pytorch/apis/train.py:189, in train(loader, run_config, task, device, gpus, logger_config, snapshot_path, transform, inference_transform, max_snapshots_to_keep)
186 else:
187 logging.info(\"\
Starting pose model training...\
\" + (50 * \"-\"))
--> 189 runner.fit(
190 train_dataloader,
191 valid_dataloader,
192 epochs=run_config[\"train_settings\"][\"epochs\"],
193 display_iters=run_config[\"train_settings\"][\"display_iters\"],
194 )
File /usr/local/anaconda3/envs/deeplabcut3/lib/python3.10/site-packages/deeplabcut/pose_estimation_pytorch/runners/train.py:181, in TrainingRunner.fit(self, train_loader, valid_loader, epochs, display_iters)
179 with torch.no_grad():
180 logging.info(f\"Training for epoch {e} done, starting evaluation\")
--> 181 valid_loss = self._epoch(
182 valid_loader, mode=\"eval\", display_iters=display_iters
183 )
184 if self._print_valid_loss:
185 msg += f\", valid loss {float(valid_loss):.5f}\"
File /usr/local/anaconda3/envs/deeplabcut3/lib/python3.10/site-packages/deeplabcut/pose_estimation_pytorch/runners/train.py:236, in TrainingRunner._epoch(self, loader, mode, display_iters)
234 perf_metrics = None
235 if mode == \"eval\":
--> 236 perf_metrics = self._compute_epoch_metrics()
237 self._metadata[\"metrics\"] = perf_metrics
238 self._epoch_predictions = {}
File /usr/local/anaconda3/envs/deeplabcut3/lib/python3.10/site-packages/deeplabcut/pose_estimation_pytorch/runners/train.py:365, in PoseTrainingRunner._compute_epoch_metrics(self)
358 \"\"\"Computes the metrics using the data accumulated during an epoch
359 Returns:
360 A dictionary containing the different losses for the step
361 \"\"\"
362 num_animals = max(
363 [len(kpts) for kpts in self._epoch_ground_truth[\"bodyparts\"].values()]
364 )
--> 365 poses = pair_predicted_individuals_with_gt(
366 self._epoch_predictions[\"bodyparts\"], self._epoch_ground_truth[\"bodyparts\"]
367 )
369 # pad predictions if there are any missing (needed for top-down models)
370 gt, pred = {}, {}
File /usr/local/anaconda3/envs/deeplabcut3/lib/python3.10/site-packages/deeplabcut/pose_estimation_pytorch/metrics/scoring.py:391, in pair_predicted_individuals_with_gt(predictions, ground_truth)
389 matched_poses = {}
390 for image, pose in predictions.items():
--> 391 match_individuals = rmse_match_prediction_to_gt(pose, ground_truth[image])
392 matched_poses[image] = pose[match_individuals]
394 return matched_poses
File /usr/local/anaconda3/envs/deeplabcut3/lib/python3.10/site-packages/deeplabcut/pose_estimation_pytorch/post_processing/match_predictions_to_gt.py:76, in rmse_match_prediction_to_gt(pred_kpts, gt_kpts)
73 d = (gt_idv[mask, :2] - pred_idv[mask, :2]) ** 2
74 distance_matrix[i, j] = np.nanmean(d)
---> 76 _, col_ind = linear_sum_assignment(distance_matrix) # len == len(valid_gt_indices)
78 gt_idx_to_pred_idx = {
79 valid_gt_indices[valid_gt_index]: valid_pred_indices[valid_pred_index]
80 for valid_gt_index, valid_pred_index in enumerate(col_ind)
81 }
82 matched_pred = {valid_pred_indices[i] for i in col_ind}
ValueError: matrix contains invalid numeric entries"
}
Anything else?
No response
Code of Conduct
- I agree to follow this project's Code of Conduct