Evaluation error with PAF heads: `ValueError: matrix contains invalid numeric entries`

### Is there an existing issue for this?

- [X] I have searched the existing issues

### Bug description

Hello,
While using DeepLabCut with the PyTorch engine, I ran into an issue with the `dlcrnet_stride16_ms5` model. When I call `train_network`, a `ValueError: matrix contains invalid numeric entries` is raised at the first evaluation step (after epoch 25, see the log output). Reducing the learning rate or changing the training batch size does not resolve the problem.

### Operating System

Ubuntu 18.04

### DeepLabCut version

DLC 3.0.0rc1

### DeepLabCut mode

multi animal

### Device type

gpu

### Steps To Reproduce

1. Create a training dataset.
2. Use the following `pytorch_config.yaml`:

```yaml
data:
  colormode: RGB
  inference:
    normalize_images: true
  train:
    affine:
      p: 0.5
      rotation: 30
      scaling:
      - 1.0
      - 1.0
      translation: 0
    collate:
      type: ResizeFromDataSizeCollate
      min_scale: 0.4
      max_scale: 1.0
      min_short_side: 128
      max_short_side: 1152
      multiple_of: 32
      to_square: false
    covering: false
    gaussian_noise: 12.75
    hist_eq: false
    motion_blur: false
    normalize_images: true
device: auto
metadata:
  project_path: /mnt/Data16Tb/Data/boyang/pose/MMVISV3-BRIAN-2024-06-19
  pose_config_path: 
    /mnt/Data16Tb/Data/boyang/pose/MMVISV3-BRIAN-2024-06-19/dlc-models-pytorch/iteration-3/MMVISV3Jun19-trainset70shuffle0/train/pose_cfg.yaml
  bodyparts:
  - Front
  - Right
  - Middle
  - Left
  - FL1
  - BL1
  - FR1
  - BR1
  - BL2
  - BR2
  - FL2
  - FR2
  - Body1
  - Body2
  - Body3
  unique_bodyparts: []
  individuals:
  - MARMOSET_1
  - MARMOSET_2
  with_identity: true
method: bu
model:
  backbone:
    type: DLCRNet
    model_name: resnet50
    pretrained: true
    output_stride: 16
  backbone_output_channels: 2304
  pose_model:
    stride: 8
  heads:
    bodypart:
      type: DLCRNetHead
      predictor:
        type: PartAffinityFieldPredictor
        num_animals: 2
        num_multibodyparts: 15
        num_uniquebodyparts: 0
        nms_radius: 5
        sigma: 1.0
        locref_stdev: 7.2801
        min_affinity: 0.05
        graph: &id001
        - - 0
          - 1
        - - 0
          - 3
        - - 2
          - 3
        - - 1
          - 2
        - - 0
          - 2
        - - 2
          - 12
        - - 12
          - 13
        - - 13
          - 14
        - - 7
          - 14
        - - 7
          - 9
        - - 5
          - 8
        - - 5
          - 14
        - - 6
          - 12
        - - 6
          - 11
        - - 4
          - 12
        - - 4
          - 10
        edges_to_keep:
        - 0
        - 1
        - 2
        - 3
        - 4
        - 5
        - 6
        - 7
        - 8
        - 9
        - 10
        - 11
        - 12
        - 13
        - 14
        - 15
      target_generator:
        type: SequentialGenerator
        generators:
        - type: HeatmapPlateauGenerator
          num_heatmaps: 15
          pos_dist_thresh: 17
          heatmap_mode: KEYPOINT
          generate_locref: true
          locref_std: 7.2801
        - type: PartAffinityFieldGenerator
          graph: *id001
          width: 20
      criterion:
        heatmap:
          type: WeightedBCECriterion
          weight: 1.0
        locref:
          type: WeightedHuberCriterion
          weight: 0.05
        paf:
          type: WeightedHuberCriterion
          weight: 0.1
      heatmap_config:
        channels:
        - 2304
        - 15
        kernel_size:
        - 3
        strides:
        - 2
      locref_config:
        channels:
        - 2304
        - 30
        kernel_size:
        - 3
        strides:
        - 2
      paf_config:
        channels:
        - 2304
        - 32
        kernel_size:
        - 3
        strides:
        - 2
      num_stages: 5
    identity:
      type: HeatmapHead
      predictor:
        type: HeatmapPredictor
        location_refinement: false
      target_generator:
        type: HeatmapPlateauGenerator
        num_heatmaps: 2
        pos_dist_thresh: 17
        heatmap_mode: INDIVIDUAL
        generate_locref: false
      criterion:
        heatmap:
          type: WeightedBCECriterion
          weight: 1.0
      heatmap_config:
        channels:
        - 2304
        - 2
        kernel_size:
        - 3
        strides:
        - 2
net_type: dlcrnet_stride16_ms5
runner:
  type: PoseTrainingRunner
  gpus:
  key_metric: test.mAP
  key_metric_asc: true
  eval_interval: 25
  optimizer:
    type: AdamW
    params:
      lr: 0.0001
  scheduler:
    type: LRListScheduler
    params:
      lr_list:
      - - 1e-05
      - - 1e-06
      milestones:
      - 90
      - 120
  snapshots:
    max_snapshots: 5
    save_epochs: 50
    save_optimizer_state: false
train_settings:
  batch_size: 16
  dataloader_workers: 0
  dataloader_pin_memory: true
  display_iters: 1000
  epochs: 200
  seed: 42


```
3. Run `deeplabcut.train_network(config_path, shuffle=0)`.


### Relevant log output

```shell
Training with configuration:
data:
  colormode: RGB
  inference:
    normalize_images: True
  train:
    affine:
      p: 0.5
      rotation: 30
      scaling: [1.0, 1.0]
      translation: 0
    collate:
      type: ResizeFromDataSizeCollate
      min_scale: 0.4
      max_scale: 1.0
      min_short_side: 128
      max_short_side: 1152
      multiple_of: 32
      to_square: False
    covering: False
    gaussian_noise: 12.75
    hist_eq: False
    motion_blur: False
    normalize_images: True
device: auto
metadata:
  project_path: /mnt/Data16Tb/Data/boyang/pose/MMVISV3-BRIAN-2024-06-19
  pose_config_path: /mnt/Data16Tb/Data/boyang/pose/MMVISV3-BRIAN-2024-06-19/dlc-models-pytorch/iteration-3/MMVISV3Jun19-trainset70shuffle0/train/pose_cfg.yaml
  bodyparts: ['Front', 'Right', 'Middle', 'Left', 'FL1', 'BL1', 'FR1', 'BR1', 'BL2', 'BR2', 'FL2', 'FR2', 'Body1', 'Body2', 'Body3']
  unique_bodyparts: []
  individuals: ['MARMOSET_1', 'MARMOSET_2']
  with_identity: True
method: bu
model:
  backbone:
    type: DLCRNet
    model_name: resnet50
    pretrained: True
    output_stride: 16
  backbone_output_channels: 2304
  pose_model:
    stride: 8
  heads:
    bodypart:
      type: DLCRNetHead
      predictor:
        type: PartAffinityFieldPredictor
        num_animals: 2
        num_multibodyparts: 15
        num_uniquebodyparts: 0
        nms_radius: 5
        sigma: 1.0
        locref_stdev: 7.2801
        min_affinity: 0.05
        graph: [[0, 1], [0, 3], [2, 3], [1, 2], [0, 2], [2, 12], [12, 13], [13, 14], [7, 14], [7, 9], [5, 8], [5, 14], [6, 12], [6, 11], [4, 12], [4, 10]]
        edges_to_keep: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
      target_generator:
        type: SequentialGenerator
        generators: [{'type': 'HeatmapPlateauGenerator', 'num_heatmaps': 15, 'pos_dist_thresh': 17, 'heatmap_mode': 'KEYPOINT', 'generate_locref': True, 'locref_std': 7.2801}, {'type': 'PartAffinityFieldGenerator', 'graph': [[0, 1], [0, 3], [2, 3], [1, 2], [0, 2], [2, 12], [12, 13], [13, 14], [7, 14], [7, 9], [5, 8], [5, 14], [6, 12], [6, 11], [4, 12], [4, 10]], 'width': 20}]
      criterion:
        heatmap:
          type: WeightedBCECriterion
          weight: 1.0
        locref:
          type: WeightedHuberCriterion
          weight: 0.05
        paf:
          type: WeightedHuberCriterion
          weight: 0.1
      heatmap_config:
        channels: [2304, 15]
        kernel_size: [3]
        strides: [2]
      locref_config:
        channels: [2304, 30]
        kernel_size: [3]
        strides: [2]
      paf_config:
        channels: [2304, 32]
        kernel_size: [3]
        strides: [2]
      num_stages: 5
    identity:
      type: HeatmapHead
      predictor:
        type: HeatmapPredictor
        location_refinement: False
      target_generator:
        type: HeatmapPlateauGenerator
        num_heatmaps: 2
        pos_dist_thresh: 17
        heatmap_mode: INDIVIDUAL
        generate_locref: False
      criterion:
        heatmap:
          type: WeightedBCECriterion
          weight: 1.0
      heatmap_config:
        channels: [2304, 2]
        kernel_size: [3]
        strides: [2]
net_type: dlcrnet_stride16_ms5
runner:
  type: PoseTrainingRunner
  gpus: None
  key_metric: test.mAP
  key_metric_asc: True
  eval_interval: 25
  optimizer:
    type: AdamW
    params:
      lr: 0.0001
  scheduler:
    type: LRListScheduler
    params:
      lr_list: [[1e-05], [1e-06]]
      milestones: [90, 120]
  snapshots:
    max_snapshots: 5
    save_epochs: 50
    save_optimizer_state: False
train_settings:
  batch_size: 16
  dataloader_workers: 0
  dataloader_pin_memory: True
  display_iters: 1000
  epochs: 200
  seed: 42
Loading pretrained weights from Hugging Face hub (timm/resnet50.a1_in1k)
[timm/resnet50.a1_in1k] Safe alternative available for 'pytorch_model.bin' (as 'model.safetensors'). Loading weights using safetensors.
Data Transforms:
  Training:   Compose([
  Affine(always_apply=False, p=0.5, interpolation=1, mask_interpolation=0, cval=0, mode=0, scale={'x': (1.0, 1.0), 'y': (1.0, 1.0)}, translate_percent=None, translate_px={'x': (0, 0), 'y': (0, 0)}, rotate=(-30, 30), fit_output=False, shear={'x': (0.0, 0.0), 'y': (0.0, 0.0)}, cval_mask=0, keep_ratio=True, rotate_method='largest_box'),
  GaussNoise(always_apply=False, p=0.5, var_limit=(0, 162.5625), per_channel=True, mean=0),
  Normalize(always_apply=False, p=1.0, mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225], max_pixel_value=255.0),
], p=1.0, bbox_params={'format': 'coco', 'label_fields': ['bbox_labels'], 'min_area': 0.0, 'min_visibility': 0.0, 'min_width': 0.0, 'min_height': 0.0, 'check_each_transform': True}, keypoint_params={'format': 'xy', 'label_fields': ['class_labels'], 'remove_invisible': False, 'angle_in_degrees': True, 'check_each_transform': True}, additional_targets={}, is_check_shapes=True)
  Validation: Compose([
  Normalize(always_apply=False, p=1.0, mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225], max_pixel_value=255.0),
], p=1.0, bbox_params={'format': 'coco', 'label_fields': ['bbox_labels'], 'min_area': 0.0, 'min_visibility': 0.0, 'min_width': 0.0, 'min_height': 0.0, 'check_each_transform': True}, keypoint_params={'format': 'xy', 'label_fields': ['class_labels'], 'remove_invisible': False, 'angle_in_degrees': True, 'check_each_transform': True}, additional_targets={}, is_check_shapes=True)
Using custom collate function: {'type': 'ResizeFromDataSizeCollate', 'min_scale': 0.4, 'max_scale': 1.0, 'min_short_side': 128, 'max_short_side': 1152, 'multiple_of': 32, 'to_square': False}
Using 102 images and 44 for testing

Starting pose model training...
--------------------------------------------------
/usr/local/anaconda3/envs/deeplabcut3/lib/python3.10/site-packages/torch/nn/modules/conv.py:456: UserWarning: Plan failed with a cudnnException: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_NOT_SUPPORTED (Triggered internally at ../aten/src/ATen/native/cudnn/Conv_v8.cpp:919.)
  return F.conv2d(input, weight, bias, self.stride,
/usr/local/anaconda3/envs/deeplabcut3/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: Plan failed with a cudnnException: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_NOT_SUPPORTED (Triggered internally at ../aten/src/ATen/native/cudnn/Conv_v8.cpp:919.)
  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
Epoch 1/200 (lr=0.0001), train loss 0.35200
Epoch 2/200 (lr=0.0001), train loss 0.16082
Epoch 3/200 (lr=0.0001), train loss 0.13178
Epoch 4/200 (lr=0.0001), train loss 0.08032
Epoch 5/200 (lr=0.0001), train loss 0.04940
Epoch 6/200 (lr=0.0001), train loss 0.04160
Epoch 7/200 (lr=0.0001), train loss 0.04610
Epoch 8/200 (lr=0.0001), train loss 0.05107
Epoch 9/200 (lr=0.0001), train loss 0.03826
Epoch 10/200 (lr=0.0001), train loss 0.03448
Epoch 11/200 (lr=0.0001), train loss 0.02770
Epoch 12/200 (lr=0.0001), train loss 0.02214
Epoch 13/200 (lr=0.0001), train loss 0.02729
Epoch 14/200 (lr=0.0001), train loss 0.03104
Epoch 15/200 (lr=0.0001), train loss 0.02087
Epoch 16/200 (lr=0.0001), train loss 0.03396
Epoch 17/200 (lr=0.0001), train loss 0.02404
Epoch 18/200 (lr=0.0001), train loss 0.02167
Epoch 19/200 (lr=0.0001), train loss 0.02466
Epoch 20/200 (lr=0.0001), train loss 0.02104
Epoch 21/200 (lr=0.0001), train loss 0.02200
Epoch 22/200 (lr=0.0001), train loss 0.02281
Epoch 23/200 (lr=0.0001), train loss 0.02097
Epoch 24/200 (lr=0.0001), train loss 0.01914
Training for epoch 25 done, starting evaluation
/usr/local/anaconda3/envs/deeplabcut3/lib/python3.10/site-packages/deeplabcut/pose_estimation_pytorch/post_processing/match_predictions_to_gt.py:74: RuntimeWarning: Mean of empty slice
  distance_matrix[i, j] = np.nanmean(d)
{
	"name": "ValueError",
	"message": "matrix contains invalid numeric entries",
	"stack": "---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[4], line 1
----> 1 deeplabcut.train_network(config_path, shuffle=0)

File /usr/local/anaconda3/envs/deeplabcut3/lib/python3.10/site-packages/deeplabcut/compat.py:245, in train_network(config, shuffle, trainingsetindex, max_snapshots_to_keep, displayiters, saveiters, maxiters, allow_growth, gputouse, autotune, keepdeconvweights, modelprefix, superanimal_name, superanimal_transfer_learning, engine, **torch_kwargs)
    242     if \"display_iters\" not in torch_kwargs:
    243         torch_kwargs[\"display_iters\"] = displayiters
--> 245     return train_network(
    246         config,
    247         shuffle=shuffle,
    248         trainingsetindex=trainingsetindex,
    249         modelprefix=modelprefix,
    250         max_snapshots_to_keep=max_snapshots_to_keep,
    251         **torch_kwargs,
    252     )
    254 raise NotImplementedError(f\"This function is not implemented for {engine}\")

File /usr/local/anaconda3/envs/deeplabcut3/lib/python3.10/site-packages/deeplabcut/pose_estimation_pytorch/apis/train.py:336, in train_network(config, shuffle, trainingsetindex, modelprefix, device, snapshot_path, detector_path, batch_size, epochs, save_epochs, detector_batch_size, detector_epochs, detector_save_epochs, display_iters, max_snapshots_to_keep, pose_threshold, **kwargs)
    323     detector_run_config[\"train_settings\"][\"weight_init\"] = loader.model_cfg[
    324         \"train_settings\"
    325     ].get(\"weight_init\")
    326     train(
    327         loader=loader,
    328         run_config=detector_run_config,
   (...)
    333         max_snapshots_to_keep=max_snapshots_to_keep,
    334     )
--> 336 train(
    337     loader=loader,
    338     run_config=loader.model_cfg,
    339     task=pose_task,
    340     device=device,
    341     logger_config=loader.model_cfg.get(\"logger\"),
    342     snapshot_path=snapshot_path,
    343     max_snapshots_to_keep=max_snapshots_to_keep,
    344 )
    346 destroy_file_logging()

File /usr/local/anaconda3/envs/deeplabcut3/lib/python3.10/site-packages/deeplabcut/pose_estimation_pytorch/apis/train.py:189, in train(loader, run_config, task, device, gpus, logger_config, snapshot_path, transform, inference_transform, max_snapshots_to_keep)
    186 else:
    187     logging.info(\"\
Starting pose model training...\
\" + (50 * \"-\"))
--> 189 runner.fit(
    190     train_dataloader,
    191     valid_dataloader,
    192     epochs=run_config[\"train_settings\"][\"epochs\"],
    193     display_iters=run_config[\"train_settings\"][\"display_iters\"],
    194 )

File /usr/local/anaconda3/envs/deeplabcut3/lib/python3.10/site-packages/deeplabcut/pose_estimation_pytorch/runners/train.py:181, in TrainingRunner.fit(self, train_loader, valid_loader, epochs, display_iters)
    179 with torch.no_grad():
    180     logging.info(f\"Training for epoch {e} done, starting evaluation\")
--> 181     valid_loss = self._epoch(
    182         valid_loader, mode=\"eval\", display_iters=display_iters
    183     )
    184     if self._print_valid_loss:
    185         msg += f\", valid loss {float(valid_loss):.5f}\"

File /usr/local/anaconda3/envs/deeplabcut3/lib/python3.10/site-packages/deeplabcut/pose_estimation_pytorch/runners/train.py:236, in TrainingRunner._epoch(self, loader, mode, display_iters)
    234 perf_metrics = None
    235 if mode == \"eval\":
--> 236     perf_metrics = self._compute_epoch_metrics()
    237     self._metadata[\"metrics\"] = perf_metrics
    238     self._epoch_predictions = {}

File /usr/local/anaconda3/envs/deeplabcut3/lib/python3.10/site-packages/deeplabcut/pose_estimation_pytorch/runners/train.py:365, in PoseTrainingRunner._compute_epoch_metrics(self)
    358 \"\"\"Computes the metrics using the data accumulated during an epoch
    359 Returns:
    360     A dictionary containing the different losses for the step
    361 \"\"\"
    362 num_animals = max(
    363     [len(kpts) for kpts in self._epoch_ground_truth[\"bodyparts\"].values()]
    364 )
--> 365 poses = pair_predicted_individuals_with_gt(
    366     self._epoch_predictions[\"bodyparts\"], self._epoch_ground_truth[\"bodyparts\"]
    367 )
    369 # pad predictions if there are any missing (needed for top-down models)
    370 gt, pred = {}, {}

File /usr/local/anaconda3/envs/deeplabcut3/lib/python3.10/site-packages/deeplabcut/pose_estimation_pytorch/metrics/scoring.py:391, in pair_predicted_individuals_with_gt(predictions, ground_truth)
    389 matched_poses = {}
    390 for image, pose in predictions.items():
--> 391     match_individuals = rmse_match_prediction_to_gt(pose, ground_truth[image])
    392     matched_poses[image] = pose[match_individuals]
    394 return matched_poses

File /usr/local/anaconda3/envs/deeplabcut3/lib/python3.10/site-packages/deeplabcut/pose_estimation_pytorch/post_processing/match_predictions_to_gt.py:76, in rmse_match_prediction_to_gt(pred_kpts, gt_kpts)
     73         d = (gt_idv[mask, :2] - pred_idv[mask, :2]) ** 2
     74         distance_matrix[i, j] = np.nanmean(d)
---> 76 _, col_ind = linear_sum_assignment(distance_matrix)  # len == len(valid_gt_indices)
     78 gt_idx_to_pred_idx = {
     79     valid_gt_indices[valid_gt_index]: valid_pred_indices[valid_pred_index]
     80     for valid_gt_index, valid_pred_index in enumerate(col_ind)
     81 }
     82 matched_pred = {valid_pred_indices[i] for i in col_ind}

ValueError: matrix contains invalid numeric entries"
}
```
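
For context, the failing step can be reproduced outside DeepLabCut. The sketch below is only my reading of the traceback, not the DeepLabCut implementation: if a ground-truth/prediction pair shares no valid keypoints, `np.nanmean` of the empty slice returns NaN (the `RuntimeWarning: Mean of empty slice` in the log), and `scipy.optimize.linear_sum_assignment` rejects a cost matrix that contains NaN. The functions `pairwise_cost` and `safe_assignment`, the toy coordinates, and the `penalty` value are hypothetical and only illustrate one possible guard.

```python
import warnings

import numpy as np
from scipy.optimize import linear_sum_assignment


def pairwise_cost(gt, pred):
    """Mean squared error over keypoints valid in both poses (NaN if none)."""
    cost = np.full((len(gt), len(pred)), np.nan)
    for i, g in enumerate(gt):
        for j, p in enumerate(pred):
            # Keep only keypoints that are defined in both poses.
            mask = ~np.isnan(g).any(axis=1) & ~np.isnan(p).any(axis=1)
            with warnings.catch_warnings():
                # An empty mask triggers "Mean of empty slice" and yields NaN.
                warnings.simplefilter("ignore", RuntimeWarning)
                cost[i, j] = np.nanmean((g[mask] - p[mask]) ** 2)
    return cost


def safe_assignment(cost, penalty=1e12):
    """Hypothetical guard: swap non-finite costs for a large finite penalty."""
    finite_cost = np.where(np.isfinite(cost), cost, penalty)
    return linear_sum_assignment(finite_cost)


# Toy data: 2 animals, 3 keypoints, (x, y) coordinates.
gt = np.array(
    [
        [[10.0, 10.0], [20.0, 20.0], [30.0, 30.0]],
        [[50.0, 50.0], [60.0, 60.0], [70.0, 70.0]],
    ]
)
pred = gt + 1.0
pred[1] = np.nan  # the second animal was not detected at all

cost = pairwise_cost(gt, pred)

# The unguarded call fails in the same way as the traceback above.
try:
    linear_sum_assignment(cost)
    raised, message = False, ""
except ValueError as err:
    raised, message = True, str(err)

# The guarded call still returns a matching.
rows, cols = safe_assignment(cost)
```

With a guard like this, the undetected animal is simply matched last, at the penalty cost. Whether that is the right behaviour for the evaluation metrics is a separate question for the maintainers.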


### Anything else?

_No response_

### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://github.com/DeepLabCut/DeepLabCut/blob/master/CODE_OF_CONDUCT.md)

Evaluation error with PAF heads: `ValueError: matrix contains invalid numeric entries` #2631
