Description
Is there an existing issue for this?
- I have searched the existing issues
Bug description
I'm trying to do transfer learning with the SuperAnimal Quadruped model, but I keep getting a couple of errors and it won't train. It may be an install issue, but I'm not sure and don't know how to resolve it.
- It shows a notice saying that I set a batch size of 1 and/or freeze_bn_stats=false; neither is the case according to both config files. The config files are attached as text files (the uploader won't accept .yaml).
- When object detector training starts, a series of traceback errors is thrown and it does not train. I'm attaching the log output.
I have tried on two different servers and rebuilt the environment several times. I have verified that nvidia-smi shows the GPU and that torch.cuda.is_available() returns True in the environment. I have tried the latest version of DLC. I have also tried PyTorch builds for CUDA 12.1 and 12.4, in case a newer version was needed. I have followed the regular installation guide in the docs and also the instructions at #2613, to no avail.
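A minimal sketch of the kind of environment check described above, for anyone trying to reproduce this. The helper name describe_torch_env is hypothetical (it is not part of DeepLabCut); it only uses standard PyTorch attributes and reports that PyTorch is missing instead of failing if it is not installed.

```python
import importlib.util


def describe_torch_env():
    """Collect the PyTorch/CUDA facts that are useful in a bug report.

    Hypothetical helper for illustration; not part of DeepLabCut.
    """
    # Avoid an ImportError if PyTorch is not installed in this environment.
    if importlib.util.find_spec("torch") is None:
        return {"torch_installed": False}

    import torch

    cuda_ok = torch.cuda.is_available()
    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # CUDA version PyTorch was built against (None for CPU-only builds).
        "cuda_build": torch.version.cuda,
        "cuda_available": cuda_ok,
        "device_name": torch.cuda.get_device_name(0) if cuda_ok else None,
    }


if __name__ == "__main__":
    for key, value in describe_torch_env().items():
        print(f"{key}: {value}")
```

Running this inside the deeplabcut3 environment shows whether the installed PyTorch build matches the CUDA version expected for the driver.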
I am coming up to a submission deadline so help is really appreciated!
Complete_Log_DLC_Failing.txt
config.txt
pose_cfg.txt
pytorch_config.txt
Operating System
Windows 2022 Server
DeepLabCut version
deeplabcut-3.0.0rc1
DeepLabCut mode
single animal
Device type
gpu, NVIDIA RTX A4000. I also tried a different server with NVIDIA A10.
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 537.42 Driver Version: 537.42 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA RTX A4000 WDDM | 00000000:21:00.0 Off | Off |
| 41% 30C P8 9W / 140W | 508MiB / 16376MiB | 1% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 4872 C+G C:\Windows\explorer.exe N/A |
| 0 N/A N/A 8696 C+G ...cal\Microsoft\OneDrive\OneDrive.exe N/A |
| 0 N/A N/A 9176 C+G ...m Files\Mozilla Firefox\firefox.exe N/A |
| 0 N/A N/A 10988 C+G ...crosoft\Edge\Application\msedge.exe N/A |
| 0 N/A N/A 11060 C+G ...CBS_cw5n1h2txyewy\TextInputHost.exe N/A |
| 0 N/A N/A 12368 C+G ...ekyb3d8bbwe\PhoneExperienceHost.exe N/A |
| 0 N/A N/A 17328 C+G ....Search_cw5n1h2txyewy\SearchApp.exe N/A |
+---------------------------------------------------------------------------------------+
Steps To Reproduce
- Follow the installation instructions in #2613 (🔥 New main backend Deep Learning Library (Engine) for DeepLabCut: PyTorch)
- Open GUI
- Split dataset 80/20
- Then go to train network. It will return the log I am adding below. The full log output is attached in the bug description.
Relevant log output
Note: According to your model configuration, you're training with batch size 1 and/or ``freeze_bn_stats=false``. This is not an optimal setting if you have powerful GPUs.
This is good for small batch sizes (e.g., when training on a CPU), where you should keep ``freeze_bn_stats=true``.
If you're using a GPU to train, you can obtain faster performance by setting a larger batch size (the biggest power of 2 where you don't geta CUDA out-of-memory error, such as 8, 16, 32 or 64 depending on the model, size of your images, and GPU memory) and ``freeze_bn_stats=false`` for the backbone of your model.
This also allows you to increase the learning rate (empirically you can scale the learning rate by sqrt(batch_size) times).
Using 628 images and 157 for testing
Starting object detector training...
--------------------------------------------------
Traceback (most recent call last):
File "C:\Other_Program_Files\miniforge3\envs\deeplabcut3\lib\site-packages\deeplabcut\gui\tabs\train_network.py", line 190, in train_network
compat.train_network(config, shuffle, **kwargs)
File "C:\Other_Program_Files\miniforge3\envs\deeplabcut3\lib\site-packages\deeplabcut\compat.py", line 245, in train_network
return train_network(
File "C:\Other_Program_Files\miniforge3\envs\deeplabcut3\lib\site-packages\deeplabcut\pose_estimation_pytorch\apis\train.py", line 326, in train_network
train(
File "C:\Other_Program_Files\miniforge3\envs\deeplabcut3\lib\site-packages\deeplabcut\pose_estimation_pytorch\apis\train.py", line 189, in train
runner.fit(
File "C:\Other_Program_Files\miniforge3\envs\deeplabcut3\lib\site-packages\deeplabcut\pose_estimation_pytorch\runners\train.py", line 170, in fit
train_loss = self._epoch(
File "C:\Other_Program_Files\miniforge3\envs\deeplabcut3\lib\site-packages\deeplabcut\pose_estimation_pytorch\runners\train.py", line 221, in _epoch
losses_dict = self.step(batch, mode)
File "C:\Other_Program_Files\miniforge3\envs\deeplabcut3\lib\site-packages\deeplabcut\pose_estimation_pytorch\runners\train.py", line 503, in step
losses, predictions = self.model(images, target)
File "C:\Other_Program_Files\miniforge3\envs\deeplabcut3\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Other_Program_Files\miniforge3\envs\deeplabcut3\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Other_Program_Files\miniforge3\envs\deeplabcut3\lib\site-packages\deeplabcut\pose_estimation_pytorch\models\detectors\fasterRCNN.py", line 106, in forward
return self.model(x, targets)
File "C:\Other_Program_Files\miniforge3\envs\deeplabcut3\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Other_Program_Files\miniforge3\envs\deeplabcut3\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Other_Program_Files\miniforge3\envs\deeplabcut3\lib\site-packages\torchvision\models\detection\generalized_rcnn.py", line 105, in forward
detections, detector_losses = self.roi_heads(features, proposals, images.image_sizes, targets)
File "C:\Other_Program_Files\miniforge3\envs\deeplabcut3\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Other_Program_Files\miniforge3\envs\deeplabcut3\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Other_Program_Files\miniforge3\envs\deeplabcut3\lib\site-packages\torchvision\models\detection\roi_heads.py", line 749, in forward
raise TypeError(f"target labels must of int64 type, instead got {t['labels'].dtype}")
TypeError: target labels must of int64 type, instead got torch.int32
Anything else?
No response
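For anyone hitting the same traceback: torchvision's detection heads require the target labels to be int64 tensors, and the error above says they arrive as int32. One plausible cause on Windows (an assumption, not confirmed here) is that NumPy versions before 2.0 default to 32-bit integers on that platform. The sketch below only illustrates the dtype mismatch and a cast that satisfies the check; cast_labels_to_int64 is a hypothetical helper, not DeepLabCut's API or its actual fix.

```python
import torch


def cast_labels_to_int64(targets):
    """Return copies of the target dicts with 'labels' cast to int64.

    Hypothetical helper for illustration; the original dicts are not modified.
    """
    return [{**t, "labels": t["labels"].to(torch.int64)} for t in targets]


# A target shaped like the ones torchvision's Faster R-CNN expects,
# but with int32 labels as reported in the traceback.
targets = [
    {
        "boxes": torch.tensor([[0.0, 0.0, 10.0, 10.0]]),
        "labels": torch.ones(1, dtype=torch.int32),
    }
]

fixed = cast_labels_to_int64(targets)
print(targets[0]["labels"].dtype, "->", fixed[0]["labels"].dtype)
```

Where such a cast belongs in the DeepLabCut data pipeline is not something this sketch tries to answer.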
Code of Conduct
- I agree to follow this project's Code of Conduct