Ubuntu 20.04
Python: 3.10
CUDA: 12.0
GPU: 4090
Torch: 1.13 + cuda 11.7
Nvidia-driver: 525.85.12
Using: fp16 mixed precision (the same code runs fine in fp32)
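Since the list above mixes a CUDA 12.0 toolkit with a Torch build for CUDA 11.7, and both cu117 and cu120 packages were installed at different times, it may help to confirm which spconv/cumm distributions are actually present in the environment. The following is a minimal sketch using only the standard library; the helper names `cuda_tag` and `installed_sparse_conv_packages` are made up for illustration and are not part of spconv or cumm.

```python
import re
from importlib import metadata


def cuda_tag(text):
    """Return the digits following 'cu' in a package name or version
    string (e.g. '117' for 'spconv-cu117'), or None if there is no tag."""
    match = re.search(r"cu(\d+)", text.lower())
    return match.group(1) if match else None


def installed_sparse_conv_packages():
    """Map every installed distribution whose name starts with
    'spconv' or 'cumm' to its version string."""
    found = {}
    for dist in metadata.distributions():
        try:
            name = (dist.metadata.get("Name", "") or "").lower()
            version = dist.version
        except Exception:
            # Skip distributions with unreadable metadata, which can
            # happen in an environment copied from another machine.
            continue
        if name.startswith(("spconv", "cumm")):
            found[name] = version
    return found


if __name__ == "__main__":
    packages = installed_sparse_conv_packages()
    for name, version in sorted(packages.items()):
        print(f"{name} {version} (CUDA tag: {cuda_tag(name)})")
    tags = {cuda_tag(name) for name in packages} - {None}
    if len(tags) > 1:
        print("Warning: packages for several CUDA versions are installed:", sorted(tags))
```

If more than one CUDA tag shows up, leftovers from an earlier install attempt may be shadowing the build that was meant to be tested. The same `cuda_tag` helper can be applied to a Torch version string such as `1.13.0+cu117` for comparison.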
I have tried several installation methods:
- install spconv_cu117 and cumm_cu117
- install spconv_cu120 and cumm_cu120
- build spconv and cumm
- build with JIT: spconv_cu117 and cumm_cu117 / spconv_cu120 and cumm_cu120
- build wheel: spconv_cu117 and cumm_cu117 / spconv_cu120 and cumm_cu120
but they all end with the same error:
Traceback (most recent call last):
File "/home/derek/2DPASS/main.py", line 177, in <module>
trainer.fit(my_model, train_dataset_loader, val_dataset_loader)
File "/home/stardust/miniconda3/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
call._call_and_handle_interrupt(
File "/home/stardust/miniconda3/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/stardust/miniconda3/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path)
File "/home/stardust/miniconda3/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1103, in _run
results = self._run_stage()
File "/home/stardust/miniconda3/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1182, in _run_stage
self._run_train()
File "/home/stardust/miniconda3/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1195, in _run_train
self._run_sanity_check()
File "/home/stardust/miniconda3/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1267, in _run_sanity_check
val_loop.run()
File "/home/stardust/miniconda3/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/home/stardust/miniconda3/lib/python3.10/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 152, in advance
dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
File "/home/stardust/miniconda3/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/home/stardust/miniconda3/lib/python3.10/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 137, in advance
output = self._evaluation_step(**kwargs)
File "/home/stardust/miniconda3/lib/python3.10/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 234, in _evaluation_step
output = self.trainer._call_strategy_hook(hook_name, *kwargs.values())
File "/home/stardust/miniconda3/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1485, in _call_strategy_hook
output = fn(*args, **kwargs)
File "/home/stardust/miniconda3/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 390, in validation_step
return self.model.validation_step(*args, **kwargs)
File "/home/derek/2DPASS/network/base_model.py", line 183, in validation_step
data_dict = self.forward(data_dict)
File "/home/derek/2DPASS/network/arch_2dpass.py", line 176, in forward
data_dict = self.model_3d(data_dict)
File "/home/stardust/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/derek/2DPASS/network/spvcnn.py", line 175, in forward
enc_feats.append(self.spv_enc[i](data_dict)) # found spv_env[4] produce nan
File "/home/stardust/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/derek/2DPASS/network/spvcnn.py", line 89, in forward
v_fea = self.v_enc(data_dict['sparse_tensor'])
File "/home/stardust/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/stardust/miniconda3/lib/python3.10/site-packages/spconv/pytorch/modules.py", line 138, in forward
input = module(input)
File "/home/stardust/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/derek/2DPASS/network/basic_block.py", line 35, in forward
output = self.layers(x)
File "/home/stardust/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/stardust/miniconda3/lib/python3.10/site-packages/spconv/pytorch/modules.py", line 138, in forward
input = module(input)
File "/home/stardust/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/stardust/miniconda3/lib/python3.10/site-packages/spconv/pytorch/conv.py", line 741, in forward
return self._conv_forward(self.training, input, self.weight, self.bias, add_input,
File "/home/stardust/miniconda3/lib/python3.10/site-packages/spconv/pytorch/conv.py", line 477, in _conv_forward
out_features, _, _ = ops.implicit_gemm(
File "/home/stardust/miniconda3/lib/python3.10/site-packages/spconv/pytorch/ops.py", line 1513, in implicit_gemm
mask_width, tune_res_cpp = ConvGemmOps.implicit_gemm(
RuntimeError: /home/derek/tools/spconv/build/temp.linux-x86_64-cpython-310/spconv/build/core_cc/src/csrc/sparse/convops/convops/ConvTunerSimple/ConvTunerSimple_tune_and_cache.cc(103)
!all_profile_res.empty() assert faild. can't find suitable algorithm for 0
A few things to note:
- I reinstalled Ubuntu. Everything worked on the previous Ubuntu system (22.04), and I copied the old miniconda folder over to the new system. Could a residual or corrupted file be causing this?
- The error message above was captured while using a wheel built on my own system, yet the last line of the traceback still points to a file in my local build directory rather than to a system folder. That looks very suspicious.
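One way to narrow down both points is to check which files Python would actually load. In the traceback, the Python frames for spconv resolve to `/home/stardust/miniconda3/.../site-packages`, while the failing `.cc` path lies under `/home/derek/tools/spconv/build`. A possible explanation, which is an assumption here and not confirmed by the report, is that the source path is embedded in the compiled extension at build time, so it would show the build directory even when the installed wheel is in use. The sketch below uses only the standard library; `module_origin` is a hypothetical helper name.

```python
import importlib.util
import sys


def module_origin(name):
    """Return the file or directory a top-level module would be loaded
    from, or None if it cannot be found. The module is not imported."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        return None
    if spec is None:
        return None
    if spec.origin:
        return spec.origin
    # Namespace packages have no origin, only search locations.
    locations = spec.submodule_search_locations
    return next(iter(locations), None) if locations else None


if __name__ == "__main__":
    print("interpreter prefix:", sys.prefix)
    for name in ("spconv", "cumm", "torch"):
        print(f"{name}: {module_origin(name)}")
```

If `spconv` or `cumm` resolves to a path outside the active environment's `site-packages`, such as a source checkout installed in development mode, that would suggest the copied environment is still picking up an older build.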