ConvTunerSimple_tune_and_cache.cc -> can't find suitable algorithm for 0 #563

@callzhang

Ubuntu 20.04
Python: 3.10
CUDA: 12.0
GPU: RTX 4090
Torch: 1.13 + cuda 11.7
Nvidia-driver: 525.85.12
Using: fp16 mixed precision (the error does not occur with fp32)

I have tried various methods:

  • install spconv_cu117 and cumm_cu117
  • install spconv_cu120 and cumm_cu120
  • build spconv and cumm
  • build with JIT: spconv_cu117 and cumm_cu117 / spconv_cu120 and cumm_cu120
  • build wheel: spconv_cu117 and cumm_cu117 / spconv_cu120 and cumm_cu120

but all of them end with this error:

Traceback (most recent call last):
  File "/home/derek/2DPASS/main.py", line 177, in <module>
    trainer.fit(my_model, train_dataset_loader, val_dataset_loader)
  File "/home/stardust/miniconda3/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
    call._call_and_handle_interrupt(
  File "/home/stardust/miniconda3/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/stardust/miniconda3/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/home/stardust/miniconda3/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1103, in _run
    results = self._run_stage()
  File "/home/stardust/miniconda3/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1182, in _run_stage
    self._run_train()
  File "/home/stardust/miniconda3/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1195, in _run_train
    self._run_sanity_check()
  File "/home/stardust/miniconda3/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1267, in _run_sanity_check
    val_loop.run()
  File "/home/stardust/miniconda3/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/stardust/miniconda3/lib/python3.10/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 152, in advance
    dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
  File "/home/stardust/miniconda3/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/stardust/miniconda3/lib/python3.10/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 137, in advance
    output = self._evaluation_step(**kwargs)
  File "/home/stardust/miniconda3/lib/python3.10/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 234, in _evaluation_step
    output = self.trainer._call_strategy_hook(hook_name, *kwargs.values())
  File "/home/stardust/miniconda3/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1485, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/home/stardust/miniconda3/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 390, in validation_step
    return self.model.validation_step(*args, **kwargs)
  File "/home/derek/2DPASS/network/base_model.py", line 183, in validation_step
    data_dict = self.forward(data_dict)
  File "/home/derek/2DPASS/network/arch_2dpass.py", line 176, in forward
    data_dict = self.model_3d(data_dict)
  File "/home/stardust/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/derek/2DPASS/network/spvcnn.py", line 175, in forward
    enc_feats.append(self.spv_enc[i](data_dict)) # found spv_env[4] produce nan
  File "/home/stardust/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/derek/2DPASS/network/spvcnn.py", line 89, in forward
    v_fea = self.v_enc(data_dict['sparse_tensor'])
  File "/home/stardust/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/stardust/miniconda3/lib/python3.10/site-packages/spconv/pytorch/modules.py", line 138, in forward
    input = module(input)
  File "/home/stardust/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/derek/2DPASS/network/basic_block.py", line 35, in forward
    output = self.layers(x)
  File "/home/stardust/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/stardust/miniconda3/lib/python3.10/site-packages/spconv/pytorch/modules.py", line 138, in forward
    input = module(input)
  File "/home/stardust/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/stardust/miniconda3/lib/python3.10/site-packages/spconv/pytorch/conv.py", line 741, in forward
    return self._conv_forward(self.training, input, self.weight, self.bias, add_input,
  File "/home/stardust/miniconda3/lib/python3.10/site-packages/spconv/pytorch/conv.py", line 477, in _conv_forward
    out_features, _, _ = ops.implicit_gemm(
  File "/home/stardust/miniconda3/lib/python3.10/site-packages/spconv/pytorch/ops.py", line 1513, in implicit_gemm
    mask_width, tune_res_cpp = ConvGemmOps.implicit_gemm(
RuntimeError: /home/derek/tools/spconv/build/temp.linux-x86_64-cpython-310/spconv/build/core_cc/src/csrc/sparse/convops/convops/ConvTunerSimple/ConvTunerSimple_tune_and_cache.cc(103)
!all_profile_res.empty() assert faild. can't find suitable algorithm for 0

A few things to note:

  1. I recently reinstalled Ubuntu. The same setup was working on the previous system (Ubuntu 22.04), and I copied the old miniconda folder over to the new system. Could a leftover or corrupted file be causing this?
  2. The error message above was captured with the wheel built on my own machine, yet the last line still points to a file in the local build directory rather than to a system folder. That looks very suspicious.
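
One way to check the first point is to list every spconv/cumm distribution the interpreter can see and where `spconv` and `cumm` would be imported from. The sketch below is only an assumption about how to inspect the environment: it uses the Python standard library alone, the helper names `find_distributions` and `module_origin` are made up for illustration, and it does not touch any spconv API. As for the second point, the build path in the assertion is probably just the source location recorded at compile time, so on its own it may not indicate a stale install.

```python
import importlib.util
from importlib import metadata


def find_distributions(prefixes=("spconv", "cumm")):
    """Return sorted (name, version) pairs of installed distributions whose
    name starts with one of the prefixes (e.g. spconv-cu117, cumm-cu120)."""
    found = set()
    for dist in metadata.distributions():
        try:
            name = (dist.metadata.get("Name") or "").lower()
        except Exception:
            # Skip entries with unreadable metadata, which can happen
            # in an environment that was copied between systems.
            continue
        if name.startswith(tuple(prefixes)):
            found.add((name, dist.version))
    return sorted(found)


def module_origin(module_name):
    """Return the file a top-level module would load from, or None if absent.
    The module itself is not imported."""
    spec = importlib.util.find_spec(module_name)
    return spec.origin if spec is not None else None


if __name__ == "__main__":
    dists = find_distributions()
    for name, version in dists:
        print(f"{name}=={version}")
    # One spconv build plus one cumm build is the expected maximum.
    if len(dists) > 2:
        print("warning: more than one spconv/cumm build seems to be installed")
    for mod in ("spconv", "cumm"):
        print(mod, "->", module_origin(mod))
```

If the output shows, for example, both a cu117 and a cu120 build, or an origin outside the active environment's site-packages, removing all of them and reinstalling a single matching pair would be a reasonable next step.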
