Desktop crash issue

Hi, we have hit another desktop crash.
We use OpenCV to display camera data and, at the same time, run nvidia-smi about once per second.
The driver appears to hit a deadlock.

We are using version R38.2.1.

This issue is different from the issue below. We have already set the GPU to performance mode so that devfreq does not run.

dmesg.txt (131.6 KB)

[ 3144.146537] INFO: task nvidia-modeset/:1815 blocked for more than 120 seconds.
[ 3144.146565] Tainted: G W O 6.8.12-tegra #1
[ 3144.146579] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3144.153846] task:nvidia-modeset/ state:D stack:0 pid:1815 tgid:1815 ppid:2 flags:0x00000208
[ 3144.153857] Call trace:
[ 3144.153859] __switch_to+0xe0/0x108
[ 3144.153878] __schedule+0x368/0xbe4
[ 3144.153885] schedule+0x34/0xc8
[ 3144.153891] schedule_timeout+0x1a4/0x1b4
[ 3144.153897] __down_common+0x104/0x218
[ 3144.153905] __down+0x18/0x24
[ 3144.153912] down+0x50/0x6c
[ 3144.153915] nvkms_kthread_q_callback+0x90/0x17c [nvidia_modeset]
[ 3144.154014] _main_loop+0x90/0x14c [nvidia_modeset]
[ 3144.154086] kthread+0x110/0x114
[ 3144.154092] ret_from_fork+0x10/0x20
[ 3144.154225] INFO: task nvidia-smi:11295 blocked for more than 120 seconds.
[ 3144.161176] Tainted: G W O 6.8.12-tegra #1
[ 3144.166707] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3144.176082] task:nvidia-smi state:D stack:0 pid:11295 tgid:11278 ppid:11275 flags:0x00000204
[ 3144.176093] Call trace:
[ 3144.176095] __switch_to+0xe0/0x108
[ 3144.176105] __schedule+0x368/0xbe4
[ 3144.176110] schedule+0x34/0xc8
[ 3144.176116] schedule_preempt_disabled+0x24/0x40
[ 3144.176122] rwsem_down_read_slowpath+0x214/0x51c
[ 3144.176125] down_read+0xa0/0xa8
[ 3144.176128] os_acquire_rwlock_read+0x38/0x64 [nvidia]
[ 3144.176671] portSyncRwLockAcquireRead+0x10/0x30 [nvidia]
[ 3144.177216] rmapiLockAcquire+0x29c/0x320 [nvidia]
[ 3144.178004] rmapiPrologue+0x124/0x180 [nvidia]
[ 3144.178388] _rmapiRmControl+0x408/0x6a0 [nvidia]
[ 3144.178844] rmapiControlWithSecInfo+0xa8/0x150 [nvidia]
[ 3144.179203] rmapiControlWithSecInfoTls+0x74/0xe0 [nvidia]
[ 3144.179553] _nv04ControlWithSecInfo.constprop.0+0x80/0xa0 [nvidia]
[ 3144.179997] Nv04ControlWithSecInfo+0x34/0x40 [nvidia]
[ 3144.180330] RmIoctl+0x884/0xbf0 [nvidia]
[ 3144.180687] rm_ioctl+0x64/0x430 [nvidia]
[ 3144.181031] nvidia_unlocked_ioctl+0x664/0x76c [nvidia]
[ 3144.181458] __arm64_sys_ioctl+0xac/0xf0
[ 3144.181467] invoke_syscall+0x48/0x114
[ 3144.181474] el0_svc_common.constprop.0+0xc0/0xe0
[ 3144.181479] do_el0_svc+0x1c/0x28
[ 3144.181484] el0_svc+0x30/0xa8
[ 3144.181488] el0t_64_sync_handler+0x120/0x12c
[ 3144.181492] el0t_64_sync+0x194/0x198
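
To compare runs, it can help to pull the task name, pid, and blocked time out of the hung-task reports above. Below is a minimal user-space sketch, assuming the `INFO: task <comm>:<pid> blocked for more than <N> seconds.` format seen in this dmesg. The helper name `parse_hung_task` is hypothetical and not part of any NVIDIA or kernel tool.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: parse one hung-task report line from dmesg.
 * Returns 1 and fills comm/pid/secs on success, 0 if the line does not match. */
int parse_hung_task(const char *line, char *comm, size_t comm_len,
                    int *pid, long *secs)
{
    const char *start = strstr(line, "INFO: task ");
    const char *end = strstr(line, " blocked for more than ");
    if (start == NULL || end == NULL || end < start)
        return 0;
    start += strlen("INFO: task ");

    /* The pid follows the last ':' before " blocked", because the task
     * name itself may contain ':' (for example kworker names). */
    const char *colon = end;
    while (colon > start && *colon != ':')
        colon--;
    if (*colon != ':')
        return 0;

    size_t len = (size_t)(colon - start);
    if (len == 0 || len >= comm_len)
        return 0;
    memcpy(comm, start, len);
    comm[len] = '\0';

    if (sscanf(colon, ":%d blocked for more than %ld seconds", pid, secs) != 2)
        return 0;
    return 1;
}
```

Feeding each dmesg line through this helper gives a quick list of which tasks are stuck in D state and for how long; lines that are not hung-task reports simply return 0.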

We use cv::imshow() to display the camera feed and run nvidia-smi continuously.
The log above shows the crash; the desktop is completely frozen.

Below is another case that hits the crash. Here we do not run nvidia-smi; we only use OpenCV to display camera data.
[ 485.842404] INFO: task nvidia-modeset/:1825 blocked for more than 120 seconds.
[ 485.842432] Tainted: G W O 6.8.12-tegra #1
[ 485.842444] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 485.849426] task:nvidia-modeset/ state:D stack:0 pid:1825 tgid:1825 ppid:2 flags:0x00000208
[ 485.849435] Call trace:
[ 485.849437] __switch_to+0xe0/0x108
[ 485.849452] __schedule+0x368/0xbe4
[ 485.849459] schedule+0x34/0xc8
[ 485.849465] schedule_timeout+0x1a4/0x1b4
[ 485.849469] __down_common+0x104/0x218
[ 485.849476] __down+0x18/0x24
[ 485.849482] down+0x50/0x6c
[ 485.849485] nvkms_kthread_q_callback+0x90/0x17c [nvidia_modeset]
[ 485.849583] _main_loop+0x90/0x14c [nvidia_modeset]
[ 485.849658] kthread+0x110/0x114
[ 485.849666] ret_from_fork+0x10/0x20
[ 606.674949] INFO: task nvidia-modeset/:1825 blocked for more than 241 seconds.
[ 606.674977] Tainted: G W O 6.8.12-tegra #1
[ 606.674990] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 606.681979] task:nvidia-modeset/ state:D stack:0 pid:1825 tgid:1825 ppid:2 flags:0x00000208
[ 606.681988] Call trace:
[ 606.681990] __switch_to+0xe0/0x108
[ 606.682009] __schedule+0x368/0xbe4
[ 606.682016] schedule+0x34/0xc8
[ 606.682023] schedule_timeout+0x1a4/0x1b4
[ 606.682029] __down_common+0x104/0x218
[ 606.682037] __down+0x18/0x24
[ 606.682044] down+0x50/0x6c
[ 606.682047] nvkms_kthread_q_callback+0x90/0x17c [nvidia_modeset]
[ 606.682147] _main_loop+0x90/0x14c [nvidia_modeset]
[ 606.682224] kthread+0x110/0x114
[ 606.682232] ret_from_fork+0x10/0x20

When the desktop crash happens, Xorg and gnome-shell take 100% of a CPU core.


When the desktop crash happens, this thread enters the D state.
[ 693.291882] task:irq/238-host1x_ state:D stack:0 pid:347 tgid:347 ppid:2 flags:0x00000008
[ 693.291884] Call trace:
[ 693.291885] __switch_to+0xe0/0x110
[ 693.291886] __schedule+0x3dc/0xbd8
[ 693.291888] schedule+0x3c/0x108
[ 693.291889] schedule_timeout+0xa8/0x1d0
[ 693.291891] __wait_for_common+0xe4/0x218
[ 693.291893] wait_for_completion_timeout+0x28/0x40
[ 693.291895] tegra_bpmp_transfer+0x1d0/0x3f8
[ 693.291900] tegra264_mc_icc_set+0xf4/0x1a8
[ 693.291904] apply_constraints+0x74/0xc0
[ 693.291907] icc_set_bw+0xbc/0x2d0
[ 693.291909] vic_devfreq_target+0x70/0x110 [tegra_drm]
[ 693.291924] devfreq_set_target+0x98/0x230
[ 693.291926] devfreq_update_target+0xc8/0xe8
[ 693.291927] update_devfreq+0x1c/0x30
[ 693.291929] vic_actmon_event+0x58/0x88 [tegra_drm]
[ 693.291937] host1x_actmon_handle_interrupt+0xec/0x148 [host1x]
[ 693.291951] host1x_general_isr+0x60/0x68 [host1x]
[ 693.291959] irq_thread_fn+0x34/0xb8
[ 693.291962] irq_thread+0x1a0/0x2a0
[ 693.291963] kthread+0x124/0x130

If we set /sys/class/devfreq/8188050000.vic/governor to performance, the issue still reproduces.


I added some logging to nvkms_kthread_q_callback() and nvkms_ioctl_common().

Xorg takes the lock and does not release it. Then kthread_q_callback requests the same lock, and the hang occurs.
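
The pattern described here (one holder never releases, the next taker blocks forever in an untimed `down()`) can be imitated in user space with a POSIX semaphore. This is only an analogy sketch, not the driver's code: `acquire_with_timeout` is a hypothetical name, and the point is just that a bounded wait turns a silent hang into a reportable timeout. In the kernel, the comparable call for debugging would be `down_timeout()`, which I assume could be swapped in temporarily while instrumenting.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <semaphore.h>
#include <time.h>

/* Hypothetical helper: try to take `sem` for at most `ms` milliseconds.
 * Returns 0 on success, or an errno value (ETIMEDOUT if the current
 * holder never released the semaphore within the window). */
int acquire_with_timeout(sem_t *sem, long ms)
{
    struct timespec ts;

    /* sem_timedwait() takes an absolute CLOCK_REALTIME deadline. */
    if (clock_gettime(CLOCK_REALTIME, &ts) != 0)
        return errno;
    ts.tv_sec += ms / 1000;
    ts.tv_nsec += (ms % 1000) * 1000000L;
    if (ts.tv_nsec >= 1000000000L) {
        ts.tv_sec++;
        ts.tv_nsec -= 1000000000L;
    }

    while (sem_timedwait(sem, &ts) != 0) {
        if (errno == EINTR)
            continue;          /* interrupted by a signal, retry */
        return errno;          /* ETIMEDOUT or another failure */
    }
    return 0;
}
```

With a semaphore initialised to 1, a first call succeeds (the "Xorg" side), and a second call made before any `sem_post()` returns ETIMEDOUT instead of blocking indefinitely, which is the situation the kthread is in.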

[ 74.782799] NVKMS_DEBUG: IOCTL trying to get lock (cmd=0xf, pid=2906, comm=gnome-shell)
[ 74.782809] NVKMS_DEBUG: IOCTL got the lock!
[ 74.782942] NVKMS_DEBUG: IOCTL released the lock.
[ 74.798872] NVKMS_DEBUG: IOCTL trying to get lock (cmd=0xf, pid=2906, comm=gnome-shell)
[ 74.798882] NVKMS_DEBUG: IOCTL got the lock!
[ 74.799004] NVKMS_DEBUG: IOCTL released the lock.
[ 74.815646] NVKMS_DEBUG: IOCTL trying to get lock (cmd=0xf, pid=2906, comm=gnome-shell)
[ 74.815653] NVKMS_DEBUG: IOCTL got the lock!
[ 74.815775] NVKMS_DEBUG: IOCTL released the lock.
[ 74.817778] NVKMS_DEBUG: IOCTL trying to get lock (cmd=0xe, pid=2488, comm=Xorg)
[ 74.817786] NVKMS_DEBUG: IOCTL got the lock!
[ 74.820660] tegra-camrtc-capture-vi tegra-capture-vi: corr_err: discarding frame 0, flags: 0, err_data 512
[ 76.244142] NVKMS_DEBUG: kthread_q_callback called (pid=1739, comm=nvidia-modeset/)
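
For longer captures, the unmatched "got the lock" entry can be found mechanically. Below is a minimal sketch, assuming the NVKMS_DEBUG message formats printed by my added logging above (these are not stock driver messages). The function name `find_unreleased_holder` is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: scan log lines in the NVKMS_DEBUG format shown above.
 * Returns 1 and writes the holder's comm if the last "got the lock" has no
 * matching "released the lock"; returns 0 if the lock was released. */
int find_unreleased_holder(const char *const *lines, size_t n,
                           char *comm_out, size_t comm_len)
{
    char pending[64] = "";  /* comm from the latest "trying to get lock" */
    char holder[64] = "";   /* comm of the task that last got the lock */
    int held = 0;

    for (size_t i = 0; i < n; i++) {
        const char *p = strstr(lines[i], "IOCTL trying to get lock");
        if (p != NULL) {
            const char *c = strstr(p, "comm=");
            if (c != NULL)
                sscanf(c, "comm=%63[^)]", pending); /* up to the ')' */
        } else if (strstr(lines[i], "IOCTL got the lock") != NULL) {
            strcpy(holder, pending);
            held = 1;
        } else if (strstr(lines[i], "IOCTL released the lock") != NULL) {
            held = 0;
        }
    }

    if (held)
        snprintf(comm_out, comm_len, "%s", holder);
    return held;
}
```

Run over the excerpt above, this reports Xorg (cmd=0xe) as the task still holding the lock when kthread_q_callback is called. A later "trying to get lock" from another task does not change the reported holder, because only a "got the lock" line updates it.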

After adding more logging, I found that the program gets stuck in IsChannelMethodPending(). We don't have the source code for it, so please help check it.

[  277.312703] NVKMS_DEBUG: IOCTL trying to get lock (cmd=0xf, pid=2752, comm=gnome-shell)
[  277.312713] NVKMS_DEBUG: IOCTL got the lock!
[  277.312834] NVKMS_DEBUG: IOCTL released the lock.
[  277.329335] NVKMS_DEBUG: IOCTL trying to get lock (cmd=0xf, pid=2752, comm=gnome-shell)
[  277.329345] NVKMS_DEBUG: IOCTL got the lock!
[  277.329472] NVKMS_DEBUG: IOCTL released the lock.
[  277.337710] NVKMS_DEBUG: IOCTL trying to get lock (cmd=0x32, pid=2378, comm=Xorg)
[  277.337734] NVKMS_DEBUG: IOCTL got the lock!
[  277.337770] NVKMS_DEBUG: IOCTL released the lock.
[  277.340419] NVKMS_DEBUG: IOCTL trying to get lock (cmd=0xe, pid=2378, comm=Xorg)
[  277.340427] NVKMS_DEBUG: IOCTL got the lock!
[  277.340443] nvidia-modeset: IdleBaseChannel enter, pOpenDev=000000001cd3df01

[  277.340450] nvidia-modeset: IdleBaseChannelAll enter type:0

[  277.340456] nvidia-modeset: IdleBaseChannelAll loop iteration1
[  277.340465] nvidia-modeset: nvIdleMainLayerChannelCheckIdleOneApiHead enter

[  278.993780] NVKMS_DEBUG: kthread_q_callback called (pid=1776, comm=nvidia-modeset/)
[  448.978815] INFO: task nvidia-modeset/:1776 blocked for more than 147 seconds.
[  448.978843]       Tainted: G        W  O       6.8.12-tegra #2
[  448.978856] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  448.985930] task:nvidia-modeset/ state:D stack:0     pid:1776  tgid:1776  ppid:2      flags:0x00000208
[  448.985939] Call trace:
[  448.985942]  __switch_to+0xe0/0x110
[  448.985956]  __schedule+0x3dc/0xbd8
[  448.985962]  schedule+0x3c/0x108
[  448.985966]  schedule_timeout+0x1b4/0x1d0
[  448.985977]  __down_common+0x138/0x260
[  448.985983]  __down+0x20/0x38
[  448.985988]  down+0x6c/0x90
[  448.985993]  nvkms_kthread_q_callback+0xf8/0x1f0 [nvidia_modeset]
[  448.986099]  _main_loop+0xa4/0x168 [nvidia_modeset]
[  448.986179]  kthread+0x124/0x130
[  448.986188]  ret_from_fork+0x10/0x20

NvBool nvIdleMainLayerChannelCheckIdleOneApiHead(NVDispEvoPtr pDispEvo,
                                                 NvU32 apiHead)
{
    NVDevEvoPtr pDevEvo = pDispEvo->pDevEvo;
    const NVDispApiHeadStateEvoRec *pApiHeadState =
        &pDispEvo->apiHeadState[apiHead];
    NvU32 head;

    nvEvoLog(EVO_LOG_INFO, "nvIdleMainLayerChannelCheckIdleOneApiHead enter\n");

    FOR_EACH_EVO_HW_HEAD_IN_MASK(pApiHeadState->hwHeadsMask, head) {
        NVEvoChannelPtr pMainLayerChannel =
            pDevEvo->head[head].layer[NVKMS_MAIN_LAYER];
        NvBool isMethodPending = FALSE;
        NvBool ret;

        ret = pDevEvo->hal->IsChannelMethodPending(pDevEvo, pMainLayerChannel,
            pDispEvo->displayOwner, &isMethodPending);

        if (ret && isMethodPending) {
            nvEvoLog(EVO_LOG_INFO, "nvIdleMainLayerChannelCheckIdleOneApiHead exit1\n");
            return FALSE;
        }
    }

    nvEvoLog(EVO_LOG_INFO, "nvIdleMainLayerChannelCheckIdleOneApiHead exit2\n");
    return TRUE;
}

After adding more logging, I found that it gets stuck in the function below.

src/nvidia-modeset/src/nvkms-evo3.c:7029:NvBool nvEvoIsChannelMethodPendingC3

[  248.336164] NVKMS_DEBUG: IOCTL trying to get lock (cmd=0xb, pid=2665, comm=InputThread)
[  248.336168] NVKMS_DEBUG: IOCTL got the lock!
[  248.336174] NVKMS_DEBUG: IOCTL released the lock.
[  248.344496] NVKMS_DEBUG: IOCTL trying to get lock (cmd=0xb, pid=2665, comm=InputThread)
[  248.344502] NVKMS_DEBUG: IOCTL got the lock!
[  248.344510] NVKMS_DEBUG: IOCTL released the lock.
[  248.345082] NVKMS_DEBUG: IOCTL trying to get lock (cmd=0xe, pid=2558, comm=Xorg)
[  248.345091] NVKMS_DEBUG: IOCTL got the lock!
[  248.345098] nvidia-modeset: IdleBaseChannel enter, pOpenDev=0000000060803cbe

[  248.345102] nvidia-modeset: IdleBaseChannelAll enter type:0

[  248.345104] nvidia-modeset: IdleBaseChannelAll loop iteration1
[  248.345108] nvidia-modeset: nvIdleMainLayerChannelCheckIdleOneApiHead enter

[  248.345111] nvidia-modeset: nvEvoIsChannelMethodPendingC3 enter

[  248.352074] NVKMS_DEBUG: IOCTL trying to get lock (cmd=0xb, pid=2665, comm=InputThread)
[  250.328043] NVKMS_DEBUG: kthread_q_callback called (pid=1824, comm=nvidia-modeset/)
[  485.849461] INFO: task nvidia-modeset/:1824 blocked for more than 120 seconds.
[  485.849487]       Tainted: G        W  O       6.8.12-tegra #1
[  485.849500] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  485.856481] task:nvidia-modeset/ state:D stack:0     pid:1824  tgid:1824  ppid:2      flags:0x00000208
[  485.856490] Call trace:
[  485.856493]  __switch_to+0xe0/0x108
[  485.856512]  __schedule+0x368/0xbe4
[  485.856520]  schedule+0x34/0xc8
[  485.856526]  schedule_timeout+0x1a4/0x1b4
[  485.856531]  __down_common+0x104/0x218
[  485.856539]  __down+0x18/0x24
[  485.856546]  down+0x50/0x6c
[  485.856549]  nvkms_kthread_q_callback+0xa8/0x198 [nvidia_modeset]
[  485.856654]  _main_loop+0x90/0x14c [nvidia_modeset]
[  485.856730]  kthread+0x110/0x114
[  485.856738]  ret_from_fork+0x10/0x20
[  485.856801] INFO: task nvidia-smi:4971 blocked for more than 120 seconds.
[  485.863518]       Tainted: G        W  O       6.8.12-tegra #1
[  485.869050] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  485.869053] task:nvidia-smi      state:D stack:0     pid:4971  tgid:4948  ppid:4944   flags:0x00000204
[  485.869059] Call trace:
[  485.869061]  __switch_to+0xe0/0x108
[  485.877132]  __schedule+0x368/0xbe4
[  485.877135]  schedule+0x34/0xc8
[  485.877138]  schedule_preempt_disabled+0x24/0x40
[  485.877141]  rwsem_down_read_slowpath+0x214/0x51c
[  485.877143]  down_read+0xa0/0xa8
[  485.877144]  os_acquire_rwlock_read+0x38/0x64 [nvidia]
[  485.877380]  portSyncRwLockAcquireRead+0x10/0x30 [nvidia]
[  485.877605]  rmapiLockAcquire+0x29c/0x320 [nvidia]
[  485.877795]  rmapiPrologue+0x124/0x180 [nvidia]
[  485.877976]  _rmapiRmControl+0x408/0x6a0 [nvidia]
[  485.878153]  rmapiControlWithSecInfo+0xa8/0x150 [nvidia]
[  485.878324]  rmapiControlWithSecInfoTls+0x74/0xe0 [nvidia]
[  485.878500]  _nv04ControlWithSecInfo.constprop.0+0x80/0xa0 [nvidia]
[  485.878714]  Nv04ControlWithSecInfo+0x34/0x40 [nvidia]
[  485.878883]  RmIoctl+0x884/0xbf0 [nvidia]
[  485.879063]  rm_ioctl+0x64/0x430 [nvidia]
[  485.879239]  nvidia_unlocked_ioctl+0x664/0x76c [nvidia]
[  485.879458]  __arm64_sys_ioctl+0xac/0xf0
[  485.879468]  invoke_syscall+0x48/0x114
[  485.879476]  el0_svc_common.constprop.0+0xc0/0xe0
[  485.879483]  do_el0_svc+0x1c/0x28
[  485.879489]  el0_svc+0x30/0xa8
[  485.879496]  el0t_64_sync_handler+0x120/0x12c
[  485.879502]  el0t_64_sync+0x194/0x198