skynet uses 100% CPU, but all services respond normally #644

@jxfzlmb

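This is hard to reproduce, and I only have one live instance of it. The server had been running for about a week when I noticed that its CPU was saturated.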

PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
27510 game      20   0  252912  27224   1484 R 99.9  2.7   4941:45 skynet

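However, all services are healthy: there is no congestion, there are very few tasks, and every response is correct. So this should not be an infinite loop or a circular call problem.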

stat
:00000004       cpu:0.001394    message:29      mqlen:0 task:0
:00000006       cpu:0.000784    message:11      mqlen:0 task:0
:00000007       cpu:0.001686    message:11      mqlen:0 task:0
:00000008       cpu:0.014091    message:29      mqlen:0 task:0
:00000009       cpu:0.000965    message:12      mqlen:0 task:1
:0000000a       cpu:0.012387    message:78      mqlen:0 task:1
:0000000b       cpu:0.041549    message:21      mqlen:0 task:1
:0000000c       cpu:0.760351    message:12507   mqlen:0 task:2
:0000000d       cpu:0.012948    message:12      mqlen:0 task:0
:0000000e       cpu:2.329448    message:39209   mqlen:0 task:4
:0000000f       cpu:0.08777     message:1539    mqlen:0 task:0
:00000010       cpu:0.016192    message:519     mqlen:0 task:0
:00000011       cpu:0.158207    message:3214    mqlen:0 task:0
:00000012       cpu:1.905713    message:17686   mqlen:0 task:6
:000002d3       cpu:0.040202    message:424     mqlen:0 task:1
:000002d5       cpu:0.031169    message:376     mqlen:0 task:1
:000002d6       cpu:0.046675    message:390     mqlen:0 task:1
:000002d7       cpu:0.012635    message:281     mqlen:0 task:0
:000002dd       cpu:0.007929    message:54      mqlen:0 task:1
<CMD OK>

I also noticed that most of the CPU time is spent in system (kernel) space, so it does not look like a Lua logic problem.
%Cpu(s): 20.5 us, 79.1 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.3 st

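Later I found that the thread consuming the CPU is not a worker thread; it appears to be the socket thread. I took a backtrace and saw it stopped here.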

#0  0x00007fcf9e6a0c53 in __atomic_preadv_replacement (fd=-1657083256, vector=0x7fcf9bfa5ccc, count=-1678091056, 
    offset=140529692707923) at ../sysdeps/posix/preadv.c:75
#1  0x000000000040c8fb in skynet_socket_poll () at skynet-src/skynet_socket.c:79
#2  0x000000000040b5d3 in thread_socket (p=0x7fcf9e21d280) at skynet-src/skynet_start.c:68
#3  0x00007fcf9f46a184 in start_thread (arg=0x7fcf9bfa6700) at pthread_create.c:312
#4  0x00007fcf9e6a937d in __ecvt_r (value=9.532824124368238e-130, ndigit=0, decpt=0x0, sign=0x0, 
    buf=0x7fcf9bfa69c0 "\300yz\234\317\177", len=140529651836672) at efgcvt_r.c:218
#5  0x0000000000000000 in ?? ()
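
For anyone wanting to repeat the per-thread check, here is a minimal sketch (not what was actually run for this report) that reads `/proc/<pid>/task/<tid>/stat` on Linux and lists each thread's user and system CPU time. The helper names `parse_stat_line` and `thread_cpu_times` are made up for illustration. The field positions (`utime` as field 14, `stime` as field 15) follow the proc(5) man page.

```python
import os
import sys


def parse_stat_line(line):
    """Return (comm, utime_ticks, stime_ticks) from one /proc/.../stat line.

    The command name is wrapped in parentheses and may contain spaces,
    so split on the last ')' instead of splitting the whole line.
    """
    lpar = line.index("(")
    rpar = line.rindex(")")
    comm = line[lpar + 1:rpar]
    rest = line[rpar + 1:].split()
    # rest[0] is field 3 (state), so field 14 (utime) is rest[11]
    # and field 15 (stime) is rest[12].
    return comm, int(rest[11]), int(rest[12])


def thread_cpu_times(pid):
    """List (tid, comm, user_seconds, system_seconds), busiest thread first."""
    ticks_per_second = os.sysconf("SC_CLK_TCK")
    task_dir = f"/proc/{pid}/task"
    rows = []
    for tid in os.listdir(task_dir):
        try:
            with open(f"{task_dir}/{tid}/stat") as f:
                comm, utime, stime = parse_stat_line(f.read())
        except OSError:
            continue  # the thread exited while we were scanning
        rows.append((int(tid), comm,
                     utime / ticks_per_second, stime / ticks_per_second))
    rows.sort(key=lambda r: r[2] + r[3], reverse=True)
    return rows


if __name__ == "__main__" and len(sys.argv) == 2 and sys.argv[1].isdigit():
    for tid, comm, user_s, sys_s in thread_cpu_times(int(sys.argv[1])):
        print(f"{tid:>8} {comm:<16} user={user_s:10.2f}s sys={sys_s:10.2f}s")
```

Run it with the process ID from `top` as the only argument (27510 in the output above). A thread whose system time dwarfs its user time would match the high `sy` figure reported by `top`, and its thread ID can then be matched against the thread numbers gdb prints.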

But the server's network is fine and every feature works normally; without looking at the CPU usage, nobody would notice this problem.

What could be causing this? Where should I start investigating?
