Skip to content

Use TAILQ_FOREACH_SAFE whenever we do TAILQ_REMOVE during traversal#11

Closed
pbui wants to merge 1 commit intoNetworkConfiguration:masterfrom
pbui:tail-foreach-safe
Closed

Use TAILQ_FOREACH_SAFE whenever we do TAILQ_REMOVE during traversal#11
pbui wants to merge 1 commit intoNetworkConfiguration:masterfrom
pbui:tail-foreach-safe

Conversation

@pbui
Copy link
Contributor

@pbui pbui commented Oct 29, 2019

We must use TAILQ_FOREACH_SAFE instead of TAILQ_FOREACH if we call TAILQ_REMOVE on the current item and free it.

Failure to do so can lead to bad behavior such as infinite loops.

A few users of Void Linux are currently experiencing high CPU usage due to this bug in 8.1.1: void-linux/void-packages#15858

https://www.reddit.com/r/voidlinux/comments/donf2w/dhcpcd_consuming_100cpu/

It is difficult to reliably trigger this bug, but when it does occur, a backtrace showed that the process was stuck in ipv4ll_drop which calls ipv4_deladdr, which in turn had the unsafe traversal where it calls TAILQ_REMOVE inside a TAILQ_FOREACH rather than TAILQ_FOREACH_SAFE.

Further auditing of the code should be performed to ensure the proper loop traversal macro is used to avoid such bugs (I fixed three instances where I believe TAILQ_FOREACH_SAFE should be used instead of TAILQ_FOREACH, but there may be others).

We must use TAILQ_FOREACH_SAFE instead of TAILQ_FOREACH if we call
TAILQ_REMOVE on the current item and free it.

Failure to do so can lead to bad behavior such as infinite loops.
Copy link
Member

@rsmarples rsmarples left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes are not necessary.


/* Remove existing timeout if present. */
TAILQ_FOREACH(t, &eloop->timeouts, next) {
TAILQ_FOREACH_SAFE(t, &eloop->timeouts, next, tt) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not needed because we break from the loop immediately after removing the matching item.


state = IPV4_STATE(addr->iface);
TAILQ_FOREACH(ap, &state->addrs, next) {
TAILQ_FOREACH_SAFE(ap, &state->addrs, next, ian) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not needed because we break from the loop after removing the matching address.


state = IPV6_STATE(ia->iface);
TAILQ_FOREACH(ap, &state->addrs, next) {
TAILQ_FOREACH_SAFE(ap, &state->addrs, next, ian) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not needed because we break from the loop after removing the matching address.

@rsmarples
Copy link
Member

I suspect the issue you are seeing is already fixed by commits 09dc4d0 and f7e2c24.
If you agree, please close this PR.

@pbui
Copy link
Contributor Author

pbui commented Oct 29, 2019

I suspect the issue you are seeing is already fixed by commits 09dc4d0 and f7e2c24.
If you agree, please close this PR.

Those commits do not resolve the high CPU usage problem. As you can see in https://github.com/void-linux/void-packages/pull/15858/files, I included those commits in what I thought should fix the bug, but it had no effect as I was still able to experience the high CPU after switching networks.

@pbui
Copy link
Contributor Author

pbui commented Oct 29, 2019

I actually managed to trigger the bug with these changes... so this is clearly not the fix. However, I was able to produce a backtrace:

#0  ipv4_iffindaddr (ifp=ifp@entry=0x560f925376b0, addr=addr@entry=0x560f9251a948, mask=mask@entry=0x0)
    at ipv4.c:130
#1  0x0000560f9110d21b in arp_announceaddr (ctx=<optimized out>, ia=ia@entry=0x560f9251a948)
    at arp.c:542
#2  0x0000560f9110b790 in ipv4_applyaddr (arg=arg@entry=0x560f92537190) at ipv4.c:747
#3  0x0000560f911060a2 in dhcp_drop (ifp=0x560f92537190, reason=0x560f911206d7 "EXPIRE") at dhcp.c:2713
#4  0x0000560f910f0030 in dhcpcd_drop (ifp=0x560f92537190, stop=<optimized out>) at dhcpcd.c:412
#5  0x0000560f91102f89 in link_netlink (arg=0x7fffe5308700, nlm=0x7fffe5304620, ctx=0x7fffe5308700)
    at if-linux.c:895
#6  link_netlink (ctx=ctx@entry=0x7fffe5308700, arg=arg@entry=0x0, nlm=nlm@entry=0x7fffe5304620)
    at if-linux.c:798
#7  0x0000560f9110139a in get_netlink (ctx=0x7fffe5308700, iov=iov@entry=0x7fffe5304610, 
    arg=arg@entry=0x0, fd=8, flags=flags@entry=64, 
    callback=callback@entry=0x560f91102b00 <link_netlink>) at if-linux.c:476
#8  0x0000560f91101c99 in if_handlelink (ctx=<optimized out>) at if-linux.c:910
#9  0x0000560f910f2139 in dhcpcd_handlelink (arg=0x7fffe5308700) at dhcpcd.c:1089
#10 0x0000560f910f3bbb in eloop_start (eloop=0x560f92535c70, signals=0x7fffe5308810) at eloop.c:997
#11 0x0000560f910ee4c9 in main (argc=<optimized out>, argv=<optimized out>) at dhcpcd.c:2110

It looks like it is stuck in this loop in ipv4_iffindaddr:

                TAILQ_FOREACH(ap, &state->addrs, next) {                                                
                        if ((addr == NULL || ap->addr.s_addr == addr->s_addr) &&
                            (mask == NULL || ap->mask.s_addr == mask->s_addr))
                                return ap; 
                }

I can close this PR since it does not address the issue, but how should I proceed on the bug itself?

@pbui
Copy link
Contributor Author

pbui commented Oct 29, 2019

So what appears to happen is that when I transition from wireless to wireless + ethernet (or wireless + ethernet to just wireless), I get stuck in ifv4_iffindaddr. As you can see from the annotated backtrace below (paired with the output of gdb above), ipv4_iffindaddr gets stuck in the loop because:

  1. state->addrs appears to be circular (tqh_first and tqh_last point to the same object).

  2. state->addrs contains an address (ap->addr.s_addr) related to eth0 even though the addr is related to wlan0.

Because of this, a match is never found and it loops forever...

But there are other strange properties. For instance, link_netlink, dhcpcd_drop, and dhcp_drop have an ifp argument of eth0 but when we look up the lease->addr.s_addr of eth0, we get an IP address (10.36.27.71) associated with wlan0 (originally eth0 was 129.74.161.60).

Moreover, in arp_announceaddr even though we are reacting to changes in eth0, we call ipv4_iffindaddr on wlan0.

I'm not sure what to make of this or how to fix it... but I hope this helps in tracking down this bug.

Here is some more context (I printed out select values for variables) on that backtrace...

ipv4_iffindaddr:

    ifp->name		    wlan0

    ap->addr.s_addr	    1192961034	    129.74.161.60   # OLD eth0	

    addr.s_addr		    1017203329	    10.36.27.71	    #  CURRENT wlan0

    state->addrs	    {tqh_first = 0x560f9251af00, tqh_last = 0x560f9251af00}

# Why does IPV4_STATE return addresses associated with eth0 instead of wlan0?

arp_announceaddr:

    ifp->name		    wlan0
    ia->s_addr		    1017203329	    10.36.27.71

# Should we announce on wlan0 even though eth0 is the device in question?

ipv4_applyaddr:

    ifp->name		    eth0
    (&lease->addr).s_addr   1017203329	    10.36.27.71

# Why does eth0 have lease address of 10.36.27.71?

dhcp_drop:

    ifp->name		    eth0

dhcpd_drop:

    ifp->name		    eth0

link_netlink:

    ifn			    eth0

@pbui
Copy link
Contributor Author

pbui commented Oct 29, 2019

Also, here are the logs from the process:

wlan0: renewing lease of 10.36.27.71
wlan0: sending REQUEST (xid 0xf9756657), next in 3.8 seconds
wlan0: acknowledged 10.36.27.71 from 66.205.160.60
wlan0: leased 10.36.27.71 for 7200 seconds
wlan0: renew in 3600 seconds, rebind in 6300 seconds
wlan0: writing lease `/var/db/dhcpcd/wlan0-eduroam.lease'
wlan0: adding IP address 10.36.27.71/21 broadcast 10.36.31.255
wlan0: ARP announcing 10.36.27.71 (1 of 2), next in 2.0 seconds
wlan0: executing `/usr/libexec/dhcpcd-run-hooks' RENEW
wlan0: ARP announcing 10.36.27.71 (2 of 2)
eth0: carrier lost
eth0: executing `/usr/libexec/dhcpcd-run-hooks' NOCARRIER
eth0: deleting IP address 129.74.161.60/23
eth0: deleting route to 129.74.160.0/23
eth0: deleting default route via 129.74.160.1

After this, the process hangs with the infinite loop.

@yorickvP yorickvP mentioned this pull request Nov 1, 2019
10 tasks
@rsmarples
Copy link
Member

This should be fixed in 73ac184.
Please verify

@pbui
Copy link
Contributor Author

pbui commented Nov 7, 2019

Looks like that fixed it... Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants