-
Notifications
You must be signed in to change notification settings - Fork 38.6k
ci: detect outbound internet traffic generated while running tests #31349
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers. Code Coverage & BenchmarksFor details see: https://corecheck.dev/bitcoin/bitcoin/pulls/31349. ReviewsSee the guideline for information on the review process.
If your review is incorrectly listed, please copy-paste ConflictsNo conflicts as of last run. |
54a6482 to
3ec89f4
Compare
|
🚧 At least one of the CI tasks failed. HintsTry to run the tests locally, according to the documentation. However, a CI failure may still
Leave a comment here, if you need help tracking down a confusing failure. |
|
Nice. Conecpt ACK! |
|
Concept ACK i'm slightly worried this may generate false positive. As is, this detects traffic on the entire (virtual) machine while running the tests. Are there no other daemons running on the CI instance that could interfere with this? |
3ec89f4 to
3c4b203
Compare
|
@laanwj, Right! And Another source of false positive could be if somebody from the outside initiates communication to the VM to which it responds. E.g. an outsider tries to connect to the VM to which it responds with an outbound packet e.g. TCP RST. At least that should be obvious from the error log, showing the incoming packet first (I just pushed a slight change for that). Maybe also the traffic-from-another-daemon could be obvious - e.g. if there is traffic to |
3c4b203 to
67c6bce
Compare
67c6bce to
1592a7d
Compare
|
🚧 At least one of the CI tasks failed. HintsTry to run the tests locally, according to the documentation. However, a CI failure may still
Leave a comment here, if you need help tracking down a confusing failure. |
Exactly. For all we know, the CI VM is firewalled off sufficiently that this can't happen, but we don't know.
Ah yes, as long as it's only some extra logging, having a manual factor in this is fine. It only becomes critical if network traffic would cause a CI failure. i'm not aware of a straightforward way to "log network traffic of this process and subproceses only". Yes, it could be done with a linux network namespace, but that's a lot of hassle.
Seeing this, it might already be namespaced. Though a process namespace doesn't necessarily mean the network namespace is isolated. |
1592a7d to
071e43f
Compare
|
My intention here is to fail the CI because otherwise the log will be buried in the CI output and nobody will notice it. It follows that if this fails randomly with false positives when one would have to investigate it manually for arbitrary PRs which is highly highly highly undesirable. |
|
Ran this on my CI runner which has 8.8.8.8 configured as DNS server for docker. Edit: My understanding is as follows: The DNS requests normally go to a local DNS resolver which then asks an upstream resolver. The upstream resolver (possibly your ISP) indirectly learns that you are running Bitcoin Core tests, even if there was no direct communication over a non-loopback interface. Edit 2: full tcpdump output here: |
This comment was marked as abuse.
This comment was marked as abuse.
|
Concept ACK. Per https://bitcoin-irc.chaincode.com/bitcoin-core-dev/2024-11-26#1069602: "it turns out the owners of 1.2.3.4, 11.22.33.44 and 8.8.8.8, if they would bother, would know the IP address of every dev who runs the functional tests at home." |
071e43f to
8799018
Compare
|
🚧 At least one of the CI tasks failed. HintsTry to run the tests locally, according to the documentation. However, a CI failure may still
Leave a comment here, if you need help tracking down a confusing failure. |
8799018 to
803ed46
Compare
|
About the false positives - I think it is worth trying this in its current mode where any detected traffic is assumed to have originated from the tests and fails the CI. If this ever fails the CI for another reason (false positive), then it would be easy to turn this into a "report in the logs only but don't fail" by removing the About the DNS traffic - I did |
39f90a4 to
c652deb
Compare
|
|
|
re-ACK c652deb3c16b7edccb741b9b473502092c0c2638 Just addressed @ryanofsky 's comments. |
|
Not sure why I get this failure, but when installing a fresh Fedora VM, then podman-docker inside that, and then running this pull rebased, I got: For reference, that IP is from Though, |
|
Yea, I have seen the same, on one of my Fedora boxes, which would be a blocker, given it breaks running the CI locally. |
c652deb to
6e0f3a4
Compare
|
It is making requests to the DNS server at bitcoin/src/kernel/chainparams.cpp Lines 596 to 598 in e14451a
|
Yes, I understand this, but I don't understand why the CI is green on this pull request here, but it fails locally when using podman. I guess it could be due to running as root inside the VM. Though, the failure seemingly not being reproducible on every run makes it even more odd. My cmd history today (once it passed, once it failed): |
|
I can't reproduce locally. An attempt to resolve |
Are you sure? Above I tried on Fedora podman-docker, and today it also failed on Ubuntu podman-docker: |
|
Yea. Still failing for me with Podman (5.6.2) on my Fedora box. |
|
From the output, I think there may be several unit tests affected, and they should probably be fixed, even if the GHA CI does not catch this issue. Also, I tried on a fresh Ubuntu VM with a fresh user account (not root) and the issue persists. So I think the issue generally uncovers via podman. |
The test calls: `AppInitMain()` -> `StartMapPort()` -> `StartThreadMapPort()` -> `ThreadMapPort()` -> `ProcessPCP()` -> `PCPRequestPortMap()` -> `CreateSock()` and on the returned value from `CreateSock()` it calls the `Connect()` method. Thus, change `BasicTestingSetup::BasicTestingSetup()` to set `-natpmp` to 0. This way `node_init_tests/init_test` or other tests will not do network activity due to `ThreadMapPort()`. Also add a comment about `natpmp=0` in `test/functional/test_framework/util.py`.
6e0f3a4 to
47f4f65
Compare
|
Is the local failure you observe reproducible or sporadic? If it is reproducible maybe you can nail down which test is making the traffic? Previously to find the offending test I bisected the list of tests (used to nail down the traffic from BUILD/bin/test_bitcoin $(BUILD/bin/test_bitcoin --list_content 2>&1 |(IFS="*" ; while read line ; do line_trim="$(sed 's/^[[:space:]]*//' <<<$line)" ; if [ "$line" = "$line_trim" ] ; then s="$line" ; continue ; fi ; echo "--run_test=$s/$line_trim" ; done |sed -n '1,340p'))This can be substituted for: bitcoin/ci/test/03_test_script.sh Lines 178 to 181 in 1af46cf
|
ryanofsky
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code review ACK 47f4f65. Since last review just rebased to avoid conflicts, added comments to commit message and test, added local to some bash variables
Maybe 50%, see my previous comment: #31349 (comment). I guess this makes bisect a bit harder, but it should be possible by running 10 times, or so. |
|
Still fails for me running the CI (via Podman 5.7.1) on my Fedora box (this PR rebased on master): ALL | ✓ Passed | 6980 s (accumulated)
Runtime: 1243 s
+ traffic_monitor_end functional
+ test_name=functional
++ get_interfaces
++ set -o pipefail
++ ifconfig
++ awk -F ':| ' '/^[^[:space:]]/ { if (!match($1, /^lo/)) { print $1 } }'
++ set +o pipefail
+ for ifname in $(get_interfaces)
++ tcpdump_file functional eth0
++ local test_name=functional
++ local interface_name=eth0
++ echo /tmp/tcpdump_functional_eth0
+ f=/tmp/tcpdump_functional_eth0
+ '[' '!' -e /tmp/tcpdump_functional_eth0 ']'
+ chown root:root /tmp/tcpdump_functional_eth0
++ tcpdump -n -r /tmp/tcpdump_functional_eth0 --direction=out tcp or udp
reading from file /tmp/tcpdump_functional_eth0, link-type EN10MB (Ethernet), snapshot length 262144
<snip>
+ echo 'Error: outbound TCP or UDP packets on the non loopback interface generated during functional tests:'
Error: outbound TCP or UDP packets on the non loopback interface generated during functional tests:
+ tcpdump -n -r /tmp/tcpdump_functional_eth0 tcp or udp
reading from file /tmp/tcpdump_functional_eth0, link-type EN10MB (Ethernet), snapshot length 262144
<snip>
+ exit 1
Command '['./ci/test/02_run_container.sh']' returned non-zero exit status 1.
real 43m11.110s
user 0m55.959s
sys 0m45.018s |
|
This is intended to work on CI and it does so well. Should I reintroduce INTERNET_TRAFFIC_EXPECTED to deal with local runs by making it possible to turn these reports into non-fatal errors? |
Being able to run the CI locally is fully supported and a required use case (there are CI jobs which are not run in the main repo). |
|
I tend to agree with @fanquake. Running the CI locally should be easy and supported. We don't want to end up in a place where the CI is basically just a prayer toward Microsoft/GHA to please run the scripts and to please run them correctly. I think the open questions are:
|
Prevent generating outbound traffic on a non-loopback interface during tests.
node_init_tests/init_test[1111:1111::1]:53. This is achieved by runningtcpdumpduring tests and inspecting its output after the tests. Add--cap-add NET_RAWto docker to be able to runtcpdump.Resolves #31339