Skip to content

Conversation

@hodlinator
Copy link
Contributor

@hodlinator hodlinator commented Feb 21, 2025

  • Log HTTP request pointers to be able to discover which one is stuck.
  • Instead of hanging indefinitely, use timeouts and abort when HTTP connections take unreasonably long to close.
  • Ensure we always free libevent HTTP.

Should help debug #31894.

@DrahtBot
Copy link
Contributor

DrahtBot commented Feb 21, 2025

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks

For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/31929.

Reviews

See the guideline for information on the review process.
A summary of reviews will appear here.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #32061 (Replace libevent with our own HTTP and socket-handling implementation by pinheadmz)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

@fanquake
Copy link
Member

Assertion failed: (evb), function WriteReply, file httpserver.cpp, line 682.
Error processing input "/Users/runner/work/bitcoin/bitcoin/ci/scratch/qa-assets/fuzz_corpora/http_request/83149a7e3abd937c7bc67fa55b9a4fc9338e3d8c"

⚠️ Failure generated from target with exit code 1: ['/Users/runner/work/bitcoin/bitcoin/ci/scratch/build-aarch64-apple-darwin23.6.0/src/test/fuzz/fuzz', PosixPath('/Users/runner/work/bitcoin/bitcoin/ci/scratch/qa-assets/fuzz_corpora/http_request')]

@DrahtBot
Copy link
Contributor

🚧 At least one of the CI tasks failed.
Debug: https://github.com/bitcoin/bitcoin/runs/37612857743

Hints

Try to run the tests locally, according to the documentation. However, a CI failure may still
happen due to a number of reasons, for example:

  • Possibly due to a silent merge conflict (the changes in this pull request being
    incompatible with the current code in the target branch). If so, make sure to rebase on the latest
    commit of the target branch.

  • A sanitizer issue, which can only be found by compiling with the sanitizer and running the
    affected test.

  • An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

@hodlinator hodlinator force-pushed the 2025/02/stop_http_robust branch from ef290c0 to 24558c2 Compare February 21, 2025 23:32
Copy link
Contributor Author

@hodlinator hodlinator left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fuzz-failure was me underestimating a default-initialized argument. Fixed in 2nd push.

@laanwj
Copy link
Member

laanwj commented Mar 18, 2025

From what i remember this was extrememly hard to get right with libevent (or at least our use of it), without leaking anything or running into race conditions. There have been a few tries to fix the behavior over the years, but they all were different compromises, adding ever more compexity, and often introducing new problems.

As it seems to be where things are heading anyway (#32061) maybe it's better to focus on replacing libevent and its webserver framework wholesale.

@hodlinator
Copy link
Contributor Author

hodlinator commented Mar 18, 2025

Thank you for providing your perspective @laanwj!

In working on this PR, I've also gained an appreciation for getting rid of libevent. However, I'm not sure when that will happen, maybe not until post-30.0. In the meantime we are seeing issues on CI such as #31894 (analysis).

What this PR is doing is adding our own request ids to track which HTTP request never finishes. When we have stuck HTTP requests, it also makes us abort the process with a clear error after 10min30s instead of waiting for the test framework to time out after 40min. On top of that it also fixes an intermittent leak, right after we've joined on the thread that is the only one to sometimes perform deletion.

Have come across #26742 and #19420 while working on this, so have seen some of the struggle.

Don't have an overall grasp on how frequent issues like 31894 are on CI (or in the wild), it's fair that that frequency should influence the priority of this compared to other work.

Could aid debugging when an HTTP request gets stuck at shutdown.
@hodlinator hodlinator force-pushed the 2025/02/stop_http_robust branch from 24558c2 to d8c8cc1 Compare October 14, 2025 20:37
Could happen when libevent HTTP connection doesn't complete for some reason.

Output early warning to give clue about what's up.

Prior check+log message before even starting to wait was a bit too eager to hint at something being wrong.
There is a race between ThreadHTTP exiting and our queuing of the freeing-event (event_base_once).

Additional free was useful in a majority of runs - 709/962.

Memory leaks during shutdown are not a big issue (unless growing unbounded) as the OS will clean up after the process, but still nice to tidy up.

Can be reproduced by on this commit by running something like:

```shell
#!/usr/bin/env bash

mv ~/.bitcoin/debug.log ~/.bitcoin/debug.log.old

# Allows overriding total through `$ ./repeat.sh <TOTAL>`
TOTAL=${1:-100}

for ((i=1; i<=$TOTAL; i++)); do
	echo "Repetition $i/$TOTAL"
	./build/bin/bitcoind -daemonwait -noconnect -debug=libevent -debug=http -debug=rpc
	./build/bin/bitcoin-cli stop
	if [ $? -ne 0 ]; then
		echo "Exit code $?"
		exit
	fi

	while [ -f "$HOME/.bitcoin/bitcoind.pid" ]; do
		echo "PID file still exists"
		sleep 0.1
	done
done

echo "Leak averted in $(grep "Freeing eventHTTP-event" ~/.bitcoin/debug.log -c)/$(grep "Bitcoin Core version" ~/.bitcoin/debug.log -c) cases."
```
@hodlinator hodlinator force-pushed the 2025/02/stop_http_robust branch from d8c8cc1 to aa11ee5 Compare October 14, 2025 20:49
@hodlinator
Copy link
Contributor Author

Spent some time revisiting and polishing this since 24558c2cf18210f46d6e2fadf0c5c5912f4b8e10.

  • Rebased on latest master.
  • Dropped the comment changes from 1350087f9a as verifying them is added work and I'm not sure they were 100% correct.
  • Dropped the introduction of our own request ids in 24558c2cf1.
  • Merged the the remaining log message on HTTP connection close from 1350087f9a + now log request pointers instead of ids in 028301e, "http: Include request ptr in initial message and log close + reply". (+moved it first in the PR).
  • Combined HTTPRequestTrackers CountActiveConnections() and WaitUntilEmpty().
  • Added reproducer-script to commit message of memory leak fix aa11ee5.

@hodlinator
Copy link
Contributor Author

Was worried that libevent would clean up the memory leak I thought was fixed by aa11ee5 later during shutdown, so verified that that libevent does no such thing using Valgrind: master...hodlinator:bitcoin:2025/02/stop_http_robust.valgrind

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants