Skip to content

Conversation

@Crypt-iQ
Copy link
Contributor

@Crypt-iQ Crypt-iQ commented Sep 3, 2025

Posting up to get feedback, there are some design flaws with the approach in this PR. Coverage is here (look in src/blockencodings.cpp, relevant compact block bits in src/net_processing.cpp).

This harness can make (in)valid blocks, reconstruct blocks with in-mempool txns, mark peers as HB, and has high stability in AFL++ (~98-99%).

The main downside is that there are filesystem operations. In the .init function initialize_cmpctblock, a chain of 200 blocks is created. Each fuzzing iteration then copies this statically-named, "cached" data directory to a temporary directory that gets deleted at the end of the iteration. If each fuzzing iteration instead mines its own chain, the execs/s slows down to a crawl (~0.5/s or less, which would also make CI runs really slow).

@DrahtBot DrahtBot added the Tests label Sep 3, 2025
@DrahtBot
Copy link
Contributor

DrahtBot commented Sep 3, 2025

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks

For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/33300.

Reviews

See the guideline for information on the review process.

Type Reviewers
Concept ACK l0rinc, frankomosh

If your review is incorrectly listed, please copy-paste <!--meta-tag:bot-skip--> into the comment that the bot should ignore.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #33740 (RFC: bench: Add multi thread benchmarking by fjahr)
  • #29415 (Broadcast own transactions only via short-lived Tor or I2P connections by vasild)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

@Crypt-iQ Crypt-iQ force-pushed the cmpctblock-fuzz-0807-fs branch 2 times, most recently from ea293e3 to c426b80 Compare September 3, 2025 22:45
@maflcko
Copy link
Member

maflcko commented Sep 4, 2025

There does not seem to be a way without signal handlers to delete the static, cached datadir when fuzzing is complete. Another issue is that multi-core fuzzing via AFL++ does not work because each worker will try to wipe the cached datadir (created via -testdatadir which wipes if the directory exists) which will cause other workers to crash if they are trying to copy it. I could not figure out a way for each worker to have their own cached datadir.

Is my understanding correct that AFL will run init (create the static dir), then fork into several fuzz processes, which run for N iterations and then shut down and delete the static dir? (On the next re-fork, it will fail to find the static dir)?

If yes, a solution could be to lazy-init the static dir after init, after the fork. Also, you may have to use the pid (or os-rand) to name the static folder to avoid collisions.

As for deleting the dirs, maybe it is possible to spin up several testing setups at the same time:

  • One dummy testing setup to create a root dir, which is cleaned up
  • One testing setup living inside that root dir for the static folder (in a subfolder)
  • One testing setup living inside the root dir with a copy of the static folder

(Not sure if this is possible, though)

@dergoegge
Copy link
Member

Is my understanding correct that AFL will run init (create the static dir), then fork into several fuzz processes, which run for N iterations and then shut down and delete the static dir? (On the next re-fork, it will fail to find the static dir)?

We removed the __AFL_INIT call, so it will actually fork prior to init, so I think all that's needed is to randomize the static dir name as you suggested as well.

@Crypt-iQ
Copy link
Contributor Author

Crypt-iQ commented Sep 4, 2025

Is my understanding correct that AFL will run init (create the static dir), then fork into several fuzz processes, which run for N iterations and then shut down and delete the static dir? (On the next re-fork, it will fail to find the static dir)?

My description was a bit vague. Since the harness uses -testdatadir in init to create the static datadir, it does not get deleted in the TestingSetup destructor (by design of -testdatadir). Any time init is called, it will wipe the static datadir if it already exists. This can happen after N iterations (100,000 currently) or with a newly spawned worker after forking since the fork point is prior to init like @dergoegge mentioned. If the datadir is deleted while a fuzz iteration tries to copy the non-existent directory, it will crash.

a solution could be to lazy-init the static dir after init, after the fork

I considered this, but I think this introduces some (slight) non-determinism as the directory may/may not exist?

Also, you may have to use the pid (or os-rand) to name the static folder to avoid collisions.

Yup, I agree. I think this randomization can only be done if the static datadir can be deleted at the end of fuzzing, otherwise there will be lots of these lying around?

As for deleting the dirs, maybe it is possible to spin up several testing setups at the same time:

Why is the dummy testing setup needed in this example? Could it instead just be two testing setups (one for the static datadir, one for the copied datadir)? I considered something similar to this as well where there is a static TestingSetup that gets destructed at the end of fuzzing (and wipes the static datadir) and one for each fuzz iteration (that gets destructed at the end of the iteration), but I was worried that the static TestingSetup would introduce more non-determinism? For example, what if this static TestingSetup has scheduling going on or is using the static datadir while we try to copy from it?

@Crypt-iQ
Copy link
Contributor Author

Crypt-iQ commented Sep 5, 2025

I was able to get AFL++ multi-core fuzzing to work by adding a static FuzzedDirectoryWrapper instance that deletes the passed path in its destructor as well as using a random element to each statically named datadir. Each instance is in the high 98-99% stability and does not crash after hitting 100,000 iterations (yay). I was not able to use multiple TestingSetups at the same time without causing several panics due to assertions about globals. I will fix the lint, tidy jobs and also see if the deterministic-fuzz-coverage issues still pop up.

@Crypt-iQ Crypt-iQ force-pushed the cmpctblock-fuzz-0807-fs branch from c76f475 to 0de9143 Compare September 9, 2025 14:49
@Crypt-iQ
Copy link
Contributor Author

Crypt-iQ commented Sep 9, 2025

There is non-determinism here because m_dirty_blockindex is std::set<CBlockIndex*> and sorts based on the pointer. This can be fixed by adding a function object that compares the block hashes. However, I did not know if this should be added just for fuzz code and I have not run benchmarks.

After patching the above non-determinism locally, there is also non-determinism in the scheduler because this fuzz test uses the scheduler. I am guessing this is due to the RegisterValidationInterface call in the fuzz test notifying when transactions are being added or removed, and blocks being mined.

@Crypt-iQ Crypt-iQ marked this pull request as ready for review September 9, 2025 16:40
Copy link
Member

@dergoegge dergoegge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a first pass, overall approach looks good to me

@Crypt-iQ Crypt-iQ force-pushed the cmpctblock-fuzz-0807-fs branch 4 times, most recently from dbc911f to e2f9214 Compare September 18, 2025 16:08
@Crypt-iQ
Copy link
Contributor Author

Crypt-iQ commented Sep 18, 2025

The latest push adds two commits:

  • a5619ea631bd8b93b4ef02a20abb8c1c0705d8e4 "test: add setup_validation_interface_no_scheduler to TestOpts"
    • Disables the scheduler completely if set. This is needed because this harness creates a TestingSetup inside FUZZ_TARGET and scheduling a promise can be non-deterministic as the scheduler's serviceQueue may start with a non-empty taskQueue. This is not an issue in the other fuzz tests because their TestingSetup is created in .init and ResetCoverageCounters is called after.
  • e2f921458913bcbbe74115cdb2174b0ab31784f2 "node: sort m_dirty_blockindex by block hash"
    • I am ok with this commit being removed and want to know what others think. This sorts by block hash rather than memory address and does introduce a slow-down in production code for no production benefit (I think because memory addresses are ~generally going to be increasing while inserting into m_dirty_blockindex, whereas sorting by block hash won't). I added it to show the change needed to make this harness fully non-deterministic. It is also possible to add an #ifdef so that it doesn't negatively affect production code, but I also don't want to set a precedent and litter the codebase with fuzz-specific macros. The sorting is needed as otherwise leveldb's MemTable will be non-deterministic across runs. I tried some alternatives, but this is the best I could come up with that made the determinstic-fuzz-coverage script happy.

@Crypt-iQ Crypt-iQ force-pushed the cmpctblock-fuzz-0807-fs branch from e2f9214 to ed813c4 Compare October 2, 2025 16:33
@Crypt-iQ
Copy link
Contributor Author

Crypt-iQ commented Oct 2, 2025

e2f921458913bcbbe74115cdb2174b0ab31784f2 -> ed813c4:

  • Fixes an issue pointed out by @marcofleon with the PoW checks failing
  • Adds more descriptive commit messages

Copy link
Contributor

@marcofleon marcofleon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some possible discussion about ed813c4 for review club.


std::vector<CNode*> peers;
auto& connman = *static_cast<ConnmanTestMsg*>(setup->m_node.connman.get());
for (int i = 0; i < 3; ++i) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we discussed a bit offline, increasing the number of peers here to greater than 3 should be enough to hit some of the missing "eviction" logic in MaybeSetPeerAsAnnouncingHeaderAndIDs. Maybe increasing to 8 peers would be fine? Although not sure if that would test anything more (somewhere else) than if it were 4.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will compare the coverage from increasing to 4 vs increasing to 8.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neither 4 nor 8 helped hit the case, will look into it more.

Copy link
Contributor

@marcofleon marcofleon Oct 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-blocking I'd say, so could be a followup once it's figured out.

Strange though, now I'm curious.

Copy link
Contributor Author

@Crypt-iQ Crypt-iQ Oct 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think outbound connections are getting disconnected for some reason, latest coverage is here. On a local branch, I modified the harness to return early if any of the test nodes did not complete the version-verack handshake, but coverage is still missing.

@Crypt-iQ Crypt-iQ force-pushed the cmpctblock-fuzz-0807-fs branch 2 times, most recently from a1b58a2 to f2ce362 Compare October 30, 2025 16:13
@Crypt-iQ
Copy link
Contributor Author

Crypt-iQ commented Oct 30, 2025

Latest push ed813c4 -> f2ce362:

  • implements GetStrongRandBytes per comment
  • modifies create_tx to choose a mempool UTXO only sometimes instead of most of the time
  • modifies create_block to generate more than 2 non-coinbase transactions per feedback from review club

Copy link
Contributor

@l0rinc l0rinc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concept and approach ACK, thanks for tackling this.

I like how you've extracted the first few commits that I could quickly get out of the way to continue to the more difficult ones - which are a bit too difficult though, it would help me personally if we could split it into smaller functional chunks, so that the reviewers are guided through the change. The commit messages could give extra context where needed. I don't think we should split by functions, but rather functionalities so that each commit could theoretically be merged if needed - we shouldn't depend on a future commit to understand a previous one.

I have provided some hints and some further context for the LevelDB related changes. I think we should do the set stabilization in a dedicated PR, but you can also adjust it here so that it likely performs similarly to the original.


// Disallow implicit std::string conversion for copy to avoid locale-dependent
// encoding on Windows. This is currently only used in fuzzing.
static inline void copy(const path& from, const path& to, copy_options options)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

8515594

Do we still need this after #33550?

This is currently only used in fuzzing.

Are we adding dead code in this commit? It's hard to judge if a solution is accurate if we see it before the problem is presented - can we make the question obvious before we provide the answer?

And more concretely: my IDE is confused about the usage of this even at the tip - would it be possible to invalidate the conversions you don't want instead, maybe something like:

static void copy(const std::string&, const path&, copy_options) = delete;
static void copy(const path&, const std::string&, copy_options) = delete;
static void copy(const std::string&, const std::string&, copy_options) = delete;

nit: inline is implicit

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still need this after #33550?

I think it's still needed, I added it to get rid of linter warnings if std::filesystem::copy is instead called directly from the test code.

Are we adding dead code in this commit? It's hard to judge if a solution is accurate if we see it before the problem is presented - can we make the question obvious before we provide the answer?

Fair point, I'll rework the commits so it's a bit easier to tell where things are used.

would it be possible to invalidate the conversions you don't want instead, maybe something like

Yes, that also works.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it be possible to invalidate the conversions you don't want instead, maybe something like

Currently implementing this and wondering if this is necessary; can you share what your IDE is confused about? The constructor of path that takes a std::string is deleted.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have Windows, it's why I asked if instead of adding a new method we can just delete the conversions, somewhat similarly to the mentioned https://github.com/bitcoin/bitcoin/pull/33550/files#diff-69423eb01bf14b3bd0d930c0b3e1fd6f4f061ffefacab579053eaa734fc22f38R65 (since the error seemed superficially similar to me).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deleting the conversions instead of adding a new method does not work because overload resolution considers these functions even though they are deleted and then the compiler errors that copy is ambiguous:

/Users/eugenesiegel/btc/bitcoin-clone/src/test/util/setup_common.cpp:210:9: error: call to 'copy' is ambiguous
  210 |         fs::copy(cached_dir, m_path_root, fs::copy_options::recursive);
      |         ^~~~~~~~
/Users/eugenesiegel/btc/bitcoin-clone/src/util/fs.h:142:13: note: candidate function has been explicitly deleted
  142 | static void copy(const path&, const std::string&, copy_options) = delete;
      |             ^
/Users/eugenesiegel/btc/bitcoin-clone/src/util/fs.h:141:13: note: candidate function has been explicitly deleted
  141 | static void copy(const std::string&, const path&, copy_options) = delete;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: inline is implicit

clang warns about unused functions if I remove the inline

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this helper appears to be used only by fuzz test code, is there any downside to keeping it in a test only location(i.e under src/test/util) instead of src/util/fs.h? Maybe that could help keep util’s interface focused on production use, but happy to defer if there’s a reason it needs to live here specifically.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be here because of the linter which checks that std::filesystem isn't used outside of this fs namespace.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I guess the linter could be modified. I'll need to think a bit more on this. Ideally this wouldn't live in production code if it's not used in production.

promise.get_future().wait();
}
} else if (opts.setup_validation_interface_no_scheduler) {
m_node.validation_signals = std::make_unique<ValidationSignals>(std::make_unique<util::ImmediateTaskRunner>());
Copy link
Contributor

@l0rinc l0rinc Nov 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have the same problem in 71719d172e3a435dd983293fc9da6a08974f8b01 - we're introducing dead code, I don't have a way to judge if this is indeed the correct solution, since up to this point there wasn't any pain that this alleviates.
Looking at the commit I see that setup_validation_interface_no_scheduler is always false in this commit therefore we can delete the code - and I have to check the next commits and come back here to reevaluate that.
Can we add a simpler fuzz test which needs this in the same commit that introduces it - and extend the test with other functionality separately.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a simpler fuzz test which needs this in the same commit that introduces it - and extend the test with other functionality separately.

Yes, I'll do some variation of this. I think it might be cleaner to put it as one of the last commits.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 0190ac6

CBlockHeaderAndShortTxIDs base_cmpctblock = cmpctblock;
net_msg = NetMsg::Make(NetMsgType::CMPCTBLOCK, base_cmpctblock);
},
[&]() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This commit is huge, it's hard to review it meaningfully. Can we separate some of these cases to independent commits that can provide more context?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll incrementally add the cases so it's a bit easier to digest.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in the latest push to 039c3aa

{
argsman.AddArg("-testdatadir", strprintf("Custom data directory (default: %s<random_string>)", fs::PathToString(fs::temp_directory_path() / TEST_DIR_PATH_ELEMENT / "")),
ArgsManager::ALLOW_ANY, OptionsCategory::DEBUG_TEST);
argsman.AddArg("-fuzzcopydatadir", "Copies the passed data directory to a temporary directory to use during a single fuzz iteration", ArgsManager::ALLOW_ANY | ArgsManager::DEBUG_ONLY, OptionsCategory::DEBUG_TEST);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

5a0f0c3 commit is also quite big, can we split by functionality instead? I would prefer the commits telling a story, to each make be self-contained as much as possible, to make sense on their own as far as it's reasonable. Can we split this one by features, e.g. adding rand_path to be able to specify strong random, maybe fuzzcopydatadir param and usage next, EnableFuzzDeterminism last - or something similar

Copy link
Contributor Author

@Crypt-iQ Crypt-iQ Nov 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in f07ef41 which introduces rand_path using strong randomness, and -fuzzcopydatadir and uses both in the same commit. The EnableFuzzDeterminism change is done in 039c3aa.

- Adds a std::filesystem::copy wrapper to fs that allows copying an
  entire directory.
- Adds a -fuzzcopydatadir argument to ArgsManager that accepts a path
  to a cached data directory. BasicTestingSetup() copies everything in
  this data directory over to a new directory and ~BasicTestingSetup
  deletes the new directory.
- Adds a compact block harness that creates a data directory during setup
  and is copied (and deleted!) each iteration via -fuzzcopydatadir. Currently,
  the harness only sets mock time in a loop.
The cmpctblock harness now adds peers and processes the sendcmpct message.
Also mine 200 blocks in initialize_cmpctblock so that the transactions
have coinbases they can spend.
The blocks may include mempool and non-mempool transactions.
Also move FinalizeHeader from p2p_headers_presync.cpp to util.h so
that the mining function can use it to create valid headers.
Using an existing block or a newly created block, send a compact
block message to the target node. Some of the time, the harness
will prefill multiple transactions from the block. Adds a
FuzzedCBlockHeaderAndShortTxIDs class that is used to populate the
protected prefilledtxn and shorttxids fields.
Sometimes, blindly take an existing block and choose random
transactions to request in a blocktxn message.
This disables the scheduler completely if set and is mutually exclusive
with setup_validation_interface. Since the cmpctblock harness creates
a TestingSetup inside FUZZ_TARGET, scheduling a promise (as
setup_validation_interface does) suffers from non-determinism since
the scheduler's serviceQueue may start with a non-empty taskQueue. The
other fuzz harnesses do not have this issue since they create TestingSetup
in .init and ResetCoverageCounters is called after.
…datadir

- Removing the random path element allows the cmpctblock harness to know the
  exact path to pass to -fuzzcopydatadir.
- Removing the "Test directory (will not be deleted) ..." log line avoids
  spamming stdout when fuzzing.
@Crypt-iQ Crypt-iQ force-pushed the cmpctblock-fuzz-0807-fs branch from f2ce362 to 039c3aa Compare November 21, 2025 20:42
@Crypt-iQ
Copy link
Contributor Author

Latest push f2ce362 -> 039c3aa breaks up the harness into more reviewable chunks. I forgot to add @stringintech as co-author on a commit during rebase, will address the next time I push up.

Copy link
Contributor

@frankomosh frankomosh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concept ACK. fuzzing direction look reasonable.


// Disallow implicit std::string conversion for copy to avoid locale-dependent
// encoding on Windows. This is currently only used in fuzzing.
static inline void copy(const path& from, const path& to, copy_options options)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this helper appears to be used only by fuzz test code, is there any downside to keeping it in a test only location(i.e under src/test/util) instead of src/util/fs.h? Maybe that could help keep util’s interface focused on production use, but happy to defer if there’s a reason it needs to live here specifically.

if (fuzzed_data_provider.ConsumeBool() && mempool_size != 0) {
size_t random_idx = fuzzed_data_provider.ConsumeIntegralInRange<size_t>(0, mempool_size - 1);
LOCK(setup->m_node.mempool->cs);
outpoint = COutPoint(setup->m_node.mempool->txns_randomized[random_idx].second->GetSharedTx()->GetHash(), 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like txns_randomized reorders between runs, could this cause the same fuzz seed to pick different transactions? If so, would it make sense to snapshot the vector first for stable indexing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think txns_randomized gets reordered between runs, its randomness depends on the order that inserts/deletes happen in the mempool which is deterministic. I've also checked that this harness is fully deterministic with the deterministic-fuzz-coverage script.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, the harness is not fully deterministic because I dropped a commit that will be done in a follow-up which causes some LevelDB non-determinism. But it is deterministic as far as the usage of txns_randomized goes.

@DrahtBot
Copy link
Contributor

🐙 This pull request conflicts with the target branch and needs rebase.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants