Skip to content

Conversation

@vasild
Copy link
Contributor

@vasild vasild commented Jan 16, 2025

Add a new field cpu_load to the output of getpeerinfo RPC.

It represents the CPU time spent by the message handling thread for the given peer, weighted for the duration of the connection. That is, for example, if two peers are equally demanding and one is connected longer than the other, then they will have the same cpu_load number.


Monitoring CPU usage is useful on its own. Also related to #31033.


This PR uses clock_gettime() (POSIX) and GetThreadTimes() (Windows). An alternative to those, should this be explored are: getrusage() (POSIX) and QueryThreadCycleTime() (Windows, but it counts CPU cycles).

@DrahtBot
Copy link
Contributor

DrahtBot commented Jan 16, 2025

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks

For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/31672.

Reviews

See the guideline for information on the review process.

Type Reviewers
Concept NACK rebroad
Concept ACK sipa, theStack, BrandonOdiwuor, laanwj, mzumsande, 1440000bytes
Stale ACK jonatack, yuvicc

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #34242 (Prepare string and net utils for future HTTP operations by pinheadmz)
  • #32061 (Replace libevent with our own HTTP and socket-handling implementation by pinheadmz)
  • #19461 (multiprocess: Add bitcoin-gui -ipcconnect option by ryanofsky)
  • #19460 (multiprocess: Add bitcoin-wallet -ipcconnect option by ryanofsky)
  • #10102 (Multiprocess bitcoin by ryanofsky)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

LLM Linter (✨ experimental)

Possible typos and grammar issues:

  • require it be validated -> require that it be validated [missing “that” for clarity]

drahtbot_id_4_m

@DrahtBot
Copy link
Contributor

🚧 At least one of the CI tasks failed.
Debug: https://github.com/bitcoin/bitcoin/runs/35717681069

Hints

Try to run the tests locally, according to the documentation. However, a CI failure may still
happen due to a number of reasons, for example:

  • Possibly due to a silent merge conflict (the changes in this pull request being
    incompatible with the current code in the target branch). If so, make sure to rebase on the latest
    commit of the target branch.

  • A sanitizer issue, which can only be found by compiling with the sanitizer and running the
    affected test.

  • An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

@jonatack
Copy link
Member

Concept ACK if this can be a reliable, useful metric and help pave the path to #31033.

@sipa
Copy link
Member

sipa commented Jan 16, 2025

Concept ACK

@theStack
Copy link
Contributor

Concept ACK

Might be worth noting that this is currently only available on POSIX systems (i.e. not available on Windows, but on all other systems that we support AFAICT) in both the RPC help and a release note.

Copy link
Contributor

@BrandonOdiwuor BrandonOdiwuor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concept ACK

@fanquake
Copy link
Member

fanquake commented Jan 17, 2025

Seems like there was some very brief discussion in #31033, and the conclusion "I guess this should start with planting some metrics", but I'm not sure putting changes into Bitcoin Core is the right first step.

Before changing our API, it'd be good to atleast show some usage that indicates that the changes here are useful. I assume you've already been running this locally, and collecting the data, so it'd be good to know what it turned up?

As noted in #31033, others have also already conducted similar research (https://b10c.me/projects/023-cpu-usage-of-peers/ or https://delvingbitcoin.org/t/cpu-usage-of-peers/196), and it seems like the data you'd expose here, will be less detailed / useful than what can/has already been produced using tracepoints or similar, so I'm wondering if this is the best approach.

@vasild
Copy link
Contributor Author

vasild commented Jan 17, 2025

@fanquake IMO, monitoring CPU usage is useful on its own, even if we don't start treating peers differently based on that metric (#31033). In other words, this metric is useful at least as much as a bunch of other metrics in the getpeerinfo output.

Yes, I have been running this locally - there is like 65x difference between the least and most demanding peers. I am curious to correlate this to the messages being sent/received to/from those peers and to have a histogram of the data, not just least / most demanding (e.g. histogram of bitcoin-cli getpeerinfo |jq ".[].cpu_load"). I view those as something nice to build on top of this PR, not as a blocker that's needed to justify the usefulness of the CPU time metric.

@1440000bytes

This comment was marked as abuse.

@vasild
Copy link
Contributor Author

vasild commented Jan 20, 2025

Might be worth noting that this is currently only available on POSIX systems (i.e. not available on Windows

Right. GetThreadTimes() looks like a promising Windows alternative. I will try to implement that here. Switching to draft because I will push some work-in-progress a bunch of times.

@vasild vasild marked this pull request as draft January 20, 2025 12:25
@vasild vasild force-pushed the peer_cpu_load branch 3 times, most recently from 3de9c9c to 8a3ec6f Compare January 22, 2025 14:01
@vasild vasild marked this pull request as ready for review January 23, 2025 09:58
@vasild
Copy link
Contributor Author

vasild commented Jan 23, 2025

Ready for review. Implemented for Windows as well.

Here is a test program I played with to test this manually. I think it is not worth adding as a unit test or benchmark, but at least it can live here in the comments of this PR:

cpu_time_windows.cpp
#include <assert.h>

#include <windows.h>
#include <winnt.h>

#include <processthreadsapi.h>

#include <iostream>
#include <thread>

void thread(size_t work)
{
    FILETIME before_creation;
    FILETIME before_exit;
    FILETIME before_kernel;
    FILETIME before_user;
    bool ret = GetThreadTimes(GetCurrentThread(),
        &before_creation,
        &before_exit,
        &before_kernel,
        &before_user);
    assert(ret == 1);

    if (work > 10'000) {
        for (size_t i{0}; i < work; ++i) {
            (void)(i * i);
        }
    } else {
        std::this_thread::sleep_for(std::chrono::milliseconds{work});
    }

    FILETIME after_creation;
    FILETIME after_exit;
    FILETIME after_kernel;
    FILETIME after_user;
    ret = GetThreadTimes(GetCurrentThread(),
        &after_creation,
        &after_exit,
        &after_kernel,
        &after_user);
    assert(ret == 1);

    ULARGE_INTEGER before_kernel_;
    before_kernel_.LowPart = before_kernel.dwLowDateTime;
    before_kernel_.HighPart = before_kernel.dwHighDateTime;

    ULARGE_INTEGER before_user_;
    before_user_.LowPart = before_user.dwLowDateTime;
    before_user_.HighPart = before_user.dwHighDateTime;

    ULARGE_INTEGER after_kernel_;
    after_kernel_.LowPart = after_kernel.dwLowDateTime;
    after_kernel_.HighPart = after_kernel.dwHighDateTime;

    ULARGE_INTEGER after_user_;
    after_user_.LowPart = after_user.dwLowDateTime;
    after_user_.HighPart = after_user.dwHighDateTime;

    ULARGE_INTEGER elapsed;
    elapsed.QuadPart = after_kernel_.QuadPart + after_user_.QuadPart - before_kernel_.QuadPart - before_user_.QuadPart;
    ULONGLONG elapsed_ns{elapsed.QuadPart * 100};
    std::cout << "cpu time: " << elapsed_ns / 1'000'000'000.0 << " sec\n";
}

int main ()
{
    std::thread t1{thread, 1000000000};
    std::thread t2{thread, 1000000000};
    std::thread t3{thread, 3000};
    t1.join();
    t2.join();
    t3.join();

    return 0;
}

Copy link
Member

@maflcko maflcko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(left two nits, feel free to ignore)

@vasild
Copy link
Contributor Author

vasild commented Jan 23, 2025

28b2a67f13...0f68c47e93: address suggestions by @maflcko

@maflcko
Copy link
Member

maflcko commented Jan 23, 2025

The code here assumes that Bitcoin Core is single threaded, which may be true right now, but could change in the future.

@vasild
Copy link
Contributor Author

vasild commented Apr 17, 2025

bb822a5ee1...ee16345af3: address suggestions

@vasild
Copy link
Contributor Author

vasild commented Apr 17, 2025

ee16345af3...19c8336d97: address suggestions

@jonatack
Copy link
Member

jonatack commented Apr 19, 2025

Latest changes since my last review LGTM. I plan to review the code more closely.

Empirically, running a node on this branch, it looks like over time most peers settle into less than 0.5 per milles, a few are higher, with maybe one peer around 4. These higher-load connections seem to more often be outbound ones.

(Edit, updated the branch with the -netinfo commit with a few improvements and to also only display cpu load for a peer (rounded to nearest integer) if it is non-zero (e.g. >= 0.5 in getpeerinfo), to more easily see which peers are using more CPU: https://github.com/jonatack/bitcoin/commits/2025-04-add-cpu_load-to-netinfo/

Example 1

Screenshot 2025-04-19 at 12 54 32

Example 2 (a couple weeks later)

Screenshot 2025-05-03 at 12 51 29

@vasild
Copy link
Contributor Author

vasild commented May 13, 2025

19c8336d97...8b8b854346: address suggestions and take the bitcoin-cli change from https://github.com/jonatack/bitcoin/commits/2025-04-add-cpu_load-to-netinfo/, thanks!

luke-jr pushed a commit to luke-jr/bitcoin that referenced this pull request Jun 6, 2025
Add a new field `cpu_load` to the output of `getpeerinfo` RPC.

It represents the CPU time spent by the message handling thread for the
given peer, weighted for the duration of the connection. That is, for
example, if two peers are equally demanding and one is connected longer
than the other, then they will have the same `cpu_load` number.

Github-Pull: bitcoin#31672
Rebased-From: 8571fbb123dc91cab6903d22881da874ad91fcef
luke-jr pushed a commit to luke-jr/bitcoin that referenced this pull request Jun 6, 2025
Github-Pull: bitcoin#31672
Rebased-From: 8b8b85434653f66a275a1f757d49a719847b57e5
@DrahtBot
Copy link
Contributor

🚧 At least one of the CI tasks failed.
Task tidy: https://github.com/bitcoin/bitcoin/runs/42137973271
LLM reason (✨ experimental): The CI failure is caused by clang-tidy detecting a deprecated header include in the code.

Hints

Try to run the tests locally, according to the documentation. However, a CI failure may still
happen due to a number of reasons, for example:

  • Possibly due to a silent merge conflict (the changes in this pull request being
    incompatible with the current code in the target branch). If so, make sure to rebase on the latest
    commit of the target branch.

  • A sanitizer issue, which can only be found by compiling with the sanitizer and running the
    affected test.

  • An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

vasild and others added 2 commits June 20, 2025 09:59
Add a new field `cpu_load` to the output of `getpeerinfo` RPC.

It represents the CPU time spent by the message handling thread for the
given peer, weighted for the duration of the connection. That is, for
example, if two peers are equally demanding and one is connected longer
than the other, then they will have the same `cpu_load` number.
@vasild
Copy link
Contributor Author

vasild commented Jun 20, 2025

8b8b854346...b25b40ebd5: rebase and pet tidy: time.h -> ctime

luke-jr pushed a commit to bitcoinknots/bitcoin that referenced this pull request Jul 17, 2025
Add a new field `cpu_load` to the output of `getpeerinfo` RPC.

It represents the CPU time spent by the message handling thread for the
given peer, weighted for the duration of the connection. That is, for
example, if two peers are equally demanding and one is connected longer
than the other, then they will have the same `cpu_load` number.

Github-Pull: bitcoin#31672
Rebased-From: f1517d9
luke-jr pushed a commit to bitcoinknots/bitcoin that referenced this pull request Jul 17, 2025
@rebroad
Copy link
Contributor

rebroad commented Jul 28, 2025

NACK. cpu load alone is a useless metric. However, CPU load divided by TXs accepted into the chain (or mempool) could be a useful metric.

RPC
---

A new field `cpu_load` has been added to the `getpeerinfo` RPC output. It
Copy link
Contributor

@ajtowns ajtowns Oct 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why monitor cpu time rather than real time? Isn't time spent waiting for disk access while looking up the utxo set equally problematic? Waiting on another thread's lock is arguably less problematic (it's the other thread that's delaying things), but if an attacker is deliberately triggering lock contention somehow to delay processing other thread's messages, that would still be a problem.

Adding OS-specific logic rather than just using either real time or a steady clock seems over-complicated here.

[EDIT: I guess in the context of rate-limiting expensive peers, it would be better to have a rolling average of time used for each peer, so that having sent expensive txs five hours ago has ~0 impact on how much cpu time you're getting now]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.