profiling: add benchmark and profiling scripts #406
Conversation
A benchmark/profiling script for local development or CI helps to catch performance regressions early and to find spots for optimization.

The benchmark setup consists of a cargo test that reuses the test infrastructure to forward localhost connections. This test is skipped by default unless an env var is set. The benchmark load comes from a fortio server and client for HTTP/gRPC req/s and latency measurement, and from an iperf server and client for TCP throughput measurement. In addition to the fortio UI for inspecting the benchmark data, the results are also stored in a summary text file, which can be used to plot the difference between the summary results of, e.g., two git branches.

The profiling setup is the same as above but also runs "perf" or "memory-profiler" to sample the call stacks, either at runtime or on heap allocation calls. This requires a special debug build with optimizations, which can be generated with a build script. The results can be inspected as interactive flamegraph SVGs in the browser.

Please follow the instructions in the profiling/README.md file on how the scripts are used.

Signed-off-by: Kai Lüke <[email protected]>
Since the profiling binary is no longer run as a Rust test, we don't need to silently skip it when the required env var is unset. Instead, we can actually print that we expected it to be set and exit early. Signed-off-by: Eliza Weisman <[email protected]>
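A rough sketch of that fail-fast behavior (the `PROFILING_SUPPORT_SERVER` variable name appears in the script logs in this thread; the function name here is hypothetical, not the actual script code):

```shell
# Hypothetical sketch: fail fast with a clear message instead of silently
# skipping when the required env var is unset.
require_profiling_server() {
  if [ -z "${PROFILING_SUPPORT_SERVER:-}" ]; then
    echo "error: PROFILING_SUPPORT_SERVER must be set (e.g. 127.0.0.1:8080)" >&2
    return 1
  fi
  echo "profiling against ${PROFILING_SUPPORT_SERVER}"
}
```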
I've done some manual testing of the profiling scripts added in this branch, and confirmed that everything seems to work — I've even generated some potentially interesting flamegraphs.
@hawkw Is it feasible to put the binary in its own crate (i.e. …)
The profiling script needs to ensure that …
Trying to invoke fortio, the profile script hangs. Our invocation appears to use a deprecated flag:
hmm, this worked for me on ares — what version of fortio do you have installed? on ares, I see `1.3.2-pre unknown go1.10.2`, which i suspect is probably old?
I'm using the most recent release from fortio:
```console
:; fortio version
1.3.1 2019-02-02 14:08 fd8f4a7177e9ea509f27105ae4e55e6c68ece6f7 go1.11.5
```
Which is a bit older than I imagined, but... okay. I'll install from source, I guess.
…On Tue, Jan 14, 2020 at 12:00 PM Eliza Weisman ***@***.***> wrote:
> Trying to invoke fortio, the profile script hangs. Our invocation appears to use a deprecated flag:
> ```console
> :; fortio load -resolve 127.0.0.1 -c=4 -qps=4000 -t=10s -payload-size=10 '-labels=eliza-benchmark-and-profiling 2020Jan14_19h28m54s Iter: 1 Dur: 10s Conns: 4 Streams: 4' -json=http1outbound_bench-4000-rps.2020Jan14_19h28m54s.json -keepalive=false -H 'Host: transparency.test.svc.cluster.local' localhost:4140
> flag provided but not defined: -resolve
> ```
> hmm, this worked for me on ares — what version of fortio do you have installed? on ares, I see:
> ```console
> ***@***.***:~$ fortio -version
> 1.3.2-pre unknown go1.10.2
> ```
> which i suspect is probably old?
--
Oliver Gould <[email protected]>
I wonder if we can dockerize the test harness. Or we could use something like Pex; but it would be great to figure out how to bundle dependencies for things like plot.py (or replace it with something else that we can bundle better).
yeah, I'm trying to figure out a nicer solution for that... i don't know a whole lot about Python & how to bundle python deps with a script, but I'm doing some research.
Another issue I've run into while testing this:
```
profiling/profiling-perf-fortio.sh: line 52: 32065 Segmentation fault
(core dumped) PROFILING_SUPPORT_SERVER="127.0.0.1:$SERVER_PORT" perf
record -F 2000 --call-graph dwarf $LINKERD_TEST_BIN &> "$LOG"
+ echo 'proxy failed'
proxy failed
+ perf script
+ inferno-collapse-perf
WARNING: The perf.data file's data size field is 0 which is unexpected.
Was the 'perf record' command properly terminated?
profiling/profiling-perf-fortio.sh: line 52: 32118 Segmentation fault
(core dumped) perf script
32119 Done | inferno-collapse-perf > "out_$NAME.$ID.folded"
+ killall iperf fortio
```
I think this means that the profile binary is segfaulting? I'll need to try
manually running it to verify
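One quick way to verify that suspicion (a sketch, not the actual script): a child process killed by SIGSEGV exits with status 128 + 11 = 139, so a wrapper can check for that explicitly.

```shell
# Run a command and report whether it died from SIGSEGV (exit status 139).
died_of_segfault() {
  "$@"
  [ "$?" -eq 139 ]
}
```

For example, `died_of_segfault "$LINKERD_TEST_BIN"` would distinguish the profile binary crashing from `perf` itself misbehaving.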
--
Oliver Gould <[email protected]>
That segfault was a config issue on my host. Seems to work properly after fixing it.
@olix0r which, if any, of these changes would you prefer we make on this branch, and which ones can be done in follow-up PRs? BTW, I've made some changes to the plot script in a second branch; it now produces error bars based on the (fortio-reported) standard deviation.
@hawkw Basically, I care that we know how to run these scripts on arbitrary Linux hosts. I'm still having some problems running the profiling scripts successfully. I'll need to dig in a little more to understand what's wrong. Once I've run these scripts on my laptop, I think we can merge and address feedback in followups.
Okay, great. I'm working on dockerizing the plotting script... but I don't think that will help with the profiling scripts, given that perf needs to be installed on the host...
@hawkw i suspect that we could dockerize perf/iperf etc. if we gave the container SYS_ADMIN privileges. Not a hard blocker, but I think it's worth investigating.
The first few profile exercises appear to work okay, but the third test appears to fail (note the "Terminated" line). Nothing progresses after this for me...
I'm pretty sure the "Terminated..." line is normal — when I run the benchmarks, I see that at the end of each run. I think this is just how the benchmark script kills the test server.
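That pattern — launch the server in the background, run the load, then kill it — looks roughly like this (a stand-in `sleep` plays the server here; the real script names differ). Non-interactive bash reports the killed background job with a line like "Terminated":

```shell
# Stand-in sketch: start a background "server", do some work, then kill it.
sleep 60 &            # stands in for the test server process
server_pid=$!
# ... run the benchmark load against the server here ...
kill "$server_pid"    # this is what produces the "Terminated" message
wait "$server_pid" 2>/dev/null || true
```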
Ah, the test isn't hung indefinitely. It's just that some of the …
@hawkw out of curiosity, about how long does it take you to run a profile? The actual test invocations seem pretty quick, but the … I wonder if (as a followup) we can parallelize the image generation after all of the profiles have been run?
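As a followup sketch (names here are hypothetical; in the real setup the render command would be something like inferno-flamegraph), the per-profile SVG rendering could be fanned out with background jobs:

```shell
# Render every collapsed-stack file to an SVG in parallel, then wait for all.
# $1 is a command that reads a .folded file on stdin and writes SVG to stdout
# (inferno-flamegraph in the real setup; any filter works for illustration).
render_all() {
  render_cmd="$1"
  for f in out_*.folded; do
    [ -e "$f" ] || continue
    "$render_cmd" < "$f" > "${f%.folded}.svg" &
  done
  wait
}
```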
i've dockerized the plotting script in 88e84eb and added a bash script to run it in docker.
running a profiling run on ares; i'll get back to you on how long it takes when it finishes! running perf is definitely not fast, though; the benchmark run without perf is pretty quick in my experience.
Okay, I think this is ready for another (hopefully final!) look — I've fixed a couple of last bugs with running the tests in Docker, and updated the README.
Hmmm... when I try to run either benchmark.sh or profiling-heap.sh, I don't actually see a proxy get started, and both hang. Also, is it not possible to run these scripts from the repo root? It looks like it's intended to work, but I get some errors like: …
@olix0r I believe my most recent changes have fixed the issues with proxies not starting on Linux (due to permissions for ssh keys 🙃) + the path wackiness.
Force-pushed from 1b8b870 to 2e33c75
kleimkuhler left a comment:
This looks good to me! There has been a lot of discussion already, but running the benchmark.sh and profiling-heap.sh scripts on my machine (Mac), everything works as expected when following the directions in the README.
I confirmed with @hawkw that running profiling-perf.sh is probably going to result in errors due to perf not being available.
I was able to analyze the .svgs and confirmed that everything cleans up well afterward.
profiling/README.md (Outdated)
> For example:
> ```console
> $ ITERATIONS=2 DURATION=2s CONNECTIONS=2 GRPC_STREAMS=2 HTTP_RPS="100" GRPC_RPS="100 1000" REQ_BODY_LEN="100 8000" ./benchmark-cargo-test-fortio.sh
> ```

This should now be `./benchmark.sh`
This release includes the results from continued profiling & performance analysis. In addition to modifying internals to prevent unwarranted memory growth, we've introduced new metrics to aid in debugging and diagnostics: a new `request_errors_total` metric exposes the number of requests that receive synthesized responses due to proxy errors; and a suite of `stack_*` metrics expose proxy internals that can help us identify unexpected behavior.

---

* trace: update `tracing-subscriber` dependency to 0.2.1 (linkerd/linkerd2-proxy#426)
* Reimplement the Lock middleware with tokio::sync (linkerd/linkerd2-proxy#427)
* Add the request_errors_total metric (linkerd/linkerd2-proxy#417)
* Expose the number of service instances in the proxy (linkerd/linkerd2-proxy#428)
* concurrency-limit: Share a limit across Services (linkerd/linkerd2-proxy#429)
* profiling: add benchmark and profiling scripts (linkerd/linkerd2-proxy#406)
* http-box: Box HTTP payloads via middleware (linkerd/linkerd2-proxy#430)
* lock: Generalize to protect a guarded value (linkerd/linkerd2-proxy#431)
This is essentially @pothos' PR #278, but rebased & updated to work with the current master. In addition, I've changed the profiling proxy to be run as a separate bin target (run with `cargo run --bin profile`) rather than a test case that does nothing on most test runs and runs the proxy in profiling mode when an env var is set.
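Declaring a separate bin target like that is just a Cargo manifest entry; a minimal sketch (the path is an assumption, not necessarily what this PR uses):

```toml
# Cargo.toml — an extra binary alongside the library and test targets,
# invoked with `cargo run --bin profile`
[[bin]]
name = "profile"
path = "src/profile.rs"
```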
Description from the original PR:
Closes #278.
Signed-off-by: Eliza Weisman [email protected]
Co-authored-by: Kai Lüke [email protected]