Skip to content

perf: various performance optimizations to hot spots (8.75x speedup)#253

Open
Rongronggg9 wants to merge 4 commits intoilikenwf:masterfrom
Rongronggg9:perf/various
Open

perf: various performance optimizations to hot spots (8.75x speedup)#253
Rongronggg9 wants to merge 4 commits intoilikenwf:masterfrom
Rongronggg9:perf/various

Conversation

@Rongronggg9
Copy link
Contributor

@Rongronggg9 Rongronggg9 commented Mar 15, 2025

This is a follow-up of dc4ef45 (#239, "perf: speed up (>3x) get_uris, display_downloadfile with bash builtin"). In that patch, I left some performance hot spots unchanged.

The patchset optimized the remaining hot spots while keeping the readability as much as possible. I used several optimization techniques, which were documented in the commit messages. Optimizations were split into several commits and were benchmarked individually. Check the commit messages out if you are confused about the mechanism of these techniques or would like to know the speedup when applying alone.

A brief test (excluding the overhead of apt --print-uris) shows that the overall speedup is 8.75x (5.86/0.67). This is done by executing the following:

$ apt -y --print-uris dist-upgrade | tee test-data | wc -l
3347
$ grep -E "^'(http(s|)|(s|)ftp)://" test-data | wc -l
1659
$ echo $'#!/bin/sh\nexec cat "test-data"' >apt && chmod +x apt
$ sudo PATH="$PWD:$PATH" APT_FAST_TIMEOUT=0 time apt-fast upgrade

Before:

5.20user 1.32system 0:05.86elapsed 111%CPU (0avgtext+0avgdata 12380maxresident)k
0inputs+0outputs (0major+1356629minor)pagefaults 0swaps

After:

0.47user 0.23system 0:00.67elapsed 104%CPU (0avgtext+0avgdata 12344maxresident)k
0inputs+0outputs (0major+77138minor)pagefaults 0swaps

To maintainers: I wish my commit message in each commit (instead of only the cover letter) could be preserved when merging.

This is a follow-up of dc4ef45 ("perf: speed up (>3x) get_uris,
display_downloadfile with bash builtin"). In that patch, I left the
calculation of file size unchanged. It turns out that it is the severest
performance hot spot remaining.

The patch utilizes numfmt's cut-style formatting (--field=N) so that we
no longer need to convert file sizes into human-readable forms
repeatedly while parsing uris_full. This eliminates a lot of overheads
of subshell and piping, which consist of expensive fork()s and
execve()s.

A brief test (excluding the overhead of apt --print-uris) shows that the
speedup is 2.06x (5.86/2.85). This is done by executing the following:
$ apt -y --print-uris dist-upgrade | tee test-data | wc -l
3347
$ grep -E "^'(http(s|)|(s|)ftp)://" test-data | wc -l
1659
$ echo $'#!/bin/sh\nexec cat "test-data"' >apt && chmod +x apt
$ sudo PATH="$PWD:$PATH" APT_FAST_TIMEOUT=0 time apt-fast upgrade

Before:
5.20user 1.32system 0:05.86elapsed 111%CPU (0avgtext+0avgdata 12380maxresident)k
0inputs+0outputs (0major+1356629minor)pagefaults 0swaps
After:
2.20user 0.76system 0:02.85elapsed 103%CPU (0avgtext+0avgdata 12436maxresident)k
0inputs+0outputs (0major+616493minor)pagefaults 0swaps
The patch utilizes bash operator += to speed up the appending to
DOWNLOAD_DISPLAY and the accumulation to DOWNLOAD_SIZE. This means the
modification can be done in place, eliminating the overhead of template
substitution and variable reassignment.

A brief test (excluding the overhead of apt --print-uris) shows that the
speedup is 1.28x (2.83/2.22).

Before:
2.20user 0.76system 0:02.85elapsed 103%CPU (0avgtext+0avgdata 12436maxresident)k
0inputs+0outputs (0major+616493minor)pagefaults 0swaps
After:
1.73user 0.60system 0:02.22elapsed 105%CPU (0avgtext+0avgdata 12300maxresident)k
0inputs+0outputs (0major+487300minor)pagefaults 0swaps
The only user of urldecode() is get_uris() and the former is only called
in one place in a loop. It turns out that it is one of the severest
performance hot spots remaining (the other one is get_auth(), but
inlining it would hurt the readability).

The patch inlines urldecode() into the only place calling it, while
conditionally skipping it when the file name does not contain any
urlencoded string. When a file name needs to be decoded, the inlining
eliminates the overheads of subshell and function calls, which consist
of an expensive fork(). When a file name doesn't need to be decoded
(this is very likely, in my case, 73.8%), skipping decoding further
eliminates an additional subshell to call printf.

A brief test (excluding the overhead of apt --print-uris) shows that the
speedup is 1.37x (2.22/1.62).

Before:
1.73user 0.60system 0:02.22elapsed 105%CPU (0avgtext+0avgdata 12300maxresident)k
0inputs+0outputs (0major+487300minor)pagefaults 0swaps
After:
1.25user 0.45system 0:01.62elapsed 105%CPU (0avgtext+0avgdata 12328maxresident)k
0inputs+0outputs (0major+331489minor)pagefaults 0swaps
It turns out that the severest performance hot spot remaining is
get_auth(), especially when it is not explicitly disabled but no netrc
is configured, which should be most users' case. In such a scenario,
calling get_auth() is effectively a no-op with significant overhead.

The patch disables APT_FAST_APT_AUTH when prepare_auth finds no netrc is
configured and makes get_uris() only call get_auth when
APT_FAST_APT_AUTH is enabled. This eliminates the overheads of
subshells, empty loops, and function calls. Users needing apt_auth
should be almost unaffected since the newly introduced conditional
branch only has the additional cost of ~1us.

A brief test (excluding the overhead of apt --print-uris) shows that the
speedup is 2.42x (1.62/0.67).

Before:
1.25user 0.45system 0:01.62elapsed 105%CPU (0avgtext+0avgdata 12328maxresident)k
0inputs+0outputs (0major+331489minor)pagefaults 0swaps
After:
0.47user 0.23system 0:00.67elapsed 104%CPU (0avgtext+0avgdata 12344maxresident)k
0inputs+0outputs (0major+77138minor)pagefaults 0swaps
@Rongronggg9
Copy link
Contributor Author

Rongronggg9 commented Mar 15, 2025

Fun fact: the speedup of #239 turns out to be 5.28x (30.89/5.85) while excluding the overhead of apt --print-uris.

Before:

32.24user 6.67system 0:30.89elapsed 125%CPU (0avgtext+0avgdata 12316maxresident)k
0inputs+0outputs (1major+8288603minor)pagefaults 0swaps

After:

5.40user 1.13system 0:05.85elapsed 111%CPU (0avgtext+0avgdata 12452maxresident)k
0inputs+0outputs (0major+1389815minor)pagefaults 0swaps

zhiyuan1i pushed a commit to spark-store-project/spark-store that referenced this pull request Mar 17, 2025
Copy link

@aminya aminya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been using this in production in setup-cpp, and it works without any issues. Nice speed boost!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants