[ty] Pluck some low hanging performance fruit for completions #22630
BurntSushi merged 10 commits into main
Conversation
Typing conformance results: No changes detected ✅
Force-pushed a5ac57d to 68e9b49
Can you say more about why you chose to create an entirely separate crate over integrating the benchmarks into ty_walltime or the Python benchmarks in scripts/ty_benchmark (which even includes code to benchmark an LSP server)? Integrating into ty_walltime has the advantage that we could decide to run the benchmarks as part of our CI pipeline. Integrating into ty_benchmark has the advantage that they're easier to discover. Both benchmarks also already provide the necessary infrastructure to install dependencies and definitions for common ecosystem projects.
Honestly, the reason is partially just that I wasn't aware of those. :-) And this particular program is just a copy of ty_completion_eval but stripped down for ad hoc benchmarking of completions specifically. So it was extremely quick and easy to add.
Looking at scripts/ty_benchmark, it looks like that requires doing a full build of ty which means longer iteration times:
```
$ touch ./crates/ty_ide/src/completion.rs && time cargo build --bin ty_completion_bench --profile profiling
   Compiling ty_ide v0.0.0 (/home/andrew/astral/ruff/pr1/crates/ty_ide)
   Compiling ty_completion_bench v0.0.0 (/home/andrew/astral/ruff/pr1/crates/ty_completion_bench)
    Finished `profiling` profile [optimized + debuginfo] target(s) in 3.24s

real    3.313
user    16.869
sys     1.102
maxmem  1227 MB
faults  1767

$ touch ./crates/ty_ide/src/completion.rs && time cargo build --bin ty --profile profiling
   Compiling ty_ide v0.0.0 (/home/andrew/astral/ruff/pr1/crates/ty_ide)
   Compiling ty_server v0.0.0 (/home/andrew/astral/ruff/pr1/crates/ty_server)
   Compiling ty v0.0.0 (/home/andrew/astral/ruff/pr1/crates/ty)
    Finished `profiling` profile [optimized + debuginfo] target(s) in 8.93s

real    9.005
user    1:15.21
sys     1.702
maxmem  1539 MB
faults  2140
```
The program was also made with profiling and ad hoc testing in mind. I'd have to actually go through the process of adding completions to ty_benchmark to figure out whether it would work for that. But my program does the minimal amount of work necessary to do a completion request and lets you also repeat that completion request to test the cached case. And you can see the actual completion output, which can be important for understanding what is actually happening. Similarly for ty_walltime, which uses divan and doesn't seem to have a good profiling story.
Now to be clear, I do think we should add completion benchmarks to both ty_benchmark and ty_walltime. Because, yes, running the benchmarks in CI and running them as part of a more complete LSP flow seems like a wise thing to do. But I'd still want this little CLI for ad hoc testing and profiling.
I can add a README noting all of this. Maybe we can get to a point where the CLI isn't needed any more.
Thanks for the context.
I think my concern mainly is this:

> Honestly, the reason is partially just that I wasn't aware of those. :-)

But that's addressed if the plan is to ultimately add those benchmarks to ty_walltime, and this is more of an ad hoc script.
```diff
 db: &'db dyn Db,
 context: CollectionContext<'db>,
-items: Vec<Completion<'db>>,
+items: BinaryHeap<CompletionRanker<'db>>,
```
Nice! That's a much better approach than my brute force truncate call.
It's sometimes useful to just create a scratch environment for ad hoc testing.
This is mostly just a stripped down version of `ty_completion_eval`. Basically, you point it at a Python file, give it a byte offset and it does the rest. It also lets one repeat the completion request after the initial request for benchmarking the cached case.
A surprising amount of time is spent sorting and comparing completions. This commit does a very simple thing: it switches the sorts to unstable and optimizes some comparisons to avoid retrieving data we don't need for sorting.

Before:

```
$ ./target/profiling/ty_completion_bench ~/astral/relatedclones/scratch-home-assistant/homeassistant/scratch.py 1 -q --iters 30
total elapsed for initial completions request: 603.491595ms
total elapsed: 7.36554807s, time per completion request: 245.518269ms
```

After:

```
$ ./target/profiling/ty_completion_bench ~/astral/relatedclones/scratch-home-assistant/homeassistant/scratch.py 1 -q --iters 30
total elapsed for initial completions request: 570.040193ms
total elapsed: 5.740541847s, time per completion request: 191.351394ms
```
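To illustrate the kind of change this commit describes, here's a minimal Rust sketch, with a made-up `Completion` type rather than ty's actual one, of an unstable sort whose comparator touches only the fields needed for ranking:

```rust
// Hypothetical stand-in for a completion item; field names are illustrative.
#[derive(Debug)]
struct Completion {
    name: String,
    relevance: u32,
}

fn sort_completions(items: &mut [Completion]) {
    // `sort_unstable_by` avoids the extra allocation a stable merge sort
    // needs, and the comparator reads only the fields relevant to ranking
    // rather than computing any derived data per comparison.
    items.sort_unstable_by(|a, b| {
        b.relevance
            .cmp(&a.relevance) // higher relevance first
            .then_with(|| a.name.cmp(&b.name)) // ties broken by name
    });
}

fn main() {
    let mut items = vec![
        Completion { name: "zlib".into(), relevance: 1 },
        Completion { name: "abs".into(), relevance: 2 },
        Completion { name: "all".into(), relevance: 2 },
    ];
    sort_completions(&mut items);
    let names: Vec<&str> = items.iter().map(|c| c.name.as_str()).collect();
    assert_eq!(names, ["abs", "all", "zlib"]);
}
```

Instability is harmless here because the comparator already induces a deterministic order on distinct items.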
Instead of locking and unlocking the mutex for every symbol, we gather what we can and only lock the mutex after processing a module.

Before:

```
$ ./target/profiling/ty_completion_bench ~/astral/relatedclones/scratch-home-assistant/homeassistant/scratch.py 1 -q --iters 30
total elapsed for initial completions request: 579.057075ms
total elapsed: 5.627666455s, time per completion request: 187.588881ms
```

After:

```
$ ./target/profiling/ty_completion_bench ~/astral/relatedclones/scratch-home-assistant/homeassistant/scratch.py 1 -q --iters 30
total elapsed for initial completions request: 565.911487ms
total elapsed: 5.254306446s, time per completion request: 175.143548ms
```

This is a very naive/simplistic "improvement." We can probably do better.
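The batching pattern can be sketched like so; `collect_batched` and the uppercasing step are made up here and stand in for ty's actual per-module symbol processing:

```rust
use std::sync::Mutex;
use std::thread;

// Each worker gathers a module's worth of results in a local Vec and
// takes the shared lock exactly once per module, rather than once per
// symbol. Contention drops from O(symbols) to O(modules).
fn collect_batched(modules: Vec<Vec<String>>) -> Vec<String> {
    let shared = Mutex::new(Vec::new());
    thread::scope(|s| {
        for module in &modules {
            let shared = &shared;
            s.spawn(move || {
                // Gather locally without holding the lock.
                let local: Vec<String> =
                    module.iter().map(|sym| sym.to_uppercase()).collect();
                // One lock acquisition per module.
                shared.lock().unwrap().extend(local);
            });
        }
    });
    shared.into_inner().unwrap()
}

fn main() {
    let mut out = collect_batched(vec![
        vec!["foo".to_string(), "bar".to_string()],
        vec!["baz".to_string()],
    ]);
    out.sort_unstable();
    assert_eq!(out, ["BAR", "BAZ", "FOO"]);
}
```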
This should be a non-behavior-changing commit that tweaks how we rank completions. We now use a wrapper type and include the module name for more precise sorting when relevance and symbol name are identical. The purpose of this is to prepare the code to switch to a max-heap.
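A hypothetical version of such a wrapper, with illustrative fields rather than ty's real `CompletionRanker` definition, might implement a total order like this:

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

// Illustrative ranking wrapper: ordered by relevance, then symbol name,
// then module name, so that ties are broken deterministically and the
// type can sit in a max-heap.
#[derive(Debug, PartialEq, Eq)]
struct Ranked {
    relevance: u32,
    name: String,
    module: String,
}

impl Ord for Ranked {
    fn cmp(&self, other: &Self) -> Ordering {
        // Higher relevance is "greater" (popped first from a max-heap);
        // for names and modules, lexicographically smaller is "greater"
        // so the best-sorting entry surfaces first.
        self.relevance
            .cmp(&other.relevance)
            .then_with(|| other.name.cmp(&self.name))
            .then_with(|| other.module.cmp(&self.module))
    }
}

impl PartialOrd for Ranked {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

fn main() {
    let mut heap = BinaryHeap::new();
    heap.push(Ranked { relevance: 1, name: "b".into(), module: "m".into() });
    heap.push(Ranked { relevance: 2, name: "a".into(), module: "m".into() });
    // The heap's maximum is the most relevant completion.
    assert_eq!(heap.pop().unwrap().name, "a");
}
```

Including the module name in the comparison is what makes the order total when relevance and symbol name coincide.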
It turns out this wasn't actually doing much and was just papering over the possibility of duplicates being returned from `SemanticModel`. So this was doing a fair bit of work for no good reason. Instead, we push deduplication down into semantic completions directly. This will run over a (typically) much smaller number of completions. With that said, this doesn't seem to lead to improvement in ad hoc benchmarking. However, perf wasn't really the main motivation here: this change is primarily to prep for switching to a max-heap. This deduplication was problematic because it was being done *after* sorting, which meant it depended on having the entire set of completions in memory. But with a max-heap, we'll only keep around the top-K completions, so deduplication has to be done "earlier" (if at all).
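The "dedup during collection" idea can be sketched with a set of seen keys; plain names here are a stand-in for whatever key ty actually dedupes on:

```rust
use std::collections::HashSet;

// Filters duplicates as candidates stream in, so no fully sorted list
// is required before deduplication can happen. First occurrence wins.
fn collect_unique<I: IntoIterator<Item = String>>(candidates: I) -> Vec<String> {
    let mut seen = HashSet::new();
    let mut out = Vec::new();
    for name in candidates {
        // `insert` returns false if the name was already present.
        if seen.insert(name.clone()) {
            out.push(name);
        }
    }
    out
}

fn main() {
    let out = collect_unique(["a", "b", "a", "c"].map(String::from));
    assert_eq!(out, ["a", "b", "c"]);
}
```

This is what makes the approach compatible with a bounded heap: duplicates are rejected before they ever compete for a slot.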
This is effectively what's going to happen when we switch to a max-heap for collecting completions. This commit isolates that behavioral change to ensure nothing unexpected happens. (I generally wouldn't expect it to, since that would imply that imports/qualifications care about results beyond the first 1,000.)
The previous commits made this change deliciously simple. The key point is that we don't collect a completion if it has no way of being ranked into the top 1000 that we return.

Before:

```
$ ./target/profiling/ty_completion_bench ~/astral/relatedclones/scratch-home-assistant/homeassistant/scratch.py 1 -q --iters 30
total elapsed for initial completions request: 554.746713ms
total elapsed: 5.435318983s, time per completion request: 181.177299ms
```

After:

```
$ ./target/profiling/ty_completion_bench ~/astral/relatedclones/scratch-home-assistant/homeassistant/scratch.py 1 -q --iters 30
total elapsed for initial completions request: 526.743638ms
total elapsed: 4.268009725s, time per completion request: 142.26699ms
```

This is an especially nice speed-up for the cached case, where lots of time (proportionally) was being spent sorting a potentially very large list of completions. There are potential future wins here. For example, every auto-import completion has an `import` statement generated for it. We could instead defer that work until the very end, when we convert the max-heap into a `Vec`. Before then, the completion might get dropped, and there's no point in doing extra work. But that will require more refactoring (and possibly another `CompletionIntermediate` type? blech).
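The "skip anything that can't rank into the top K" idea can be sketched with a bounded heap. `Reverse` keeps the *worst* retained score on top, so a candidate can be rejected with a single comparison; K is 1000 in the actual change, and plain scores stand in for ranked completions here:

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

// Keeps the k largest scores seen. The heap holds at most k entries and
// its top (via Reverse) is the smallest retained score, so anything that
// can't beat it is dropped without further work.
fn top_k(scores: impl IntoIterator<Item = u32>, k: usize) -> Vec<u32> {
    let mut heap: BinaryHeap<Reverse<u32>> = BinaryHeap::with_capacity(k);
    for score in scores {
        if heap.len() < k {
            heap.push(Reverse(score));
        } else if let Some(&Reverse(worst)) = heap.peek() {
            if score > worst {
                // Candidate outranks the current worst kept item.
                heap.pop();
                heap.push(Reverse(score));
            }
            // Otherwise the candidate is rejected after one comparison.
        }
    }
    // Only k items ever need sorting, not the full candidate set.
    let mut out: Vec<u32> = heap.into_iter().map(|Reverse(s)| s).collect();
    out.sort_unstable_by(|a, b| b.cmp(a)); // best first
    out
}

fn main() {
    assert_eq!(top_k([5, 1, 9, 3, 7], 3), vec![9, 7, 5]);
}
```

This is also why deferring per-item work (like generating `import` statements) pays off: most candidates never survive the `peek` check.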
I didn't dig into where this change came from, but 1) it's a small change and 2) this one tends to be noisy since there are so many things that match `array`.
Force-pushed 68e9b49 to 0fba952
Force-pushed 0fba952 to 1e49a13
I created astral-sh/ty#2570 for tracking adding benchmarks to `ty_walltime` and `ty_benchmark`.
Force-pushed 1e49a13 to 7402377
This PR adds a new ad hoc benchmarking CLI for completions and implements a few low hanging optimizations. All told, for my particular benchmark (asking for completions on `r` in a blank file within a checkout of Home Assistant), we get about a 40% improvement (~250ms down to ~140ms) in the cached case and a more modest ~12.5% improvement in the uncached case (~600ms down to ~526ms).
Closes astral-sh/ty#2298