
Generate signature map at build time#1498

Merged
WardBrian merged 12 commits into master from pre-generate-signatures
Mar 13, 2025

Conversation

@WardBrian
Member

This is something that came out of some profiling I did recently. Essentially, for small- to medium-sized models, the compiler spends most of its time just loading the Stan_math_signatures module and the large hashtable of function signatures it contains:

Call graph 'stanc test/integration/good/code-gen/8_schools_ncp.stan':
---------------------------------------------------------------------
[   63.78M cycles in     1 calls ]     - 51.02% : load(Stan_math_signatures)
[   27.97M cycles in   135 calls ]     |   - 43.86% : Stan_math_signatures.mk_declarative_sig
[    4.62M cycles in  2119 calls ]     |   |   - 16.52% : Stan_math_signatures.rng_return_type
[    1.63M cycles in  2980 calls ]     |   |   |   - 35.23% : Stan_math_signatures.is_primitive
[    3.38M cycles in   591 calls ]     |   |   - 12.07% : Stan_math_signatures.expand_arg
[    1.42M cycles in  2280 calls ]     |   |   |   - 42.00% : Stan_math_signatures.bare_array_type
[    2.56M cycles in   242 calls ]     |   |   -  9.17% : Stan_math_signatures.all_combinations
[    1.20M cycles in  2041 calls ]     |   |   -  4.30% : Stan_math_signatures.ints_to_real
[   18.66M cycles in    17 calls ]     |   - 29.25% : Stan_math_signatures.add_binary_vec
[   17.41M cycles in    17 calls ]     |   |   - 93.33% : Stan_math_signatures.add_binary_vec_general
[    3.88M cycles in  7480 calls ]     |   |   |   - 22.31% : Stan_math_signatures.bare_array_type
[    2.18M cycles in  3468 calls ]     |   |   |   - 12.54% : Stan_math_signatures.add_unqualified
[    1.71M cycles in  3468 calls ]     |   |   |   -  9.79% : Stan_math_signatures.ints_to_real
[    1.59M cycles in     6 calls ]     |   -  2.50% : Stan_math_signatures.add_binary_vec_int_real
[  516.48K cycles in  1056 calls ]     |   |   - 32.41% : Stan_math_signatures.bare_array_type
[  242.23K cycles in   432 calls ]     |   |   - 15.20% : Stan_math_signatures.add_unqualified
[  990.39K cycles in     3 calls ]     |   -  1.55% : Stan_math_signatures.add_binary_vec_real_real
[  986.71K cycles in     3 calls ]     |   |   - 99.63% : Stan_math_signatures.add_binary_vec_general
[  312.77K cycles in   672 calls ]     |   |   |   - 31.70% : Stan_math_signatures.bare_array_type
[  153.63K cycles in   291 calls ]     |   |   |   - 15.57% : Stan_math_signatures.add_unqualified
[  887.62K cycles in  1232 calls ]     |   -  1.39% : Stan_math_signatures.add_unqualified
[  806.55K cycles in     3 calls ]     |   -  1.26% : Stan_math_signatures.add_binary_vec_real_int
[  251.65K cycles in   528 calls ]     |   |   - 31.20% : Stan_math_signatures.bare_array_type
[  120.48K cycles in   216 calls ]     |   |   - 14.94% : Stan_math_signatures.add_unqualified
[   39.95M cycles in     1 calls ]     - 31.96% : load(Environment)
[   16.87M cycles in 22451 calls ]     |   - 42.22% : Fun_kind.suffix_from_name
[   19.19M cycles in     1 calls ]     - 15.35% : load(stanc)

This is particularly annoying because this hashtable is static data. After some discussions on the OCaml forums about options, the following seems like the least intrusive approach:

  • Have a small standalone executable that does the work of computing all the overloads, and generates a small module containing a compact binary representation of the result
  • Have dune integrate this into the build seamlessly
  • At runtime, just deserialize that representation and proceed as before
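The steps above can be sketched roughly as follows. This is a toy illustration with hypothetical names and a dummy payload, not the actual stanc3 code; it only shows the general Marshal-into-a-string-literal trick. The generator builds the table exactly as before, then prints a fully-formed .ml module that embeds the marshalled bytes as an escaped string literal:

```ocaml
(* Hypothetical generator sketch: build the table, then emit a module
   that reconstructs it at load time via Marshal.from_string. *)
let () =
  (* Stand-in for the real signature table construction. *)
  let table : (string, int list) Hashtbl.t = Hashtbl.create 16 in
  Hashtbl.replace table "example_fn" [1; 2; 3];
  (* Serialize the finished table to a binary blob. *)
  let blob = Marshal.to_string table [] in
  (* %S escapes the binary bytes into a valid OCaml string literal,
     so the output compiles as ordinary source text. *)
  Printf.printf "let signatures : (string, int list) Hashtbl.t =\n";
  Printf.printf "  Marshal.from_string %S 0\n" blob
```

At runtime the generated module pays only the cost of one `Marshal.from_string` call instead of re-running all the overload-expansion logic; the type annotation on the generated binding pins down the otherwise polymorphic return type of `Marshal.from_string`.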

This effectively removes 90% of the overhead of "doing nothing" in the compiler, from ~90M cycles above to ~8M cycles:

Call graph 'stanc test/integration/good/code-gen/8_schools_ncp.stan':
---------------------------------------------------------------------
[   22.59M cycles in   1 calls ]     - 62.60% : load(stanc)
[    7.55M cycles in   1 calls ]     - 20.91% : load(Generated_signatures)
[    2.37M cycles in   1 calls ]     -  6.57% : load(Lower_expr)
[    1.59M cycles in   1 calls ]     -  4.41% : load(Environment)

Remarkably, this also doesn't really change the binary size at all, because the generated module is only a few hundred kb as source text.

Besides making the compiler faster for end users, I also observe a net ~20% speedup of dune runtest, which is nice as well.

There are some unfortunate parts of doing this:

  • Avoiding circular dependencies in dune made it necessary to move the Stan_math_signatures file out of Middle -- though I had wanted to do this anyway.
  • To avoid the generator and generated files trying to depend on each other, any common type definitions needed to be moved somewhere. I took this opportunity to give UnsizedType.ml a few new helper definitions, which really cleaned things up, but it does mean some code in weird places got touched by this change.
  • As a result of the above, the sort order of signatures in the printed output changed. It now sorts by argument type, previously it was sorting by return type. I think the new behavior actually makes more sense. This is the only reason any files in test/ changed.
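For the dune side, the integration can look something like the following. This is a hypothetical stanza (the actual target, executable, and library names in the PR may differ): a rule runs the generator at build time and captures its stdout as the generated module, which lives in a separate library from the generator so neither depends on the other.

```dune
; Hypothetical dune rule: run the generator executable at build time,
; writing its stdout to Generated_signatures.ml. The library that
; contains this rule compiles the generated file like any other module.
(rule
 (targets Generated_signatures.ml)
 (deps gen_signatures.exe)
 (action
  (with-stdout-to Generated_signatures.ml
   (run ./gen_signatures.exe))))
```

Because the rule's only dependency is the generator executable, dune rebuilds the module exactly when the generator changes, and the rest of the build consumes it as ordinary source.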

Submission Checklist

  • Run unit tests
  • Documentation
    • If a user-facing change was made, the documentation PR is here:
    • OR, no user-facing changes were made

Release notes

Decreased the start-up overhead of the compiler.

Copyright and Licensing

By submitting this pull request, the copyright holder is agreeing to
license the submitted work under the BSD 3-clause license (https://opensource.org/licenses/BSD-3-Clause)

@WardBrian added the cleanup (code simplification or clean-up) and performance (for issues that affect performance) labels Mar 6, 2025
@codecov

codecov bot commented Mar 6, 2025

Codecov Report

Attention: Patch coverage is 84.69388% with 15 lines in your changes missing coverage. Please review.

Project coverage is 89.52%. Comparing base (7f5261e) to head (9c82399).
Report is 14 commits behind head on master.

Files with missing lines Patch % Lines
src/stan_math_signatures/Stan_math_signatures.ml 83.90% 14 Missing ⚠️
src/middle/UnsizedType.ml 66.66% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1498      +/-   ##
==========================================
- Coverage   90.46%   89.52%   -0.94%     
==========================================
  Files          66       66              
  Lines       10712     9684    -1028     
==========================================
- Hits         9691     8670    -1021     
+ Misses       1021     1014       -7     
Files with missing lines Coverage Δ
src/analysis_and_optimization/Memory_patterns.ml 88.69% <100.00%> (-0.08%) ⬇️
src/frontend/Environment.ml 95.23% <100.00%> (-0.12%) ⬇️
src/frontend/SignatureMismatch.ml 85.16% <100.00%> (ø)
src/frontend/Typechecker.ml 93.04% <100.00%> (-0.02%) ⬇️
src/stan_math_backend/Lower_expr.ml 93.57% <100.00%> (ø)
src/stanc/stanc.ml 86.66% <ø> (ø)
src/middle/UnsizedType.ml 82.19% <66.66%> (+0.18%) ⬆️
src/stan_math_signatures/Stan_math_signatures.ml 83.90% <83.90%> (ø)

... and 1 file with indirect coverage changes


@WardBrian
Member Author

Some more timings (total wall-clock):

Benchmark 1: ./stanc-marshal-4.14 test/integration/good/code-gen/8_schools_ncp.stan
  Time (mean ± σ):      19.1 ms ±   1.0 ms    [User: 11.5 ms, System: 7.5 ms]
  Range (min … max):    16.5 ms …  22.6 ms    168 runs
 
Benchmark 2: ./stanc-4.14 test/integration/good/code-gen/8_schools_ncp.stan
  Time (mean ± σ):      27.8 ms ±   1.5 ms    [User: 19.9 ms, System: 7.7 ms]
  Range (min … max):    24.5 ms …  31.9 ms    100 runs
 
Summary
  ./stanc-marshal-4.14 test/integration/good/code-gen/8_schools_ncp.stan ran
    1.46 ± 0.11 times faster than ./stanc-4.14 test/integration/good/code-gen/8_schools_ncp.stan

When using OCaml 5, this advantage does shrink, I believe due to a known regression

Benchmark 1: ./stanc-marshal-5.3 test/integration/good/code-gen/8_schools_ncp.stan
  Time (mean ± σ):      27.1 ms ±   1.2 ms    [User: 18.9 ms, System: 8.0 ms]
  Range (min … max):    24.6 ms …  31.2 ms    111 runs
 
Benchmark 2: ./stanc-5.3 test/integration/good/code-gen/8_schools_ncp.stan
  Time (mean ± σ):      35.7 ms ±   2.9 ms    [User: 27.6 ms, System: 7.9 ms]
  Range (min … max):    30.9 ms …  53.5 ms    85 runs
 
Summary
  ./stanc-marshal-5.3 test/integration/good/code-gen/8_schools_ncp.stan ran
    1.32 ± 0.12 times faster than ./stanc-5.3 test/integration/good/code-gen/8_schools_ncp.stan

On a much larger model:
OCaml 4.14:

Benchmark 1: ./stanc-marshal-4.14 test/integration/good/code-gen/mother.stan
  Time (mean ± σ):      63.6 ms ±   3.9 ms    [User: 52.9 ms, System: 10.5 ms]
  Range (min … max):    57.2 ms …  80.6 ms    48 runs
 
Benchmark 2: ./stanc-4.14 test/integration/good/code-gen/mother.stan
  Time (mean ± σ):      79.4 ms ±   4.4 ms    [User: 69.3 ms, System: 9.9 ms]
  Range (min … max):    75.3 ms …  97.4 ms    39 runs
 
Summary
  ./stanc-marshal-4.14 test/integration/good/code-gen/mother.stan ran
    1.25 ± 0.10 times faster than ./stanc-4.14 test/integration/good/code-gen/mother.stan

OCaml 5.3 (note that on this larger model, 5.3 is faster than 4.14 on the master implementation, which also makes the overall speedup look worse):

Benchmark 1: ./stanc-marshal-5.3 test/integration/good/code-gen/mother.stan
  Time (mean ± σ):      72.2 ms ±   3.0 ms    [User: 60.1 ms, System: 11.7 ms]
  Range (min … max):    66.7 ms …  80.8 ms    41 runs
 
Benchmark 2: ./stanc-5.3 test/integration/good/code-gen/mother.stan
  Time (mean ± σ):      76.8 ms ±   2.3 ms    [User: 65.3 ms, System: 11.3 ms]
  Range (min … max):    71.1 ms …  81.3 ms    41 runs
 
Summary
  ./stanc-marshal-5.3 test/integration/good/code-gen/mother.stan ran
    1.06 ± 0.06 times faster than ./stanc-5.3 test/integration/good/code-gen/mother.stan

Collaborator

@nhuurre left a comment


This is the only reason any files in test/ changed.

That's the only reason test/integration/{bad,signatures} changed. Various test/**/mir.expected files got additional parens around the newly named UnsizedType.signature.
Also test/unit/dune lost a couple of ppx_deriving, not sure why they were ever there?

@WardBrian
Member Author

WardBrian commented Mar 7, 2025

Also test/unit/dune lost a couple of ppx_deriving, not sure why they were ever there?

Yes, this was a bit of a drive-by, but every time I learn a bit more about dune I realize we've been doing something unnecessary like over-depending on things we're not using in those modules (like in #1489).

These were edited when I thought doing this would require depending on ppx_blob, so I was looking at our existing ppx usage. Eventually I realized we didn't need the new dependency if I just had the generated file be a fully-formed .ml file.

@WardBrian WardBrian requested a review from nhuurre March 10, 2025 16:32
@WardBrian WardBrian merged commit 3aa50a9 into master Mar 13, 2025
3 checks passed
@WardBrian WardBrian deleted the pre-generate-signatures branch March 13, 2025 18:23
