
Generate signature map at build time#1498

Merged
WardBrian merged 12 commits into master from pre-generate-signatures
Mar 13, 2025

Conversation

@WardBrian
Member

This is something that came out of some profiling I did recently. Essentially, for small- to medium-sized models, the compiler spends most of its time just loading the Stan_math_signatures module and the large hashtable of function signatures it contains:

Call graph 'stanc test/integration/good/code-gen/8_schools_ncp.stan':
---------------------------------------------------------------------
[   63.78M cycles in     1 calls ]     - 51.02% : load(Stan_math_signatures)
[   27.97M cycles in   135 calls ]     |   - 43.86% : Stan_math_signatures.mk_declarative_sig
[    4.62M cycles in  2119 calls ]     |   |   - 16.52% : Stan_math_signatures.rng_return_type
[    1.63M cycles in  2980 calls ]     |   |   |   - 35.23% : Stan_math_signatures.is_primitive
[    3.38M cycles in   591 calls ]     |   |   - 12.07% : Stan_math_signatures.expand_arg
[    1.42M cycles in  2280 calls ]     |   |   |   - 42.00% : Stan_math_signatures.bare_array_type
[    2.56M cycles in   242 calls ]     |   |   -  9.17% : Stan_math_signatures.all_combinations
[    1.20M cycles in  2041 calls ]     |   |   -  4.30% : Stan_math_signatures.ints_to_real
[   18.66M cycles in    17 calls ]     |   - 29.25% : Stan_math_signatures.add_binary_vec
[   17.41M cycles in    17 calls ]     |   |   - 93.33% : Stan_math_signatures.add_binary_vec_general
[    3.88M cycles in  7480 calls ]     |   |   |   - 22.31% : Stan_math_signatures.bare_array_type
[    2.18M cycles in  3468 calls ]     |   |   |   - 12.54% : Stan_math_signatures.add_unqualified
[    1.71M cycles in  3468 calls ]     |   |   |   -  9.79% : Stan_math_signatures.ints_to_real
[    1.59M cycles in     6 calls ]     |   -  2.50% : Stan_math_signatures.add_binary_vec_int_real
[  516.48K cycles in  1056 calls ]     |   |   - 32.41% : Stan_math_signatures.bare_array_type
[  242.23K cycles in   432 calls ]     |   |   - 15.20% : Stan_math_signatures.add_unqualified
[  990.39K cycles in     3 calls ]     |   -  1.55% : Stan_math_signatures.add_binary_vec_real_real
[  986.71K cycles in     3 calls ]     |   |   - 99.63% : Stan_math_signatures.add_binary_vec_general
[  312.77K cycles in   672 calls ]     |   |   |   - 31.70% : Stan_math_signatures.bare_array_type
[  153.63K cycles in   291 calls ]     |   |   |   - 15.57% : Stan_math_signatures.add_unqualified
[  887.62K cycles in  1232 calls ]     |   -  1.39% : Stan_math_signatures.add_unqualified
[  806.55K cycles in     3 calls ]     |   -  1.26% : Stan_math_signatures.add_binary_vec_real_int
[  251.65K cycles in   528 calls ]     |   |   - 31.20% : Stan_math_signatures.bare_array_type
[  120.48K cycles in   216 calls ]     |   |   - 14.94% : Stan_math_signatures.add_unqualified
[   39.95M cycles in     1 calls ]     - 31.96% : load(Environment)
[   16.87M cycles in 22451 calls ]     |   - 42.22% : Fun_kind.suffix_from_name
[   19.19M cycles in     1 calls ]     - 15.35% : load(stanc)

This is particularly annoying because this hashtable is static data. After some discussions on the OCaml forums about options, the following seems like the least intrusive approach:

  • Have a small standalone executable that does the work of computing all the overloads, and generates a small module containing a compact binary representation of the result
  • Have dune integrate this into the build seamlessly
  • At runtime, just deserialize that representation and proceed as before
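The steps above can be sketched roughly as follows. This is a toy illustration with hypothetical names and a dummy payload, not the actual stanc3 code; it only shows the general Marshal-into-a-string-literal trick. The generator builds the table exactly as before, then prints a fully-formed .ml module that embeds the marshalled bytes as an escaped string literal:

```ocaml
(* Hypothetical generator sketch: build the table, then emit a module
   that reconstructs it at load time via Marshal.from_string. *)
let () =
  (* Stand-in for the real signature table construction. *)
  let table : (string, int list) Hashtbl.t = Hashtbl.create 16 in
  Hashtbl.replace table "example_fn" [1; 2; 3];
  (* Serialize the finished table to a binary blob. *)
  let blob = Marshal.to_string table [] in
  (* %S escapes the binary bytes into a valid OCaml string literal,
     so the output compiles as ordinary source text. *)
  Printf.printf "let signatures : (string, int list) Hashtbl.t =\n";
  Printf.printf "  Marshal.from_string %S 0\n" blob
```

At runtime the generated module pays only the cost of one `Marshal.from_string` call instead of re-running all the overload-expansion logic; the type annotation on the generated binding pins down the otherwise polymorphic return type of `Marshal.from_string`.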

This effectively removes 90% of the overhead of "doing nothing" in the compiler, from ~90M cycles above to ~8M cycles:

Call graph 'stanc test/integration/good/code-gen/8_schools_ncp.stan':
---------------------------------------------------------------------
[   22.59M cycles in   1 calls ]     - 62.60% : load(stanc)
[    7.55M cycles in   1 calls ]     - 20.91% : load(Generated_signatures)
[    2.37M cycles in   1 calls ]     -  6.57% : load(Lower_expr)
[    1.59M cycles in   1 calls ]     -  4.41% : load(Environment)

Remarkably, this also doesn't really change the binary size at all, because the generated module is only a few hundred kb as source text.

Besides making the compiler faster for end users, I also observe a net ~20% speedup of dune runtest, which is nice as well.

There are some unfortunate parts of doing this:

  • Avoiding circular dependencies in dune made it necessary to move the Stan_math_signatures file out of Middle -- though I had wanted to do this anyway.
  • To avoid the generator and generated files trying to depend on each other, any common type definitions needed to be moved somewhere. I took this opportunity to give UnsizedType.ml a few new helper definitions, which really cleaned things up, but it does mean some code in weird places got touched by this change.
  • As a result of the above, the sort order of signatures in the printed output changed. It now sorts by argument type, previously it was sorting by return type. I think the new behavior actually makes more sense. This is the only reason any files in test/ changed.
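For the dune side, the integration can look something like the following. This is a hypothetical stanza (the actual target, executable, and library names in the PR may differ): a rule runs the generator at build time and captures its stdout as the generated module, which lives in a separate library from the generator so neither depends on the other.

```dune
; Hypothetical dune rule: run the generator executable at build time,
; writing its stdout to Generated_signatures.ml. The library that
; contains this rule compiles the generated file like any other module.
(rule
 (targets Generated_signatures.ml)
 (deps gen_signatures.exe)
 (action
  (with-stdout-to Generated_signatures.ml
   (run ./gen_signatures.exe))))
```

Because the rule's only dependency is the generator executable, dune rebuilds the module exactly when the generator changes, and the rest of the build consumes it as ordinary source.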

Submission Checklist

  • Run unit tests
  • Documentation
    • If a user-facing change was made, the documentation PR is here:
    • OR, no user-facing changes were made

Release notes

Decreased the start-up overhead of the compiler.

Copyright and Licensing

By submitting this pull request, the copyright holder is agreeing to
license the submitted work under the BSD 3-clause license (https://opensource.org/licenses/BSD-3-Clause)

@WardBrian added the cleanup (code simplification or clean-up) and performance (for issues that affect performance) labels Mar 6, 2025
@codecov

codecov bot commented Mar 6, 2025

Codecov Report

Attention: Patch coverage is 84.69388% with 15 lines in your changes missing coverage. Please review.

Project coverage is 89.52%. Comparing base (7f5261e) to head (9c82399).
Report is 14 commits behind head on master.

Files with missing lines Patch % Lines
src/stan_math_signatures/Stan_math_signatures.ml 83.90% 14 Missing ⚠️
src/middle/UnsizedType.ml 66.66% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1498      +/-   ##
==========================================
- Coverage   90.46%   89.52%   -0.94%     
==========================================
  Files          66       66              
  Lines       10712     9684    -1028     
==========================================
- Hits         9691     8670    -1021     
+ Misses       1021     1014       -7     
Files with missing lines Coverage Δ
src/analysis_and_optimization/Memory_patterns.ml 88.69% <100.00%> (-0.08%) ⬇️
src/frontend/Environment.ml 95.23% <100.00%> (-0.12%) ⬇️
src/frontend/SignatureMismatch.ml 85.16% <100.00%> (ø)
src/frontend/Typechecker.ml 93.04% <100.00%> (-0.02%) ⬇️
src/stan_math_backend/Lower_expr.ml 93.57% <100.00%> (ø)
src/stanc/stanc.ml 86.66% <ø> (ø)
src/middle/UnsizedType.ml 82.19% <66.66%> (+0.18%) ⬆️
src/stan_math_signatures/Stan_math_signatures.ml 83.90% <83.90%> (ø)

... and 1 file with indirect coverage changes


@WardBrian
Member Author

Some more timings (total wall-clock):

Benchmark 1: ./stanc-marshal-4.14 test/integration/good/code-gen/8_schools_ncp.stan
  Time (mean ± σ):      19.1 ms ±   1.0 ms    [User: 11.5 ms, System: 7.5 ms]
  Range (min … max):    16.5 ms …  22.6 ms    168 runs
 
Benchmark 2: ./stanc-4.14 test/integration/good/code-gen/8_schools_ncp.stan
  Time (mean ± σ):      27.8 ms ±   1.5 ms    [User: 19.9 ms, System: 7.7 ms]
  Range (min … max):    24.5 ms …  31.9 ms    100 runs
 
Summary
  ./stanc-marshal-4.14 test/integration/good/code-gen/8_schools_ncp.stan ran
    1.46 ± 0.11 times faster than ./stanc-4.14 test/integration/good/code-gen/8_schools_ncp.stan

When using OCaml 5, this advantage does shrink, I believe due to a known regression

Benchmark 1: ./stanc-marshal-5.3 test/integration/good/code-gen/8_schools_ncp.stan
  Time (mean ± σ):      27.1 ms ±   1.2 ms    [User: 18.9 ms, System: 8.0 ms]
  Range (min … max):    24.6 ms …  31.2 ms    111 runs
 
Benchmark 2: ./stanc-5.3 test/integration/good/code-gen/8_schools_ncp.stan
  Time (mean ± σ):      35.7 ms ±   2.9 ms    [User: 27.6 ms, System: 7.9 ms]
  Range (min … max):    30.9 ms …  53.5 ms    85 runs
 
Summary
  ./stanc-marshal-5.3 test/integration/good/code-gen/8_schools_ncp.stan ran
    1.32 ± 0.12 times faster than ./stanc-5.3 test/integration/good/code-gen/8_schools_ncp.stan

On a much larger model:
OCaml 4.14:

Benchmark 1: ./stanc-marshal-4.14 test/integration/good/code-gen/mother.stan
  Time (mean ± σ):      63.6 ms ±   3.9 ms    [User: 52.9 ms, System: 10.5 ms]
  Range (min … max):    57.2 ms …  80.6 ms    48 runs
 
Benchmark 2: ./stanc-4.14 test/integration/good/code-gen/mother.stan
  Time (mean ± σ):      79.4 ms ±   4.4 ms    [User: 69.3 ms, System: 9.9 ms]
  Range (min … max):    75.3 ms …  97.4 ms    39 runs
 
Summary
  ./stanc-marshal-4.14 test/integration/good/code-gen/mother.stan ran
    1.25 ± 0.10 times faster than ./stanc-4.14 test/integration/good/code-gen/mother.stan

OCaml 5.3 (note that on this larger model, 5.3 is faster than 4.14 on the master implementation, which also makes the overall speedup look worse):

Benchmark 1: ./stanc-marshal-5.3 test/integration/good/code-gen/mother.stan
  Time (mean ± σ):      72.2 ms ±   3.0 ms    [User: 60.1 ms, System: 11.7 ms]
  Range (min … max):    66.7 ms …  80.8 ms    41 runs
 
Benchmark 2: ./stanc-5.3 test/integration/good/code-gen/mother.stan
  Time (mean ± σ):      76.8 ms ±   2.3 ms    [User: 65.3 ms, System: 11.3 ms]
  Range (min … max):    71.1 ms …  81.3 ms    41 runs
 
Summary
  ./stanc-marshal-5.3 test/integration/good/code-gen/mother.stan ran
    1.06 ± 0.06 times faster than ./stanc-5.3 test/integration/good/code-gen/mother.stan

Collaborator

@nhuurre left a comment


This is the only reason any files in test/ changed.

That's the only reason test/integration/{bad,signatures} changed. Various test/**/mir.expected files got additional parens around the newly named UnsizedType.signature.
Also test/unit/dune lost a couple of ppx_deriving, not sure why they were ever there?

@WardBrian
Member Author

WardBrian commented Mar 7, 2025

Also test/unit/dune lost a couple of ppx_deriving, not sure why they were ever there?

Yes, this was a bit of a drive-by, but every time I learn a bit more about dune I realize we've been doing something unnecessary like over-depending on things we're not using in those modules (like in #1489).

These were edited when I thought doing this would require depending on ppx_blob, so I was looking at our existing ppx usage. Eventually I realized we didn't need the new dependency if I just had the generated file be a fully-formed .ml file.

@WardBrian WardBrian requested a review from nhuurre March 10, 2025 16:32
@WardBrian WardBrian merged commit 3aa50a9 into master Mar 13, 2025
3 checks passed
@WardBrian WardBrian deleted the pre-generate-signatures branch March 13, 2025 18:23
