
Fix O(N^2) peak memory when re-binning multi-dim binned data #3840

Merged
SimonHeybrock merged 4 commits into main from fix/rebin-multidim-memory-3839
Feb 11, 2026

Conversation

@SimonHeybrock
Member

@SimonHeybrock SimonHeybrock commented Feb 10, 2026

Summary

Fixes #3839
Fixes scipp/essdiffraction#238

  • Route re-binning and re-grouping along existing outer dimensions to combine_bins instead of falling through to _cpp.bin(), which has O(N^2) peak memory and time in unchanged dimensions
  • Guard against incorrectly triggering the optimization when the outer coordinate is multi-dimensional (which requires genuine dimension collapse, not bin merging)

Motivation

Re-binning already-binned data along one dimension with data.bin(dim=new_edges) fell through to _cpp.bin() because _find_replaced_dims returns an empty erase list for this case (the re-binned dimension cancels out in replaced - set(dims)). With erase=[], _can_operate_on_bins returns False, bypassing the efficient combine_bins path.
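The cancellation described above can be seen in a minimal pure-Python sketch (the set names mirror the description; the actual scipp internals are assumed, not quoted):

```python
# Sketch of why _find_replaced_dims yields an empty erase list when
# re-binning along an existing dimension: the dimension consumed by the
# new edges ('two_theta') is also the dimension those edges introduce,
# so it cancels out of the set difference.
replaced = {'two_theta'}   # dims the new edges consume
dims = ('two_theta',)      # dims the new edges introduce
erase = replaced - set(dims)
# erase is empty, so _can_operate_on_bins returns False and the call
# falls through to the expensive _cpp.bin() path.
```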

The fix detects dimensions where:

  • The edge/group is 1-D and references an existing dimension of the data
  • The dimension is NOT an event coordinate (only an outer coordinate)
  • The outer coordinate is 1-D (only spans its own dimension — multi-dimensional coords require genuine dimension collapse)

These conditions identify re-binning/re-grouping, and the erase set is extended so combine_bins can handle it in O(N).
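The three conditions above can be sketched as a standalone predicate. This is a simplification with hypothetical parameter names (the real dispatch lives in make_binned and operates on scipp objects, not plain tuples and dicts):

```python
def is_rebin_of_existing_dim(dim, data_dims, edge_dims,
                             event_coord_names, outer_coord_dims):
    """Return True when re-binning/re-grouping along `dim` can be routed
    to the O(N) combine_bins path instead of _cpp.bin()."""
    return (
        edge_dims == (dim,)                       # edge/group is 1-D ...
        and dim in data_dims                      # ... over an existing dim
        and dim not in event_coord_names          # no event-level coordinate
        and outer_coord_dims.get(dim) == (dim,)   # outer coord spans only dim
    )
```

For example, re-binning `two_theta` on `two_theta × dspacing` data with a 1-D outer `two_theta` coordinate and no event-level `two_theta` coordinate satisfies all three conditions, while a 2-D outer coordinate (spanning, say, a hypothetical `spectrum` dimension as well) does not.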

Route re-binning along existing outer dimensions to combine_bins
instead of falling through to _cpp.bin(). The dispatch logic in
make_binned now detects when edges reference dimensions that are
already present in the data but absent from the erase list (due to
_find_replaced_dims cancelling them out), and extends the erase set
so that _can_operate_on_bins can route to the efficient combine_bins
path.

This fixes O(N^2) peak memory and time scaling in unchanged dimensions
when re-binning multi-dimensional binned data where the re-binned
dimension has no event-level coordinate (only an outer coordinate).

Prompt: Please use a new branch and address #3839. Before making code
changes make sure we have actual tests that perform actual checks of
data with good fidelity. We need to make sure this optimization does
not introduce bugs.
When hide_masked drops rows for a 1-D mask, the coords passed to
_combine_bins still had the original (unmasked) sizes, causing a
DimensionError. Filter coords to match the masked data shape.

Also transpose the result of the re-binning optimization to preserve
the input's dimension order, since combine_bins puts unchanged dims
first which differs from the _cpp.bin behavior.
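Restoring the input's dimension order is a pure reordering. A simplified sketch (it ignores dims renamed by the operation and just keeps surviving dims in their original relative order):

```python
def output_order(input_dims, result_dims):
    """Dimension order to transpose the combine_bins result into, so the
    output matches the input's order rather than unchanged-dims-first."""
    return tuple(d for d in input_dims if d in result_dims)
```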
@SimonHeybrock
Member Author

SimonHeybrock commented Feb 11, 2026

Tested with the essdiffraction use case (scipp/essdiffraction#238)

Isolated re-binning step (synthetic data)

Ran the exact group_two_theta code path from essdiffraction with synthetic data (100K pixels × 100 events/pixel = 10M events, 3 event coords) in isolated subprocesses, varying n_dspacing (the unchanged dimension):

# Replicates focus_data_dspacing_and_two_theta + group_two_theta
focused = pixel_data.bin(two_theta=1024, dspacing=n_dsp)
focused.coords['two_theta'] = sc.midpoints(focused.coords['two_theta'])
result = focused.bin(two_theta=200)  # re-bin two_theta, dspacing unchanged

Before (scipp 26.1.1)

n_dspacing   peak_MB    time_s
        51    2291.0      0.10
       201    3249.0      0.93
       401    9812.0      3.38
       801   37407.0     12.39
      1001   57550.0     18.24

O(N²) in n_dspacing — 43 GB peak for the real-world parameter range, crashing a 32 GB machine.

After (this PR, scipp 25.12.0.dev4)

n_dspacing   peak_MB    time_s
        51       0.0      0.02
       201       0.0      0.03
       401       0.0      0.05
       801       0.0      0.07
      1001       0.0      0.08

Zero measurable peak memory overhead, linear time scaling. The data.bin() call now correctly routes through combine_bins.
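The PR comment does not include the measurement harness, so here is a stdlib-only stand-in for how a per-run peak might be obtained. Note that `tracemalloc` only sees Python-level allocations; a real measurement of scipp's C++ buffers would instead sample the subprocess's RSS (e.g. via `resource.getrusage`):

```python
import tracemalloc

def peak_python_mb(fn, *args):
    """Run fn and report the peak Python-heap usage in MB (a stand-in
    for the per-subprocess RSS measurement described above)."""
    tracemalloc.start()
    try:
        fn(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak / 1e6

# Example: a deliberately large allocation registers a multi-MB peak.
mb = peak_python_mb(lambda: [0] * 10_000_000)
```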

Full DREAM Sciline workflow

Ran the full DreamGeant4Workflow (CSV loading → ToF conversion → focusing → normalization → group_two_theta) computing IntensityDspacingTwoTheta[SampleRun], varying DspacingBins. This is the same setup as test_pipeline_group_by_two_theta in essdiffraction's test suite.

Subsampled test data (10K events per run type)

n_dspacing   peak_MB    time_s
        51    3490.9      1.32
       201    3490.8      1.24
       401    3491.7      1.23
       801    3522.1      1.27
      1001    3523.3      1.27

Full Geant4 simulated data (~2.5M / 1.9M / 0.9M events for sample / vanadium / empty can)

n_dspacing   peak_MB    time_s
       201    4794.5      4.31
       801    4982.4      4.48
      2001    4925.3      4.59

Both peak memory and time are flat across all DspacingBins values. The ~3.5/5.0 GB baselines are the cost of loading and processing Geant4 CSV data through the pipeline — independent of the binning parameter that previously caused the blow-up.

SimonHeybrock and others added 2 commits February 11, 2026 06:53
Remove isinstance(x, DataArray) check that is always true at that
point since all Variable cases are converted earlier.

Prompt: Why did Ubuntu CI fail on the PR? / yes, fix it
…m coords

The rebinning detection in make_binned only iterated over edges, missing
the identical optimization opportunity for groups (re-grouping along an
outer-only dimension). Extended the detection to also consider groups.

Added a guard requiring the outer coordinate to be 1-D
(x.coords[dim].dims == (dim,)). Without this, a multi-dimensional outer
coord incorrectly triggers the optimization, causing a dimension mismatch
in the transpose call. This affects both edges and groups, though the
edge case hadn't been triggered by existing tests.
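The guard itself is a one-line condition on the outer coordinate's dims, shown here with plain tuples rather than scipp objects:

```python
def outer_coord_guard(coord_dims, dim):
    """The optimization may fire only when the outer coordinate spans
    exactly its own dimension; a multi-dimensional outer coord requires
    genuine dimension collapse and must fall through to the general path."""
    return coord_dims == (dim,)
```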

Prompt: Is the branch missing a similar optimization for groups? It only
checks e in edge currently.
Follow-up: Yes, start by writing tests for this case. Continue! Commit
and push, update PR message if applicable. "it just hasn't been triggered
because all the edge tests use 1-D outer coords" sounds like we are
missing a test?

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@SimonHeybrock SimonHeybrock marked this pull request as ready for review February 11, 2026 09:00
@github-project-automation github-project-automation bot moved this to In progress in Development Board Feb 11, 2026
@SimonHeybrock SimonHeybrock moved this from In progress to Selected in Development Board Feb 11, 2026
@SimonHeybrock SimonHeybrock merged commit f10f268 into main Feb 11, 2026
4 checks passed
@SimonHeybrock SimonHeybrock deleted the fix/rebin-multidim-memory-3839 branch February 11, 2026 09:36
@github-project-automation github-project-automation bot moved this from Selected to Done in Development Board Feb 11, 2026

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

O(N²) peak memory when re-binning one dimension of multi-dimensional binned data
Increasing number of dspacing bins eats up RAM

2 participants