Fix O(N^2) peak memory when re-binning multi-dim binned data #3840
SimonHeybrock merged 4 commits into main
Conversation
Route re-binning along existing outer dimensions to `combine_bins` instead of falling through to `_cpp.bin()`. The dispatch logic in `make_binned` now detects when edges reference dimensions that are already present in the data but absent from the erase list (because `_find_replaced_dims` cancels them out), and extends the erase set so that `_can_operate_on_bins` can route to the efficient `combine_bins` path.

This fixes O(N^2) peak memory and time scaling in unchanged dimensions when re-binning multi-dimensional binned data where the re-binned dimension has no event-level coordinate (only an outer coordinate).

Prompt: Please use a new branch and address #3839. Before making code changes make sure we have actual tests that perform actual checks of data with good fidelity. We need to make sure this optimization does not introduce bugs.
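The detection described above can be sketched in plain Python. All helper and parameter names here are hypothetical; the actual logic lives inside scipp's `make_binned` and operates on scipp objects rather than tuples and sets:

```python
def extend_erase_for_rebinning(data_dims, outer_coord_dims, edge_dims, erase):
    """Hypothetical sketch of the added detection: a dim referenced by
    new edges that already exists in the binned data, has an outer
    (bin-level) coordinate, and was cancelled out of the erase list is a
    re-binning dim. Adding it back to the erase set lets the dispatch
    route to the efficient combine_bins path."""
    extended = set(erase)
    for dim in edge_dims:
        if (
            dim in data_dims  # dim is already present in the data
            and dim not in erase  # cancelled out by the replaced-dims logic
            and dim in outer_coord_dims  # outer coord present for this dim
        ):
            extended.add(dim)
    return extended


# Re-binning two_theta on data with dims (two_theta, dspacing) and an
# outer two_theta coordinate: the erase set gains 'two_theta'.
extended = extend_erase_for_rebinning(
    data_dims=('two_theta', 'dspacing'),
    outer_coord_dims={'two_theta'},
    edge_dims=('two_theta',),
    erase=set(),
)
```

Binning along a genuinely new dimension leaves the erase set unchanged, so the existing event-level binning path is unaffected.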
When `hide_masked` drops rows for a 1-D mask, the coords passed to `_combine_bins` still had the original (unmasked) sizes, causing a `DimensionError`. Filter coords to match the masked data shape. Also transpose the result of the re-binning optimization to preserve the input's dimension order, since `combine_bins` puts unchanged dims first, which differs from the `_cpp.bin` behavior.
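A toy model of these two fixes, representing coords and data as plain dim-to-length mappings (hypothetical names and representation, not scipp's API):

```python
def filter_coords_to_data(coord_sizes, data_sizes):
    """Drop coords whose length along any dim no longer matches the
    data, e.g. coords that kept the original (unmasked) sizes after
    masked rows were removed."""
    return {
        name: sizes
        for name, sizes in coord_sizes.items()
        if all(data_sizes.get(dim) == n for dim, n in sizes.items())
    }


def restore_dim_order(result_dims, input_dims):
    """combine_bins puts unchanged dims first; reorder the result's
    dims back to the input's order, matching the _cpp.bin behavior."""
    return tuple(d for d in input_dims if d in result_dims)


# A coord with the pre-masking length 5 is dropped once the data
# only has 3 rows left along 'x':
coords = filter_coords_to_data({'a': {'x': 3}, 'b': {'x': 5}}, {'x': 3})
```

In scipp itself the reordering corresponds to a `transpose` on the result, which is what failed for multi-dimensional outer coords before the 1-D guard was added.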
Tested with the essdiffraction use case (scipp/essdiffraction#238).

**Isolated re-binning step (synthetic data)**

Ran the exact re-binning step:

```python
# Replicates focus_data_dspacing_and_two_theta + group_two_theta
focused = pixel_data.bin(two_theta=1024, dspacing=n_dsp)
focused.coords['two_theta'] = sc.midpoints(focused.coords['two_theta'])
result = focused.bin(two_theta=200)  # re-bin two_theta, dspacing unchanged
```

**Before (scipp 26.1.1):** O(N²) in the unchanged dimension.

**After (this PR, scipp 25.12.0.dev4):** Zero measurable peak memory overhead, linear time scaling.

**Full DREAM Sciline workflow**

Ran the full workflow with:

- Subsampled test data (10K events per run type)
- Full Geant4 simulated data (~2.5M / 1.9M / 0.9M events for sample / vanadium / empty can)

Both peak memory and time are flat across all cases.
Remove an `isinstance(x, DataArray)` check that is always true at that point, since all `Variable` cases are converted earlier. Prompt: Why did Ubuntu CI fail on the PR? / yes, fix it
…m coords

The re-binning detection in `make_binned` only iterated over edges, missing the identical optimization opportunity for groups (re-grouping along an outer-only dimension). Extended the detection to also consider groups.

Added a guard requiring the outer coordinate to be 1-D (`x.coords[dim].dims == (dim,)`). Without this, a multi-dimensional outer coord incorrectly triggers the optimization, causing a dimension mismatch in the transpose call. This affects both edges and groups, though the edge case hadn't been triggered by existing tests.

Prompt: Is the branch missing a similar optimization for groups? It only checks e in edge currently. Follow-up: Yes, start by writing tests for this case. Continue! Commit and push, update PR message if applicable. "it just hasn't been triggered because all the edge tests use 1-D outer coords" sounds like we are missing a test?

Co-Authored-By: Claude Opus 4.6 <[email protected]>
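The 1-D guard amounts to a single comparison on the coord's dims. A minimal sketch (hypothetical helper name; in scipp the check is written as `x.coords[dim].dims == (dim,)`):

```python
def outer_coord_is_1d(dim, coord_dims):
    """The re-binning/re-grouping optimization is only safe when the
    outer coord for `dim` depends on `dim` alone. A multi-dimensional
    outer coord must fall back to the generic path, since it would
    cause a dimension mismatch in the final transpose."""
    return coord_dims == (dim,)


outer_coord_is_1d('two_theta', ('two_theta',))  # 1-D coord: optimize
outer_coord_is_1d('two_theta', ('spectrum', 'two_theta'))  # 2-D coord: skip
```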
Summary
Fixes #3839
Fixes scipp/essdiffraction#238
Routes re-binning along existing outer dimensions to `combine_bins` instead of falling through to `_cpp.bin()`, which has O(N^2) peak memory and time in unchanged dimensions.

Motivation
Re-binning already-binned data along one dimension with `data.bin(dim=new_edges)` fell through to `_cpp.bin()` because `_find_replaced_dims` returns an empty erase list for this case (the re-binned dimension cancels out in `replaced - set(dims)`). With `erase=[]`, `_can_operate_on_bins` returns `False`, bypassing the efficient `combine_bins` path.

The fix detects dimensions where:

- the dimension is already present in the data but absent from the erase list, and
- the dimension has a 1-D outer coordinate (`x.coords[dim].dims == (dim,)`) rather than an event-level coordinate.
These conditions identify re-binning/re-grouping, and the erase set is extended so `combine_bins` can handle it in O(N).
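The cancellation that produced the empty erase list is plain set arithmetic. A toy model (what feeds `replaced` in scipp's `_find_replaced_dims` is an internal detail; only the set difference is modeled here):

```python
def model_erase(replaced, requested_dims):
    """The erase list is computed as replaced - set(dims). When
    re-binning an existing dim, that dim appears on both sides of the
    difference and cancels out, yielding an empty erase list even
    though the bin contents are being replaced."""
    return set(replaced) - set(requested_dims)


# Re-binning two_theta: the dim cancels, erase comes back empty and the
# dispatch used to fall through to the O(N^2) _cpp.bin path.
model_erase({'two_theta'}, ('two_theta',))

# A dim that is replaced but not re-requested survives into the erase
# list, as before.
model_erase({'spectrum'}, ('two_theta',))
```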