Skip to content

[Repo Assist] perf: optimise Frame.AggregateRowsBy to avoid per-group sub-frame construction#669

Merged
dsyme merged 2 commits intomasterfrom
repo-assist/fix-issue-269-aggregaterowsby-perf-ca8ad268bcd8d2b4
Mar 19, 2026
Merged

[Repo Assist] perf: optimise Frame.AggregateRowsBy to avoid per-group sub-frame construction#669
dsyme merged 2 commits intomasterfrom
repo-assist/fix-issue-269-aggregaterowsby-perf-ca8ad268bcd8d2b4

Conversation

@github-actions
Copy link
Copy Markdown
Contributor

🤖 This is an automated PR from Repo Assist, created in response to the maintainer's request on issue #269.

Closes #269

What and Why

Frame.aggregateRowsBy was slow at scale because the previous implementation called NestRowsBy, which built a full sub-Frame (materialising vector relocations for all columns) for every distinct group. The result was then assembled row-by-row via fromRows. For a 100 000-row frame with few distinct groups this was essentially O(n × numColumns) in vector work.

The approach in the issue comments (FoldColumnBy / AggregateColumnBy) pointed directly to the fix: do one pass to bucket rows by group key, then extract the aggregate data column-by-column.

How

  1. Single scan — builds a Dictionary(obj[], int) (using HashIdentity.Structural for element-wise key equality) that records the ordered groups and the row offsets belonging to each group. O(n).

  2. Per aggregate column — for each column in aggBy, the sub-series passed to aggFunc is built directly from the source column's values at those row offsets. No sub-frame construction; no vector relocation of unrelated columns.

  3. Assembly — each group's result is packed as an ObjectSeries<'C> (same shape the original fromRows path expected), so FrameUtils.fromRows still handles column-type inference exactly as before.

This replaces G × C vector relocations (G groups, C total columns) with G × A series constructions (A aggregate columns only).

Test Status

All 676 existing tests pass:

Passed! - Failed: 0, Passed: 676, Skipped: 0, Total: 676

The existing test Can aggregate rows by key pairs with missing item in pairs exercises multi-column grouping with missing values and continues to pass.

Trade-offs

  • The fromRows assembly at the end still iterates over rows; for very wide result frames with many aggregate columns there is still some overhead. A future optimisation could avoid fromRows entirely by constructing the output frame column-by-column (requiring typed column construction per column key).
  • Behaviour is otherwise identical to the previous implementation.

Generated by Repo Assist for issue #269 ·

To install this agentic workflow, run

gh aw add githubnext/agentics/workflows/repo-assist.md@30f2254f2a7a944da1224df45d181a3f8faefd0d

…frames

The previous implementation called NestRowsBy which materialised a full
sub-Frame (all columns, vector relocation) for every distinct group, and
then assembled the result row-by-row via fromRows. For a frame with many
rows and few distinct groups this became O(n * numColumns) in vector work.

The new implementation:
1. Extracts the group-by column values in a single pass using a
   Dictionary<obj[], int> with structural equality (HashIdentity.Structural),
   recording the ordered groups and the row offsets belonging to each group.
2. For each aggregate column, builds a sub-series directly from the source
   column's values at those offsets — no sub-frame construction needed.
3. Assembles the result rows as ObjectSeries (one per group) and hands them
   to FrameUtils.fromRows exactly as before, preserving the existing type-
   inference behaviour for result column types.

This eliminates the NestRowsBy bottleneck (which created G * C vector
relocations for G groups and C columns) and replaces it with a single O(n)
scan plus G * A series constructions for A aggregate columns.

Fixes #269

Co-authored-by: Copilot <[email protected]>
@github-actions github-actions Bot mentioned this pull request Mar 19, 2026
@dsyme dsyme marked this pull request as ready for review March 19, 2026 10:22
@dsyme dsyme merged commit 0c1305c into master Mar 19, 2026
2 checks passed
@dsyme dsyme deleted the repo-assist/fix-issue-269-aggregaterowsby-perf-ca8ad268bcd8d2b4 branch March 19, 2026 10:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Slow aggregations

1 participant