…frames

The previous implementation called NestRowsBy, which materialised a full sub-Frame (all columns, vector relocation) for every distinct group, and then assembled the result row-by-row via fromRows. For a frame with many rows and few distinct groups this became O(n * numColumns) in vector work.

The new implementation:

1. Extracts the group-by column values in a single pass using a Dictionary<obj[], int> with structural equality (HashIdentity.Structural), recording the ordered groups and the row offsets belonging to each group.
2. For each aggregate column, builds a sub-series directly from the source column's values at those offsets; no sub-frame construction is needed.
3. Assembles the result rows as ObjectSeries (one per group) and hands them to FrameUtils.fromRows exactly as before, preserving the existing type-inference behaviour for result column types.

This eliminates the NestRowsBy bottleneck (which created G * C vector relocations for G groups and C columns) and replaces it with a single O(n) scan plus G * A series constructions for A aggregate columns.

Fixes #269

Co-authored-by: Copilot <[email protected]>
Closed
🤖 This is an automated PR from Repo Assist, created in response to the maintainer's request on issue #269.
Closes #269
What and Why
Frame.aggregateRowsBy was slow at scale because the previous implementation called NestRowsBy, which built a full sub-Frame (materialising vector relocations for all columns) for every distinct group. The result was then assembled row-by-row via fromRows. For a 100 000-row frame with few distinct groups this was essentially O(n × numColumns) in vector work.

The approach in the issue comments (FoldColumnBy/AggregateColumnBy) pointed directly to the fix: do one pass to bucket rows by group key, then extract the aggregate data column-by-column.

How

1. Single scan: builds a Dictionary<obj[], int> (using HashIdentity.Structural for element-wise key equality) that records the ordered groups and the row offsets belonging to each group. O(n).
2. Per aggregate column: for each column in aggBy, the sub-series passed to aggFunc is built directly from the source column's values at those row offsets. No sub-frame construction; no vector relocation of unrelated columns.
3. Assembly: each group's result is packed as an ObjectSeries<'C> (same shape the original fromRows path expected), so FrameUtils.fromRows still handles column-type inference exactly as before.

This replaces G × C vector relocations (G groups, C total columns) with G × A series constructions (A aggregate columns only).
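The three steps above can be sketched in a language-neutral way. The Python below is only an illustration of the bucketing idea, not the F# code from this PR: the function name `aggregate_rows_by`, the dict-of-lists frame representation, and the sample data are all assumptions made for the example. Tuple keys stand in for the structural equality that HashIdentity.Structural provides for obj[] keys.

```python
def aggregate_rows_by(columns, group_by, agg_by, agg_func):
    """Hypothetical sketch. `columns` maps column name -> list of values."""
    n = len(next(iter(columns.values()))) if columns else 0

    # Step 1: single O(n) scan. Tuples compare element-wise, so they play
    # the role of structurally-compared obj[] keys.
    group_index = {}  # key tuple -> position in `groups`
    groups = []       # (key tuple, row offsets), in first-seen order
    for row in range(n):
        key = tuple(columns[c][row] for c in group_by)
        idx = group_index.get(key)
        if idx is None:
            idx = len(groups)
            group_index[key] = idx
            groups.append((key, []))
        groups[idx][1].append(row)

    # Steps 2 and 3: for each group, read only the aggregate columns at the
    # recorded offsets; columns outside group_by/agg_by are never touched.
    result_rows = []
    for key, offsets in groups:
        out = dict(zip(group_by, key))
        for c in agg_by:
            source = columns[c]
            out[c] = agg_func([source[i] for i in offsets])
        result_rows.append(out)
    return result_rows


# Made-up sample frame: "notes" is an unrelated column that is never read.
data = {
    "city": ["A", "B", "A", "B", "A"],
    "year": [2020, 2020, 2020, 2021, 2020],
    "sales": [1.0, 2.0, 3.0, 4.0, 5.0],
    "notes": ["x", "y", "z", "w", "v"],
}
result = aggregate_rows_by(data, ["city", "year"], ["sales"], sum)
```

With G groups and A aggregate columns, the inner loop builds G × A small value lists, which mirrors the G × A series constructions described above.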
Test Status
All 676 existing tests pass:
The existing test "Can aggregate rows by key pairs with missing item in pairs" exercises multi-column grouping with missing values and continues to pass.

Trade-offs

The fromRows assembly at the end still iterates over rows; for very wide result frames with many aggregate columns there is still some overhead. A future optimisation could avoid fromRows entirely by constructing the output frame column-by-column (requiring typed column construction per column key).
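As a rough illustration of what column-by-column construction could look like, the following Python sketch builds each output column as one homogeneous list instead of assembling per-group rows. It is not part of this PR and does not use Deedle; `assemble_columnwise`, its parameters, and the sample data are hypothetical.

```python
def assemble_columnwise(groups, columns, group_by, agg_by, agg_func):
    """Hypothetical sketch. `groups` is a list of (key tuple, row offsets)."""
    out = {}
    # Key columns: one value per group, taken from the key tuple.
    for pos, name in enumerate(group_by):
        out[name] = [key[pos] for key, _ in groups]
    # Aggregate columns: one aggregated value per group. Each output list
    # holds a single column's values, so its type could be decided once
    # per column rather than inferred from rows.
    for name in agg_by:
        source = columns[name]
        out[name] = [agg_func([source[i] for i in offsets]) for _, offsets in groups]
    return out


# Made-up input: two groups over a three-row frame.
groups = [(("A",), [0, 2]), (("B",), [1])]
columns = {"city": ["A", "B", "A"], "sales": [1.0, 2.0, 3.0]}
table = assemble_columnwise(groups, columns, ["city"], ["sales"], sum)
```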