[Repo Assist] Add Frame.distinctRowsBy to remove duplicate rows by column values #596

Merged
dsyme merged 8 commits into master from repo-assist/improve-frame-distinctrowsby-38e1a5bf1c7b3f41
Mar 12, 2026

@github-actions github-actions Bot commented Mar 9, 2026

🤖 This PR was created by Repo Assist, an automated AI assistant.

Summary

Adds Frame.distinctRowsBy, a new frame-transformation function that retains only the first row (by index order) for each unique combination of values in the specified columns.

This is analogous to SQL SELECT DISTINCT col1, col2 FROM table, except that every column of the first matching row is kept, not just the key columns. The pattern comes up regularly (see #558).

Example

// Given a frame with duplicate rows:
//   A  B   C
// 0 x  1.0 10
// 1 y  2.0 20
// 2 x  1.0 30   ← duplicate of row 0 on (A, B)
// 3 y  2.0 40   ← duplicate of row 1 on (A, B)

df |> Frame.distinctRowsBy ["A"; "B"]
//   A  B   C
// 0 x  1.0 10
// 1 y  2.0 20

C#:

df.DistinctRowsBy("A", "B")

Implementation

distinctRowsBy is built on top of the existing filterRows primitive. For each row, it computes a key as an obj list of the requested column values. F# lists implement structural equality and a matching GetHashCode(), so a HashSet<obj list> correctly deduplicates any combination of standard .NET types (strings, ints, floats, DateTime, etc.).
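The keying idea can be sketched in plain F# without Deedle. This is a minimal illustration under stated assumptions, not the code from this PR: the tuple list `rows` is a hypothetical stand-in for the frame in the example above, and `List.filter` stands in for `filterRows`.

```fsharp
open System.Collections.Generic

// Hypothetical stand-in for frame rows: (A, B, C), in index order.
let rows =
    [ ("x", 1.0, 10)
      ("y", 2.0, 20)
      ("x", 1.0, 30)   // duplicate of the first row on (A, B)
      ("y", 2.0, 40) ] // duplicate of the second row on (A, B)

// F# lists compare and hash structurally, so two keys with equal
// boxed elements count as the same entry in the set.
let seen = HashSet<obj list>()

// HashSet.Add returns true only the first time a key is inserted,
// so the filter keeps the first row for each (A, B) combination.
let distinct =
    rows |> List.filter (fun (a, b, _) -> seen.Add [ box a; box b ])

printfn "%A" distinct
```

Because the key is a list, the same approach works for any number of key columns without a separate overload per arity.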

A C#-friendly [<Extension>] DistinctRowsBy(frame, [<ParamArray>] columns) overload is added to FrameExtensions.

Test Status

  • src/Deedle/Deedle.fsproj builds without errors
  • tests/Deedle.Tests/Deedle.Tests.fsproj compiles without errors
  • Unit tests cannot be executed locally (requires .NET 5, environment has .NET 8+); three new tests cover partial deduplication, no-op, and full deduplication. Please verify via CI.

Trade-offs

  • Uses an obj list key rather than a typed tuple. This keeps the implementation simple and avoids arity limits, at the cost of boxing value types per row per call. For large frames with many key columns, a custom IEqualityComparer could improve performance, but this is sufficient for typical use.
  • Missing values are treated as null in the key (two rows with missing values in the same columns are considered equal).
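The missing-value behaviour follows from how structural equality treats null elements. The sketch below is an assumption-laden illustration rather than the PR's code: `keyPart` is a hypothetical helper, and `float option` stands in for a possibly missing cell.

```fsharp
open System.Collections.Generic

// Hypothetical helper: turn an optional cell into a key component,
// mapping a missing value to null.
let keyPart (cell: float option) : obj =
    match cell with
    | Some v -> box v
    | None -> null

let seen = HashSet<obj list>()

// Two keys that are missing in the same position are structurally equal,
// so the second Add is rejected as a duplicate.
let firstAdded = seen.Add [ box "x"; keyPart None ]
let secondAdded = seen.Add [ box "x"; keyPart None ]

// A present value in that position produces a different key.
let thirdAdded = seen.Add [ box "x"; keyPart (Some 1.0) ]

printfn "%b %b %b" firstAdded secondAdded thirdAdded
```

Under these assumptions, only the first of two rows sharing a missing key value would survive deduplication.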

Generated by Repo Assist ·

To install this agentic workflow, run

gh aw add githubnext/agentics/workflows/repo-assist.md@30f2254f2a7a944da1224df45d181a3f8faefd0d

Adds a new Frame.distinctRowsBy function that retains only the first row
for each unique combination of values in the specified columns. This is
analogous to SQL 'SELECT DISTINCT col1, col2' — a commonly requested
feature (see issue #558).

Also adds a C#-friendly DistinctRowsBy extension method that accepts
a params array of column keys.

Includes three unit tests for: partial deduplication, no-op on
already-distinct data, and full deduplication.

Co-authored-by: Copilot <[email protected]>

dsyme commented Mar 9, 2026

/repo-assist update the added xmldoc comment to be proper xmldoc style

Use <summary>, <param>, and <category> tags matching the
existing xmldoc conventions in FrameExtensions.fs.

Co-authored-by: Copilot <[email protected]>

github-actions Bot commented Mar 9, 2026

Commit pushed: 1184e56

Generated by Repo Assist

@dsyme dsyme marked this pull request as ready for review March 12, 2026 02:16
@dsyme dsyme merged commit 9696a61 into master Mar 12, 2026
2 checks passed
@dsyme dsyme deleted the repo-assist/improve-frame-distinctrowsby-38e1a5bf1c7b3f41 branch March 12, 2026 02:16