Provide groupByKey shortcuts for groupBy.as by EnricoMi · Pull Request #213 · G-Research/spark-extension

EnricoMi · 2023-12-08T15:13:40Z

This provides shortcuts for groupBy(...).as[...] that make it easier to use column-based groupByKey.

Calling Dataset.groupBy(...).as[K, T] should be preferred over calling Dataset.groupByKey(...) whenever possible. The former allows Catalyst to exploit existing partitioning and ordering of the Dataset, while the latter hides from Catalyst which columns are used to create the keys.

Even when the dataset is already partitioned and ordered by the grouping columns, Dataset.groupByKey(...) repartitions and sorts the entire dataset again, because Catalyst cannot see that the existing layout already matches the keys.

Example:

Calling ds.groupByKey(_.id) hides from Catalyst that column id is the grouping key, while ds.groupBy($"id").as[Int, V] tells Catalyst that ds is to be grouped by (partitioned and ordered by) column id.

The new column-based groupByKey methods make grouping by column expressions easier to discover: a user browsing the Dataset API finds a groupByKey that accepts Column arguments. The existing route offers the same semantics but is hard to find, because groupBy returns a RelationalGroupedDataset and the as[K, V] method is defined there.

The new column-based groupByKey methods also do not require the user to specify the type V of the original Dataset[V], because groupByKey has access to that type and its encoder:

ds.groupBy($"id").as[Int, V]

vs.

ds.groupByKey[Int]($"id")
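The difference between the two existing Spark calls can be sketched as follows. This is a minimal illustration using only the standard Spark API (Spark 3.0 or later is assumed for RelationalGroupedDataset.as[K, T]); the case class Value, the object name and the sample rows are hypothetical and not taken from the pull request. The physical plans printed by explain() depend on the Spark version and on how the input is laid out, so no output is shown.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type, used only for this illustration.
case class Value(id: Int, label: String)

object GroupByAsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("groupBy-as-sketch")
      .getOrCreate()
    import spark.implicits._

    val ds = Seq(Value(1, "a"), Value(1, "b"), Value(2, "c")).toDS()

    // Function-based key: Catalyst only sees an opaque lambda.
    val byFunction = ds.groupByKey(_.id)

    // Column-based key: Catalyst knows the grouping column is `id`.
    val byColumn = ds.groupBy($"id").as[Int, Value]

    // Both are KeyValueGroupedDataset[Int, Value], so downstream code is identical.
    val countRows = (id: Int, rows: Iterator[Value]) => (id, rows.size)

    // Print the physical plans to compare how each variant is executed.
    byFunction.mapGroups(countRows).explain()
    byColumn.mapGroups(countRows).explain()

    spark.stop()
  }
}
```

According to the description above, the shortcut added by this pull request would replace the byColumn line with ds.groupByKey[Int]($"id"), presumably after importing the spark-extension package; that import is an assumption here and is not shown in the pull request text.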

github-actions · 2023-12-08T16:12:41Z

Test Results

    566 files ±  0     566 suites ±0 1h 29m 12s ⏱️ +11s
    536 tests +  2     536 ✔️ +  2 0 💤 ±0 0 ❌ ±0
16 828 runs +72 16 826 ✔️ +72 2 💤 ±0 0 ❌ ±0

Results for commit 411afe8. ± Comparison against base commit 8314de4.

This pull request removes 28 and adds 30 tests. Note that renamed tests count towards both.

uk.co.gresearch.spark.GroupBySortedSuite ‑ df.groupByKeySorted should flatMapSortedGroups
uk.co.gresearch.spark.GroupBySortedSuite ‑ df.groupByKeySorted should flatMapSortedGroups reverse
uk.co.gresearch.spark.GroupBySortedSuite ‑ df.groupByKeySorted should flatMapSortedGroups with partition num
uk.co.gresearch.spark.GroupBySortedSuite ‑ df.groupByKeySorted should flatMapSortedGroups with partition num and reverse
uk.co.gresearch.spark.GroupBySortedSuite ‑ df.groupByKeySorted should flatMapSortedGroups with state
uk.co.gresearch.spark.GroupBySortedSuite ‑ df.groupByKeySorted should flatMapSortedGroups with tuple key
uk.co.gresearch.spark.GroupBySortedSuite ‑ df.groupByKeySorted should flatMapSortedGroups with tuple key and state
uk.co.gresearch.spark.GroupBySortedSuite ‑ df.groupBySorted should flatMapSortedGroups
uk.co.gresearch.spark.GroupBySortedSuite ‑ df.groupBySorted should flatMapSortedGroups reverse
uk.co.gresearch.spark.GroupBySortedSuite ‑ df.groupBySorted should flatMapSortedGroups with partition num
…

uk.co.gresearch.spark.GroupBySuite ‑ df.groupByKeySorted should flatMapSortedGroups
uk.co.gresearch.spark.GroupBySuite ‑ df.groupByKeySorted should flatMapSortedGroups reverse
uk.co.gresearch.spark.GroupBySuite ‑ df.groupByKeySorted should flatMapSortedGroups with partition num
uk.co.gresearch.spark.GroupBySuite ‑ df.groupByKeySorted should flatMapSortedGroups with partition num and reverse
uk.co.gresearch.spark.GroupBySuite ‑ df.groupByKeySorted should flatMapSortedGroups with state
uk.co.gresearch.spark.GroupBySuite ‑ df.groupByKeySorted should flatMapSortedGroups with tuple key
uk.co.gresearch.spark.GroupBySuite ‑ df.groupByKeySorted should flatMapSortedGroups with tuple key and state
uk.co.gresearch.spark.GroupBySuite ‑ df.groupBySorted should flatMapSortedGroups
uk.co.gresearch.spark.GroupBySuite ‑ df.groupBySorted should flatMapSortedGroups reverse
uk.co.gresearch.spark.GroupBySuite ‑ df.groupBySorted should flatMapSortedGroups with partition num
…

♻️ This comment has been updated with latest results.

Provide groupByKey shortcuts for groupBy.as

1668470

EnricoMi force-pushed the groupbykey branch from 41c5c69 to 1668470 on December 8, 2023 18:44

EnricoMi added 2 commits December 9, 2023 17:16

Add to README.md

d9df9fc

Improve wording in README.md

411afe8

EnricoMi merged commit 119c854 into master Dec 9, 2023

EnricoMi deleted the groupbykey branch December 9, 2023 21:51

EnricoMi merged 3 commits into master from groupbykey

EnricoMi commented Dec 8, 2023

Uh oh!

github-actions Bot commented Dec 8, 2023 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

EnricoMi commented Dec 8, 2023

Uh oh!

github-actions Bot commented Dec 8, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Test Results

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

github-actions Bot commented Dec 8, 2023 •

edited

Loading