Skip to content

Provide groupByKey shortcuts for groupBy.as#213

Merged
EnricoMi merged 3 commits intomasterfrom
groupbykey
Dec 9, 2023
Merged

Provide groupByKey shortcuts for groupBy.as#213
EnricoMi merged 3 commits intomasterfrom
groupbykey

Conversation

@EnricoMi
Copy link
Copy Markdown
Contributor

@EnricoMi EnricoMi commented Dec 8, 2023

This provides shortcuts for groupBy(...).as[...] that make it easier to use column-based groupByKey.

Calling Dataset.groupBy(...).as[K, T] should be preferred over calling Dataset.groupByKey(...) whenever possible. The former allows Catalyst to exploit existing partitioning and ordering of the Dataset, while the latter hides from Catalyst which columns are used to create the keys.

When the dataset is already partitioned and ordered by the grouping columns, Dataset.groupByKey(...) will repartition and order the entire dataset again.

Example:

Calling ds.groupByKey(_.id) hides from Catalyst that column id is the grouping key, while ds.groupBy($"id").as[Int, V] tells Catalyst that ds is to be grouped by (partitioned and ordered by) column id.

The new column-based groupByKey methods make it easier for users to find a way to express the grouping by expressions. Looking at the Dataset API, the user finds groupByKey with Column. The existing groupBy method returns a RelationalGroupedDataset, which provides the as[K, V] method, which allows for the same semantics, but is difficult to find.

The new column-based groupByKey methods further do not require the user to specify the type V of the original Dataset[V], as groupByKey has access to the type / encoder:

ds.groupBy($"id").as[Int, V]

vs.

ds.groupByKey[Int]($"id")

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Dec 8, 2023

Test Results

     566 files  ±  0       566 suites  ±0   1h 29m 12s ⏱️ +11s
     536 tests +  2       536 ✔️ +  2  0 💤 ±0  0 ±0 
16 828 runs  +72  16 826 ✔️ +72  2 💤 ±0  0 ±0 

Results for commit 411afe8. ± Comparison against base commit 8314de4.

This pull request removes 28 and adds 30 tests. Note that renamed tests count towards both.
uk.co.gresearch.spark.GroupBySortedSuite ‑ df.groupByKeySorted should flatMapSortedGroups
uk.co.gresearch.spark.GroupBySortedSuite ‑ df.groupByKeySorted should flatMapSortedGroups reverse
uk.co.gresearch.spark.GroupBySortedSuite ‑ df.groupByKeySorted should flatMapSortedGroups with partition num
uk.co.gresearch.spark.GroupBySortedSuite ‑ df.groupByKeySorted should flatMapSortedGroups with partition num and reverse
uk.co.gresearch.spark.GroupBySortedSuite ‑ df.groupByKeySorted should flatMapSortedGroups with state
uk.co.gresearch.spark.GroupBySortedSuite ‑ df.groupByKeySorted should flatMapSortedGroups with tuple key
uk.co.gresearch.spark.GroupBySortedSuite ‑ df.groupByKeySorted should flatMapSortedGroups with tuple key and state
uk.co.gresearch.spark.GroupBySortedSuite ‑ df.groupBySorted should flatMapSortedGroups
uk.co.gresearch.spark.GroupBySortedSuite ‑ df.groupBySorted should flatMapSortedGroups reverse
uk.co.gresearch.spark.GroupBySortedSuite ‑ df.groupBySorted should flatMapSortedGroups with partition num
…
uk.co.gresearch.spark.GroupBySuite ‑ df.groupByKeySorted should flatMapSortedGroups
uk.co.gresearch.spark.GroupBySuite ‑ df.groupByKeySorted should flatMapSortedGroups reverse
uk.co.gresearch.spark.GroupBySuite ‑ df.groupByKeySorted should flatMapSortedGroups with partition num
uk.co.gresearch.spark.GroupBySuite ‑ df.groupByKeySorted should flatMapSortedGroups with partition num and reverse
uk.co.gresearch.spark.GroupBySuite ‑ df.groupByKeySorted should flatMapSortedGroups with state
uk.co.gresearch.spark.GroupBySuite ‑ df.groupByKeySorted should flatMapSortedGroups with tuple key
uk.co.gresearch.spark.GroupBySuite ‑ df.groupByKeySorted should flatMapSortedGroups with tuple key and state
uk.co.gresearch.spark.GroupBySuite ‑ df.groupBySorted should flatMapSortedGroups
uk.co.gresearch.spark.GroupBySuite ‑ df.groupBySorted should flatMapSortedGroups reverse
uk.co.gresearch.spark.GroupBySuite ‑ df.groupBySorted should flatMapSortedGroups with partition num
…

♻️ This comment has been updated with latest results.

@EnricoMi EnricoMi merged commit 119c854 into master Dec 9, 2023
@EnricoMi EnricoMi deleted the groupbykey branch December 9, 2023 21:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant