
Conversation

@YanjieGao
Contributor

Hi all,
I want to submit a SkewJoin operator in SparkSql joins.scala.

In some cases, data skew happens. SkewJoin samples the table RDD to find the largest key, then puts the rows with the largest key into their own table RDD. The streamed RDD is split into a mainStreamedTable RDD without the largest key and a maxKeyStreamedTable RDD with only the largest key.
Then, each of the two tables is joined with the build table.
Finally, the two result RDDs are unioned.

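The flow described above can be sketched in plain Scala, with Seq standing in for an RDD; SkewJoinSketch, the Row alias, and the inner join helper are all illustrative names, not the PR's actual implementation:

```scala
object SkewJoinSketch {
  // Hypothetical key-value rows; Seq stands in for an RDD.
  type Row = (String, Int)

  def skewJoin(streamed: Seq[Row], build: Seq[Row]): Seq[(String, (Int, Int))] = {
    // 1. Find the most frequent (largest) key on the streamed side.
    //    (The real operator would sample rather than scan everything.)
    val maxKey = streamed.groupBy(_._1).maxBy(_._2.size)._1
    // 2. Split the streamed side into the skewed rows and the rest.
    val (maxKeyTable, mainTable) = streamed.partition(_._1 == maxKey)
    // 3. Join each part with the build side (naive nested-loop join).
    def join(left: Seq[Row]): Seq[(String, (Int, Int))] =
      for ((k, v) <- left; (k2, v2) <- build if k == k2) yield (k, (v, v2))
    // 4. Union the two results.
    join(mainTable) ++ join(maxKeyTable)
  }
}
```

In the real operator, splitting off the skewed rows lets them be handled with a different physical strategy (e.g. broadcast) while the rest goes through a normal join.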
@AmplabJenkins

Can one of the admins verify this patch?

@rxin
Contributor

rxin commented Jun 20, 2014

@YanjieGao
Contributor Author

Thanks a lot. I will reformat the code to match the Spark coding style.

Reformat the annotation and var name format
@YanjieGao
Contributor Author

Hi rxin, I reformatted it. Can you give me some suggestions? I will try to make it better. Thanks a lot.

@YanjieGao YanjieGao changed the title SparkSQL add SkewJoin [SPARK-2236][SQL]SparkSQL add SkewJoin Jun 22, 2014
Contributor


This can be done like:

val (maxKeyStreamedTable, mainStreamedTable) = streamedTable.partition { row =>
  streamSideKeyGenerator(row).toString == maxrowKey.toString
}
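As a minimal self-contained illustration of the suggestion above, partition splits a collection into matching and non-matching elements in a single pass, avoiding two separate filter scans (the values here are made up for the demo):

```scala
// partition returns (elements matching the predicate, elements that do not),
// preserving the original order within each half.
val rows = Seq("max", "x", "max", "y")
val (maxKeyStreamedTable, mainStreamedTable) = rows.partition(_ == "max")
```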

@chenghao-intel
Contributor

Skew join optimization will be very helpful, but how do we know which keys are the skew join keys? By the way, the code will not take effect unless you put the strategy object into a given context (e.g. SQLContext, HiveContext).

temp += 1 should be wrapped in "{}" and placed on a new line.
@YanjieGao
Contributor Author

Thanks a lot, Chenghao. This code is like a demo. I think we could improve the sample phase and use some strategy to judge which key set is skewed, for example by an absolute or relative frequency threshold. What are your suggestions?
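The relative-rate idea above could be sketched as follows; findSkewKeys, its parameters, and the threshold semantics are all hypothetical, and a Seq of keys stands in for the sampled RDD:

```scala
import scala.util.Random

// Hypothetical sketch: sample the keys, count frequencies, and flag any key
// whose share of the sample exceeds a relative threshold as a skew key.
def findSkewKeys[K](keys: Seq[K], fraction: Double, threshold: Double,
                    seed: Long = 42L): Set[K] = {
  val rng = new Random(seed)
  // Bernoulli sampling, analogous to RDD.sample(withReplacement = false, fraction).
  val sample = keys.filter(_ => rng.nextDouble() < fraction)
  if (sample.isEmpty) Set.empty
  else {
    val total = sample.size.toDouble
    sample.groupBy(identity)
      .collect { case (k, occ) if occ.size / total > threshold => k }
      .toSet
  }
}
```

An absolute threshold (flag keys with more than N sampled rows, as in Hive's skew_key_threshold) would replace the occ.size / total comparison with occ.size > n.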

@YanjieGao
Contributor Author

Hi all, I updated 8 files like the pull request that adds the EXCEPT operator. But when I ran the test, it executed the CartesianProduct case class operator instead. I think there are some mistakes in my code. Can you help me? Thanks a lot!

@YanjieGao
Contributor Author

Hi all, I have resolved the conflict.

Contributor


A key question for this PR is whether we want to model skew as a join type. My first inclination would be no, since it is a hint about how to execute the query for maximum performance and not something that changes the answer of the query.

@marmbrus
Contributor

marmbrus commented Jul 8, 2014

I think there are major questions that will need to be answered before we could merge this PR:

  • Is skew just a hint instead of a join type and how do we propagate that information through?
  • @chenghao-intel asks a valid question about join keys. I'm not sure how this could work without them.
  • I think the current implementation of execute() is going to suffer from serious performance issues. It does many passes over the data, does a lot of unnecessary string manipulation and computes several Cartesian products. You will need to run some performance experiments with large datasets in order to show that this operator actually has benefits.

@YanjieGao
Contributor Author

Thanks Michael,
(1) We could make it a user hint, like Hive does:
set hive.optimize.skewjoin = true;
set hive.skewjoin.key = skew_key_threshold (default = 100000)
We could use:
set sparksql.optimize.skewjoin = true
set sparksql.skewjoin.key = skew_key_threshold
(2) We could use sampling to find the relative frequency of each key, and judge which keys exceed the user-set skew_key_threshold.
(3) toString will generate many temporary objects. I will optimize the code in the next step.
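The proposed settings could be read from a simple config map; this is only a sketch mirroring Hive's skew-join options, and SkewJoinConf and the sparksql.* key names are illustrative, not an actual Spark SQL API:

```scala
// Hypothetical configuration holder for the proposed skew-join hint.
case class SkewJoinConf(settings: Map[String, String]) {
  // Whether the skew-join optimization is enabled (default: off).
  def skewJoinEnabled: Boolean =
    settings.getOrElse("sparksql.optimize.skewjoin", "false").toBoolean
  // Row-count threshold above which a key is treated as skewed
  // (default mirrors Hive's 100000).
  def skewKeyThreshold: Long =
    settings.getOrElse("sparksql.skewjoin.key", "100000").toLong
}
```

The planner would then consult these values when deciding whether to substitute the skew-join physical operator for a regular join.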

@YanjieGao
Contributor Author

Hi, I also made a left semi join. I don't know whether this join counts as an optimization of the left semi join or as a standalone join algorithm. I think PR #1127 also has some optimization work left to do. Do you think #1127 is worth merging? Thanks a lot.
#1127

@YanjieGao
Contributor Author

Hi, I rewrote the code and resolved some of the earlier problems.

@marmbrus
Contributor

marmbrus commented Sep 3, 2014

Hi @YanjieGao, as I said in #1127, this will be a great optimization to have after we figure out how to choose join algorithms based on statistics. I think we should close this issue for now and reopen it once we have a design for this.

Thanks for working on it!

@YanjieGao
Contributor Author

Hi marmbrus, I will close it. Best regards.

@YanjieGao YanjieGao closed this Sep 3, 2014
udaynpusa pushed a commit to mapr/spark that referenced this pull request Jan 30, 2024
1. Do not hardcode bc-fips jar version, use '*' wildcard instead
2. Add FIPS-specific options to keystore verification command
mapr-devops pushed a commit to mapr/spark that referenced this pull request May 8, 2025
1. Do not hardcode bc-fips jar version, use '*' wildcard instead
2. Add FIPS-specific options to keystore verification command
