[SPARK-2236][SQL]SparkSQL add SkewJoin #1134
Conversation
Hi all, I want to submit a SkewJoin operator in Spark SQL's joins.scala. In some cases, data skew happens. SkewJoin samples the table RDD to find the largest key, then makes the rows with that key into their own table RDD. The streamed RDD is split into a mainStreamedTable RDD without the largest key and a maxKeyStreamedTable RDD with the largest key. Then join the two tables with the build table. Finally, union the two result RDDs.
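For illustration, a minimal sketch of that flow on plain RDDs (skewJoin, streamedRdd, and buildRdd are hypothetical names, not the operator this PR adds; it assumes Spark 1.3+, where the pair-RDD implicits resolve without extra imports):

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

def skewJoin[K: ClassTag, V: ClassTag, W: ClassTag](
    streamedRdd: RDD[(K, V)],
    buildRdd: RDD[(K, W)]): RDD[(K, (V, W))] = {
  // 1. Sample the streamed side to estimate the most frequent (skewed) key.
  //    Sketch only: assumes the 10% sample is non-empty.
  val maxKey = streamedRdd
    .sample(withReplacement = false, fraction = 0.1)
    .map { case (k, _) => (k, 1L) }
    .reduceByKey(_ + _)
    .reduce((a, b) => if (a._2 >= b._2) a else b)
    ._1
  // 2. Split the streamed side into the skewed-key rows and the rest.
  val maxKeyStreamed = streamedRdd.filter { case (k, _) => k == maxKey }
  val mainStreamed   = streamedRdd.filter { case (k, _) => k != maxKey }
  // 3. Join each part with the build side, then 4. union the results.
  maxKeyStreamed.join(buildRdd).union(mainStreamed.join(buildRdd))
}

Isolating the hot key keeps a single reduce task from dominating the shuffle; the skewed part could also be joined via broadcast instead of a shuffle.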
Can one of the admins verify this patch?
Do you mind reformatting the code to match the Spark coding style?
Thanks a lot. I will reformat the code to match the Spark coding style.
Reformat the annotation and var name format
Hi rxin, I reformatted it. Can you give me some suggestions? I will try to make it better. Thanks a lot.
This can be done like:

val (maxKeyStreamedTable, mainStreamedTable) = streamedTable.partition { row =>
  streamSideKeyGenerator(row).toString == maxrowKey.toString
}
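(A note on the suggestion: partition yields both tables in a single traversal of streamedTable, whereas two separate filter calls would scan the rows twice.)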
Skew join optimization will be very helpful, but how do we know which join keys are skewed? By the way, the code will not take effect if you don't put the strategy object into a given context (e.g. SQLContext, HiveContext).
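A minimal sketch of that registration, assuming Spark 1.3+ where SQLContext exposes experimental.extraStrategies; SkewJoinStrategy is a hypothetical stand-in for the PR's strategy object:

import org.apache.spark.sql.Strategy
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.SparkPlan

// Hypothetical strategy object: the real PR would pattern-match join nodes
// here and emit its SkewJoin physical operator.
object SkewJoinStrategy extends Strategy {
  override def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    case _ => Nil  // return Nil to fall through to the built-in strategies
  }
}

// Without a registration like this, the planner never considers the operator:
// sqlContext.experimental.extraStrategies = SkewJoinStrategy :: Nil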
Should use "{}" for temp += 1, placed on a new line.
Thanks a lot, Chenghao. This code is more like a demo; I think we could improve the sampling phase and use some strategy to judge which keys are skewed, for example an absolute rate or a relative rate. What are your suggestions?
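A sketch of that idea, where the helper name skewedKeys, the sampling fraction, and the relative-rate threshold are all illustrative assumptions:

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Flag keys whose estimated share of the rows exceeds a relative threshold.
// An absolute count cutoff could be plugged in the same way.
def skewedKeys[K: ClassTag](keys: RDD[K],
                            fraction: Double = 0.1,
                            relativeRate: Double = 0.2): Set[K] = {
  val sampled = keys.sample(withReplacement = false, fraction = fraction)
  val total   = sampled.count().toDouble  // sketch: assumes a non-empty sample
  sampled
    .map((_, 1L))
    .reduceByKey(_ + _)
    .filter { case (_, cnt) => cnt / total >= relativeRate }
    .keys
    .collect()
    .toSet
}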
Hi all, I updated 8 files like the pull request that adds the EXCEPT operator. But when I run the test, it executes the CartesianProduct case class operator instead. I think there are some mistakes in my code. Can you help me? Thanks a lot!
Hi all. I have resolved the conflict.
A key question for this PR is whether we want to model skew as a join type. My first inclination would be no, since it is a hint about how to execute the query for maximum performance, not something that changes the answer of the query.
I think there are major questions that will need to be answered before we could merge this PR:
Thanks, Michael.
Hi, I also made a left semi join. I don't know whether this join counts as an optimization, like the left semi join, or as a standalone join algorithm. I think PR #1127 also has some optimization work left to do. Do you think PR #1127 is worth merging? Thanks a lot.
Hi, I rewrote the code and resolved some of the earlier problems.
Hi @YanjieGao, as I said in #1127, this will be a great optimization to have after we figure out how to choose join algorithms based on statistics. I think we should close this issue for now and reopen it once we have a design for this. Thanks for working on it!
Hi marmbrus, I will close it. Best regards.