Refactor Spark Connector into two modules for reusability #10321
Conversation
org.apache.spark.sql.sources.v2.DataSourceOptions, which Spark uses to carry user-specified options, was removed in Spark3 in favor of a new and more capable container, CaseInsensitiveStringMap, which is compatible with DataSourceOptions. I copied this class from the Spark codebase into our 'pinot-spark-common' package as a drop-in replacement for the former and reworked PinotDataSourceReadOptions accordingly. Now the options class can be reused by both implementations (v2 and v3) consistently.
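For context, the core behavior being backported is simple: option keys are normalized so lookups ignore case. A minimal sketch of that idea (not the actual backported class, which mirrors Spark's full java.util.Map-based implementation):

import java.util.Locale
import scala.collection.JavaConverters._

// Sketch only: normalizes keys to lower case so option lookups ignore casing.
class CaseInsensitiveStringMap(original: java.util.Map[String, String]) {
  private val delegate: Map[String, String] =
    original.asScala.map { case (k, v) => k.toLowerCase(Locale.ROOT) -> v }.toMap

  def get(key: String): String =
    delegate.getOrElse(key.toLowerCase(Locale.ROOT), null)

  def getOrDefault(key: String, default: String): String =
    delegate.getOrElse(key.toLowerCase(Locale.ROOT), default)

  def containsKey(key: String): Boolean =
    delegate.contains(key.toLowerCase(Locale.ROOT))
}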
How about making this an interface and having v2/v3 wrapper implementations?
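For illustration, the suggested alternative might look roughly like this; every name below is hypothetical and none of them appear in the PR:

// Hypothetical sketch of the interface-plus-wrappers idea.
trait ReadOptions {
  def get(key: String): Option[String]
}

// Wraps a Spark2-style java.util.Map of options
class V2ReadOptions(opts: java.util.Map[String, String]) extends ReadOptions {
  override def get(key: String): Option[String] = Option(opts.get(key))
}

// Wraps the Spark3 CaseInsensitiveStringMap
class V3ReadOptions(opts: org.apache.spark.sql.util.CaseInsensitiveStringMap) extends ReadOptions {
  override def get(key: String): Option[String] = Option(opts.get(key))
}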
Thanks for the review, @xiangfu0!
I think this class enables the simplest interface we can expose to the implementor: Map[String, String].
If you take a look at PinotDataSourceReadOptions below, it has two factory methods with the following signatures:
object PinotDataSourceReadOptions {
...
private[pinot] def from(optionsMap: util.Map[String, String]): PinotDataSourceReadOptions
...
private[pinot] def from(options: CaseInsensitiveStringMap): PinotDataSourceReadOptions
...
}
With this, the shared PinotDataSourceReadOptions object can be created either by passing a CaseInsensitiveStringMap or a plain old Map, which is internally converted to a CaseInsensitiveStringMap. I guess we could even drop the second method and only accept Map[String, String] for simplicity.
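For illustration, the plain-Map overload can simply wrap its input and delegate, so both entry points share one parsing path. A sketch of that conversion, not the PR's exact code (package path assumed, field parsing elided, and the backported CaseInsensitiveStringMap assumed to live in the same package):

package org.apache.pinot.connector.spark.common // package path assumed for illustration

import java.util

class PinotDataSourceReadOptions /* parsed fields elided in this sketch */

object PinotDataSourceReadOptions {
  // The plain-Map overload wraps its input and delegates to the
  // CaseInsensitiveStringMap overload.
  private[pinot] def from(optionsMap: util.Map[String, String]): PinotDataSourceReadOptions =
    from(new CaseInsensitiveStringMap(optionsMap))

  private[pinot] def from(options: CaseInsensitiveStringMap): PinotDataSourceReadOptions = {
    // ... read individual option keys from `options` here ...
    new PinotDataSourceReadOptions
  }
}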
Let me know if you meant something else.
Codecov Report
@@ Coverage Diff @@
## master #10321 +/- ##
============================================
- Coverage 70.42% 70.08% -0.34%
- Complexity 5103 5874 +771
============================================
Files 2017 2040 +23
Lines 109181 110308 +1127
Branches 16602 16740 +138
============================================
+ Hits 76887 77313 +426
- Misses 26901 27550 +649
- Partials 5393 5445 +52
xiangfu0 left a comment:
LGTM
KKcorps left a comment:
LGTM! I was honestly worried that some new dependency might break existing Spark pipelines, but so far everything introduced is under provided scope. No issues here!
@cbalci Should I go ahead and merge this?
Thanks for the review, @KKcorps! Yes, please go ahead. I have a couple of follow-up PRs waiting for this.
This is the first of two PRs that will add Spark3 support for 'pinot-spark-connector'.
Background
Apache Spark changed the Datasource interface significantly between Spark2 and Spark3, so pinot-spark-connector doesn't work with Spark3. We could implement a new connector for Spark3 as a separate module; however, about half of the logic/code under the existing connector is independent of the interface and can potentially be reused across the Spark2 and Spark3 connectors. For this, I'm restructuring the packages similar to what was done for batch ingestion in #8560.
Change
In this PR, I'm refactoring the Spark Connector into two packages:
pinot-spark-connector --> (pinot-spark-common + pinot-spark-2-connector)
This is mostly a mechanical refactoring which moves packages around and renames fields/classes for clarity. The only addition is the backported (from Spark) CaseInsensitiveStringMap, which makes PinotDataSourceReadOptions reusable across implementations (see the comment above).
Testing
Usage and functionality of the Spark2 connector should be completely unchanged except for the renaming of the maven module. All the unit tests are preserved to validate the previous assumptions. I also ran the integration tests under ExampleSparkPinotConnectorTest to verify expected behavior; see the sketch below. To preview the full changes, including the Spark3 connector implementation, you can check this diff.
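As a concrete sanity check, an existing Spark2 read through the connector looks the same as before. A sketch, where the table name is illustrative and the option keys follow the connector's documented usage:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("pinot-connector-smoke-test")
  .master("local[*]")
  .getOrCreate()

// Read an example Pinot table through the connector; "airlineStats"
// is just an illustrative table name.
val df = spark.read
  .format("pinot")
  .option("table", "airlineStats")
  .option("tableType", "offline")
  .load()

df.show()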
Labels: refactor, cleanup, release-notes ('Pinot Spark Connector' module is renamed to 'Pinot Spark 2 Connector' for clarity)