Refactor Spark Connector into two modules for reusability #10321
Conversation
org.apache.spark.sql.sources.v2.DataSourceOptions, which Spark uses to carry user-specified options, was removed in Spark3 in favor of a new and more capable container, CaseInsensitiveStringMap, which is compatible with DataSourceOptions. I copied this class from the Spark codebase into our 'pinot-spark-common' package as a drop-in replacement for the former and reworked PinotDataSourceReadOptions accordingly. Now the options class can be reused by both implementations (v2 and v3) consistently.
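For context, the core behavior being backported is simple: option keys are normalized so lookups ignore case. A minimal sketch of that idea (not the actual backported class, which mirrors Spark's full java.util.Map-based implementation):

import java.util.Locale
import scala.collection.JavaConverters._

// Sketch only: normalizes keys to lower case so option lookups ignore casing.
class CaseInsensitiveStringMap(original: java.util.Map[String, String]) {
  private val delegate: Map[String, String] =
    original.asScala.map { case (k, v) => k.toLowerCase(Locale.ROOT) -> v }.toMap

  def get(key: String): String =
    delegate.getOrElse(key.toLowerCase(Locale.ROOT), null)

  def getOrDefault(key: String, default: String): String =
    delegate.getOrElse(key.toLowerCase(Locale.ROOT), default)

  def containsKey(key: String): Boolean =
    delegate.contains(key.toLowerCase(Locale.ROOT))
}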
How about making this an interface and having v2/v3 wrapper implementations?
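For illustration, the suggested alternative might look roughly like this; every name below is hypothetical and none of them appear in the PR:

// Hypothetical sketch of the interface-plus-wrappers idea.
trait ReadOptions {
  def get(key: String): Option[String]
}

// Wraps a Spark2-style java.util.Map of options
class V2ReadOptions(opts: java.util.Map[String, String]) extends ReadOptions {
  override def get(key: String): Option[String] = Option(opts.get(key))
}

// Wraps the Spark3 CaseInsensitiveStringMap
class V3ReadOptions(opts: org.apache.spark.sql.util.CaseInsensitiveStringMap) extends ReadOptions {
  override def get(key: String): Option[String] = Option(opts.get(key))
}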
Thanks for the review, @xiangfu0!
I think this class enables the simplest interface we can expose to the implementor: Map[String, String].
If you take a look at PinotDataSourceReadOptions below, it has two factory methods with the following signatures:
object PinotDataSourceReadOptions {
...
private[pinot] def from(optionsMap: util.Map[String, String]): PinotDataSourceReadOptions
...
private[pinot] def from(options: CaseInsensitiveStringMap): PinotDataSourceReadOptions
...
}
With this, the shared PinotDataSourceReadOptions object can be created either by passing a CaseInsensitiveStringMap or a plain old Map, which is internally converted to a CaseInsensitiveStringMap. I guess we could even drop the second method and only accept Map[String, String] for simplicity.
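For illustration, the plain-Map overload can simply wrap its input and delegate, so both entry points share one parsing path. A sketch of that conversion, not the PR's exact code (package path assumed, field parsing elided, and the backported CaseInsensitiveStringMap assumed to live in the same package):

package org.apache.pinot.connector.spark.common // package path assumed for illustration

import java.util

class PinotDataSourceReadOptions /* parsed fields elided in this sketch */

object PinotDataSourceReadOptions {
  // The plain-Map overload wraps its input and delegates to the
  // CaseInsensitiveStringMap overload.
  private[pinot] def from(optionsMap: util.Map[String, String]): PinotDataSourceReadOptions =
    from(new CaseInsensitiveStringMap(optionsMap))

  private[pinot] def from(options: CaseInsensitiveStringMap): PinotDataSourceReadOptions = {
    // ... read individual option keys from `options` here ...
    new PinotDataSourceReadOptions
  }
}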
Let me know if you meant something else.
Codecov Report
@@ Coverage Diff @@
## master #10321 +/- ##
============================================
- Coverage 70.42% 70.08% -0.34%
- Complexity 5103 5874 +771
============================================
Files 2017 2040 +23
Lines 109181 110308 +1127
Branches 16602 16740 +138
============================================
+ Hits 76887 77313 +426
- Misses 26901 27550 +649
- Partials 5393 5445 +52
xiangfu0 left a comment:
LGTM
KKcorps left a comment:
LGTM! I was honestly worried that some new dependency might break existing Spark pipelines, but so far everything introduced is under provided scope. No issues here!
@cbalci Should I go ahead and merge this?
Thanks for the review, @KKcorps! Yes, please go ahead. I have a couple of follow-up PRs waiting for this.
This is the first of two PRs that will add Spark3 support for 'pinot-spark-connector'.
Background
Apache Spark changed the Datasource interface significantly between Spark2 and Spark3, so pinot-spark-connector doesn't work with Spark3. We could implement a new connector for Spark3 as a separate module; however, about half of the logic/code under the existing connector is independent of the interface and can potentially be reused across the Spark2 and Spark3 connectors. For this, I'm restructuring the packages similar to what was done for batch ingestion in #8560.
Change
In this PR, I'm refactoring the Spark Connector into two packages:
pinot-spark-connector --> (pinot-spark-common + pinot-spark-2-connector)
This is mostly a mechanical refactoring which moves packages around and renames fields/classes for clarity. The only addition is the backported (from Spark) CaseInsensitiveStringMap, which makes PinotDataSourceReadOptions reusable across implementations (see the comment above).
Testing
Usage and functionality of the Spark2 connector should be completely unchanged except for the renaming of the maven module. All the unit tests are preserved to validate the previous assumptions. I also ran the integration tests under ExampleSparkPinotConnectorTest to verify expected behavior; see the sketch below. To preview the full changes, including the Spark3 connector implementation, you can check this diff.
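As a concrete sanity check, an existing Spark2 read through the connector looks the same as before. A sketch, where the table name is illustrative and the option keys follow the connector's documented usage:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("pinot-connector-smoke-test")
  .master("local[*]")
  .getOrCreate()

// Read an example Pinot table through the connector; "airlineStats"
// is just an illustrative table name.
val df = spark.read
  .format("pinot")
  .option("table", "airlineStats")
  .option("tableType", "offline")
  .load()

df.show()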
Labels: refactor, cleanup, release-notes ('Pinot Spark Connector' module is renamed to 'Pinot Spark 2 Connector' for clarity)