Skip to content

Conversation

@cbalci
Copy link
Contributor

@cbalci cbalci commented Feb 23, 2023

This is the first of a two PR changes which will add Spark3 support for 'pinot-spark-connector'

Background
Apache Spark has changed the Datasource interface significantly between Spark2 and Spark3 so pinot-spark-connector doesn't work for Spark3. We can implement a new connector for Spark3 as a separate module, however about half of the logic/code under the existing connector is independent of the interface and can potentially be reused across Spark2 and Spark3 connectors. For this, I'm restructuring the packages similar to what was done for batch ingestion in #8560.

Change
In this PR, I'm refactoring Spark Connector into two packages as:
pinot-spark-connector --> ( pinot-spark-common + pinot-spark-2-connector )

This is mostly a mechanical refactoring which moves packages around and renames fields/classes for clarity. Only addition is the backported (from Spark) CaseInsensitiveStringMap to make PinotDataSourceReadOptions reusable across implementations (see comment below).

Testing
Usage and functionality of the Spark2 connector should be completely unchanged except for the renaming of the maven module. All the unit tests are preseved to ensure previous assumptions. I also ran the integration tests under ExampleSparkPinotConnectorTest to verify expected behavior.

To preview the full changes including the Spark3 Connector implementation you can check this diff.

refactor cleanup
release-notes ('Pinot Spark Connector' module is renamed to 'Pinot Spark 2 Connector' for clarity)

@@ -0,0 +1,201 @@
/**
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

org.apache.spark.sql.sources.v2.DataSourceOptions, which is used by Spark to contain user specified options is removed in Spark3 in favor of a new and more capable container CaseInsensitiveStringMap, which is compatible with DataSourceOption. I copied this class from Spark codebase into our 'pinot-spark-common' package as a drop in replacement for the former and reworked the PinotDataSourceReadOptions accordingly. Now the options class can be reused by both implementations (v2 and v3) consistently.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about make this an interface and have the wrapper of v2/v3 implementation ?

Copy link
Contributor Author

@cbalci cbalci Feb 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review @xiangfu0 !

I think this class enables the simplest interface we can expose to the implementor: Map[String,String].
If you take a look at the PinotDataSourceReadOptions below, it has two factory methods with following signatures:

object PinotDataSourceReadOptions {
...
  private[pinot] def from(optionsMap: util.Map[String, String]): PinotDataSourceReadOptions
  ...
  private[pinot] def from(options: CaseInsensitiveStringMap): PinotDataSourceReadOptions
...
}

With this, the shared PinotDataSourceOptions object can be created either by passing a CaseInsensitiveStringMap or a plain old Map which internally will be converted to a CaseInsensitiveStringMap. I guess we can even drop the second method and only accept Map[String, String] for simplicity.

Let me know if you meant something else.

@codecov-commenter
Copy link

codecov-commenter commented Feb 23, 2023

Codecov Report

Merging #10321 (94be858) into master (0f2d51c) will decrease coverage by 0.34%.
The diff coverage is 34.28%.

@@             Coverage Diff              @@
##             master   #10321      +/-   ##
============================================
- Coverage     70.42%   70.08%   -0.34%     
- Complexity     5103     5874     +771     
============================================
  Files          2017     2040      +23     
  Lines        109181   110308    +1127     
  Branches      16602    16740     +138     
============================================
+ Hits          76887    77313     +426     
- Misses        26901    27550     +649     
- Partials       5393     5445      +52     
Flag Coverage Δ
integration1 24.55% <ø> (-0.20%) ⬇️
integration2 24.47% <ø> (-0.13%) ⬇️
unittests1 67.65% <ø> (-0.04%) ⬇️
unittests2 13.82% <34.28%> (+0.12%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...pache/pinot/connector/spark/common/HttpUtils.scala 0.00% <ø> (ø)
.../apache/pinot/connector/spark/common/Logging.scala 6.66% <ø> (ø)
...ot/connector/spark/common/PinotClusterClient.scala 5.15% <0.00%> (ø)
...ache/pinot/connector/spark/common/exceptions.scala 20.00% <ø> (ø)
.../apache/pinot/connector/spark/common/package.scala 0.00% <0.00%> (ø)
...pinot/connector/spark/common/query/ScanQuery.scala 57.14% <0.00%> (ø)
...k/common/reader/PinotAbstractPartitionReader.scala 0.00% <0.00%> (ø)
...ark/common/reader/PinotGrpcServerDataFetcher.scala 0.00% <0.00%> (ø)
...r/spark/common/reader/PinotServerDataFetcher.scala 0.00% <0.00%> (ø)
...nnector/spark/common/CaseInsensitiveStringMap.java 35.41% <35.41%> (ø)
... and 156 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@xiangfu0 xiangfu0 requested review from KKcorps and xiangfu0 February 23, 2023 19:30
Copy link
Contributor

@xiangfu0 xiangfu0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@KKcorps KKcorps added release-notes Referenced by PRs that need attention when compiling the next release notes cleanup refactor labels Feb 24, 2023
Copy link
Contributor

@KKcorps KKcorps left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! I was honestly worried if some new dependency can break existing spark pipelines but so far everything introduced is under provided scope. No issues here!

@KKcorps
Copy link
Contributor

KKcorps commented Mar 7, 2023

@cbalci Should I go ahead and merge this?

@cbalci
Copy link
Contributor Author

cbalci commented Mar 7, 2023

Thanks for the review @KKcorps ! Yes, please go ahead. I have a couple follow up PRs waiting for this.

@KKcorps KKcorps merged commit 4eeaf82 into apache:master Mar 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cleanup refactor release-notes Referenced by PRs that need attention when compiling the next release notes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants