[SPARK-36879][SQL] Support Parquet v2 data page encoding (DELTA_BINARY_PACKED) for the vectorized path #34471
parthchandra wants to merge 9 commits into apache:master
Conversation
Jenkins, retest this please
sunchao left a comment:
Thanks @parthchandra for working on this! I left some comments.
.gitignore (Outdated)
+1 with @sunchao's comment. Please remove this from this PR.
This doesn't seem to be used anywhere.
But not for long (there's a horrible pun in here somewhere).
I need this for the vectorized implementation of DeltaByteArrayReader (which I did not include, to make review easier).
In that case, can we put this together with the follow-up PR?
I just knew you would say that :). Done.
Maybe add some comments for this? What are `c`, `rowId`, and `val` for?
It seems better to have an abstract class that extends ValuesReader and implements VectorizedValuesReader, with this default behavior defined once, rather than repeating the same thing in all the different value readers.
This can be done separately, though.
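A minimal sketch of what such a shared base class could look like. This is illustrative only: the real class would extend Parquet's `ValuesReader` and implement Spark's `VectorizedValuesReader`; here it is self-contained, and the class and method names are assumptions, not Spark's actual API.

```java
// Sketch of a shared base class for vectorized value readers. The idea from
// the review: put the "unsupported by default" behavior in one place so each
// concrete reader only overrides the methods it actually implements.
abstract class VectorizedReaderBase {
  // Scalar entry points default to throwing; a concrete reader such as a
  // delta-binary-packed integer reader would override readInteger/readLong.
  public byte readByte() {
    throw new UnsupportedOperationException("readByte is not supported");
  }

  public int readInteger() {
    throw new UnsupportedOperationException("readInteger is not supported");
  }

  public long readLong() {
    throw new UnsupportedOperationException("readLong is not supported");
  }

  public void skip() {
    throw new UnsupportedOperationException("skip is not supported");
  }
}
```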
Why indeed. (IntelliJ generated code for unimplemented methods.)
I think we'll need to implement these too.
Oh dear. I implemented only the methods the original PR had implemented. On closer look, we also need support for the byte, short, date, timestamp, year-month interval, and day-time interval data types, which are stored as int32 or int64.
Perf note: rebased dates and timestamps are a backward-compatibility fix and incur the penalty of checking whether each value needs to be rebased.
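For illustration, the smaller logical types ride on the int32 physical type, so a vectorized read loop can decode ints and narrow them. A self-contained sketch, where `IntSource` is a hypothetical stand-in for the real decoder (it is not a Spark or Parquet interface):

```java
// Sketch: Parquet stores byte/short (and date) values using the int32
// physical type, so vectorized readByte/readShort loops decode ints and
// narrow each one. IntSource is a hypothetical stand-in for the delta
// decoder; the real reader writes into a WritableColumnVector instead of
// returning arrays.
interface IntSource {
  int nextInt();
}

class NarrowingReader {
  private final IntSource in;

  NarrowingReader(IntSource in) {
    this.in = in;
  }

  // Decode `total` int32 values and narrow each one to a byte.
  byte[] readBytes(int total) {
    byte[] out = new byte[total];
    for (int i = 0; i < total; i++) {
      out[i] = (byte) in.nextInt();
    }
    return out;
  }

  // Decode `total` int32 values and narrow each one to a short.
  short[] readShorts(int total) {
    short[] out = new short[total];
    for (int i = 0; i < total; i++) {
      out[i] = (short) in.nextInt();
    }
    return out;
  }
}
```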
I think this can be done more efficiently; for instance, we don't need to unpack the bits anymore, and don't need to compute the original value from the delta, etc.
I think we do. The original unit tests interleave reads and skips. To continue reading after a skip, we need to have decoded the previous value.
.../java/org/apache/spark/sql/execution/datasources/parquet/VectorizedDeltaByteArrayReader.java (Outdated; thread resolved)
cc @sadikovi @viirya @dongjoon-hyun too
Thank you for the review, @sunchao! Let me address the comments.
Could you use two-space indentation like the other parts of this file, @parthchandra?
The import order is a little strange. Could you group the java imports (lines 30 and 19) together as the first group?
Nit: let's remove the redundant empty line.
Let's make these two lines a one-liner:
- readValues(total, null, -1, (w, r, v) -> {
- });
+ readValues(total, null, -1, (w, r, v) -> {});
Could you put this at the beginning, before `int remaining = total;`?
Looks like it may take some time to address some of the review comments. Marking this PR as draft in the meantime.
sunchao left a comment:
Thanks a lot for updating this, @parthchandra! Overall this looks pretty good. I think we just need to address the issue with the benchmark and attach the result to the PR. You can find out how to get benchmark results using the GitHub workflow here.
Nit: maybe revise this message a bit, since "total value count is + valuesRead" looks a bit confusing.
This won't work yet because of the BooleanType added recently.
Error:
[error] Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'sum(parquetv2table.id)' due to data type mismatch: function sum requires numeric or interval types, not boolean; line 1 pos 7;
[error] 'Aggregate [unresolvedalias(sum(id#40), None)]
[error] +- SubqueryAlias parquetv2table
[error] +- View (`parquetV2Table`, [id#40])
[error] +- Relation [id#40] parquet
Uh, rebase issue. Fixed.
@parthchandra could you address the unit test failures?
The unit tests are failing in parts that I am not familiar with. Previously, re-running the tests had worked, but this time the tests are failing every time. Can I get some help figuring out where the problem is?
If you click the SparkQA test build link you should see the failed tests.
Thank you @sunchao. I had gone through the log and failed to see the test(s) that had failed. One of the unit tests was checking for the error message that accompanied an exception, and as part of the review I had changed that message! Updated the test. The tests should pass now.
Committed to master branch, thanks @parthchandra!
saveAsCsvTable(testDf, dir.getCanonicalPath + "/csv")
saveAsJsonTable(testDf, dir.getCanonicalPath + "/json")
saveAsParquetTable(testDf, dir.getCanonicalPath + "/parquet")
saveAsParquetV2Table(testDf, dir.getCanonicalPath + "/parquetV2")
@parthchandra Maybe we should update the benchmark result of DataSourceReadBenchmark.
I found that there are still unsupported encodings in Data Page V2, such as RLE for Boolean. It seems it is not yet time to update the benchmark; please ignore my previous comment.
@LuciferYang I was getting ready to set up a PR for the RLE/Boolean encoding and noticed that you have done so. Thank you!
Adding back the benchmark in a new PR.
private ByteBufferInputStream in;

// temporary buffers used by readByte, readShort, readInteger, and readLong
byte byteVal;
Should these 4 fields be private?
withSQLConf(SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE.key -> tsOutputType) {
  val df = Seq.tabulate(N)(rowFunc).toDF("dict", "plain")
    .select($"dict".cast(catalystType), $"plain".cast(catalystType))
withSQLConf(SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE.key -> tsOutputType) {

  Seq.tabulate(N)(_ => Row(nonRebased)))
}
}
}

What changes were proposed in this pull request?
Implements a vectorized version of the Parquet reader for the DELTA_BINARY_PACKED encoding.
This PR includes a previous PR for this issue, which passed read requests through to the Parquet implementation and was not vectorized. The current PR builds on top of that PR (hence both are included).
Why are the changes needed?
Currently, Spark throws an exception when reading data with these encodings if the vectorized reader is enabled.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Additional unit tests for the encoding, for both long and integer types (mirroring the unit tests in the Parquet implementation).
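For context, data pages with this encoding are produced by the Parquet v2 writer. A hedged usage sketch, assuming the `parquet.writer.version` option is propagated to the Parquet writer via the Hadoop configuration (the PR's tests set the equivalent configuration directly); the path and class name are illustrative:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ParquetV2Example {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .master("local[1]")
        .appName("ParquetV2Example")
        .getOrCreate();

    // Write with the Parquet v2 writer; integer/long columns then use
    // DELTA_BINARY_PACKED data pages.
    spark.range(0, 1000)
        .write()
        .option("parquet.writer.version", "v2")
        .mode("overwrite")
        .parquet("/tmp/parquet_v2_example");

    // With this PR, the vectorized reader (enabled by default) can read
    // these pages instead of throwing an exception.
    Dataset<Row> df = spark.read().parquet("/tmp/parquet_v2_example");
    df.show(5);
    spark.stop();
  }
}
```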