[SPARK-37980][SQL] Extend METADATA column to support row indexes for Parquet files #37228
ala wants to merge 13 commits into apache:master
Conversation
// a subset of rows in the row group is going to be read. Note that there is a name
// collision here: these row indexes (unlike ones this class is generating) are counted
// starting from 0 in each of the row groups.
rowIndexIterator = pages.getRowIndexes.get.asScala.map(idx => idx + startingRowIdx)
Wondering if PageReadStore.getRowIndexes() would return continuous row indexes or not?
If the row indexes are continuous, then we don't need to store a long value per row. We can have something like RangeColumnVector(startIdx, length) to save the cost of computing and storing the row index vector (similar idea to ConstantColumnVector). cc @cloud-fan.
I believe this is not guaranteed. These row indexes are supposed to be produced as a result of page-level min/max filtering. So if there are 3 pages in a column chunk (the intersection of a row group and a column) and the 1st and 3rd match the predicate, but the 2nd does not, we should be getting a non-continuous range of indexes.
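To make the scenario concrete, here is a small self-contained Java sketch (hypothetical names and page sizes; not Spark or parquet-mr code) of how page-level predicate filtering yields non-contiguous within-row-group indexes, and how adding the row group's starting row produces file-level indexes, mirroring the idx + startingRowIdx mapping in the quoted code:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: 3 pages of 100 rows each in a row group; pages 0 and 2
// match the predicate, page 1 does not. The surviving within-row-group indexes
// are non-contiguous, so a simple (start, length) range cannot represent them.
public class PageFilterSketch {
    static List<Long> survivingRowIndexes(int pageSize, boolean[] pageMatches,
                                          long startingRowIdx) {
        List<Long> result = new ArrayList<>();
        for (int page = 0; page < pageMatches.length; page++) {
            if (!pageMatches[page]) continue;  // page pruned by min/max stats
            for (int i = 0; i < pageSize; i++) {
                // within-row-group index, offset by the row group's first row in the file
                result.add((long) page * pageSize + i + startingRowIdx);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Long> idx = survivingRowIndexes(100, new boolean[]{true, false, true}, 1000L);
        System.out.println(idx.get(99));   // 1099: last row of page 0
        System.out.println(idx.get(100));  // 1200: page 1 skipped, jump to page 2
    }
}
```

Note the jump from 1099 to 1200: the 100 rows of the pruned middle page never appear, which is why a per-row representation (rather than a range) is needed in general.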
Would it be possible to add a test for this scenario?
Yaohua628
left a comment
Thanks for working on it!
Not familiar with the Parquet part, but left some questions and comments around _metadata, thanks!
val fileFormatReaderGeneratedMetadataColumns: Seq[Attribute] =
  metadataColumns.map(_.name).flatMap {
    case FileFormat.ROW_INDEX =>
      Some(AttributeReference(FileFormat.ROW_INDEX_TEMPORARY_COLUMN_NAME, LongType)())
A quick question: what if the user schema contains a column _tmp_metadata_row_index? Would it be overridden?
Maybe you can create an AttributeReference with a __metadata_col in its metadata field (see the usage of __metadata_col: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/package.scala#L188) and pattern match on it afterward, just to be safe?
sadikovi
left a comment
Thanks for working on this PR. I did the first pass on the changes and left a few comments.
My questions are:
- Would you be able to provide performance numbers to see how row index affects reads and writes before and after the change?
- I think we may need to have a way to opt out depending on the first question. Would you be able to take a look into that too?
I may need to review it again later as I might have missed context on file metadata and row index.
cc @sunchao
this.isPrimitive = column.isPrimitive();

if (missingColumns.contains(column)) {
  if (ParquetRowIndexUtil.isRowIndexColumn(column)) {
It's a string comparison (so O(column name length), but in practice cheaper) that happens at most once per column per scan task. I believe it's in the same ballpark as all the other initialization happening at the beginning of the scan.
@sadikovi To answer the questions about performance:
I don't quite see how this change would impact write performance. There are no changes to the write path, only to the read path.
I am not really sure what the opt-out would look like. The user can always just not read the column.
Would you be able to provide performance numbers for reads with and without column index or quote the difference with and without row index enabled that you have observed?
By the way, row group metadata row count may not be accurate in some instances. We have seen files where the row count was incorrect. How will the newly added code behave in this scenario?
Can you update the PR title and description to state that this is for Parquet only? It was not particularly clear from the description.
object RowIndexUtil {
  def findRowIndexColumnIndexInSchema(sparkSchema: StructType): Int = {
Interesting name, so it is a column index for the RowIndex column, isn't it?
Yes. If you have a different suggestion, we can change it.
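For context, the lookup the method name describes can be sketched in isolation (a hypothetical Java analogue with made-up field names, not the Spark StructType implementation): find the position of the row index column within a schema, or signal its absence.

```java
// Sketch: locate a column's position in a schema by name, returning -1 when
// absent, analogous to finding the row index column in a Spark StructType.
public class SchemaLookupSketch {
    static int findColumnIndex(String[] fieldNames, String target) {
        for (int i = 0; i < fieldNames.length; i++) {
            if (fieldNames[i].equals(target)) return i;
        }
        return -1;  // column not present in this schema
    }

    public static void main(String[] args) {
        String[] schema = {"id", "value", "_tmp_metadata_row_index"};
        System.out.println(findColumnIndex(schema, "_tmp_metadata_row_index")); // 2
        System.out.println(findColumnIndex(schema, "missing"));                 // -1
    }
}
```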
test(s"reading ${FileFormat.ROW_INDEX_TEMPORARY_COLUMN_NAME} - present in a table") {
  withReadDataFrame("parquet", extraCol = FileFormat.ROW_INDEX_TEMPORARY_COLUMN_NAME) { df =>
    // Column values are read from the file, rather than populated with generated row indexes.
    // FIXME
@ala would you be able to create a follow-up ticket to address this so we don't forget? 🙂
@sadikovi That is really bad. I think such files should be considered corrupt data, and it is expected that features might not work correctly with them. Can you maybe point me to such files, so that I can test the exact behavior?
sadikovi
left a comment
Looks good, thanks for making the changes! 👍
Regarding metadata count, that would be considered a bug in Parquet anyway. I may need some time to find those files. We can always address it in a follow-up.
@sadikovi The cost of reading the row_index column is in the same ballpark as the other metadata columns, and these numbers are in the same ballpark as for vanilla reads.
Can we have a committer merge this? The failing test looks unrelated to the PR.
}
// If needed, compute row indexes within a file.
if (rowIndexGenerator != null) {
  rowIndexGenerator.populateRowIndex(columnVectors, num);
Sorry, I'm not very familiar with the Parquet read codebase; does this happen after row group skipping or not?
Was looking through the code and it seems like it happens after row group skipping. Also, the tests in ParquetRowIndexSuite check that the rowIndex column doesn't have any of the values from the skipped row groups.
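A minimal sketch of the idea (hypothetical names, not the actual rowIndexGenerator code): a generator consumes an iterator of file-level row indexes that already excludes skipped row groups, and fills a batch-sized column, so pruned rows simply never contribute values.

```java
import java.util.Iterator;
import java.util.List;

// Sketch: populate a batch's row-index column from an iterator that already
// excludes rows from skipped row groups / pruned pages.
public class RowIndexGeneratorSketch {
    private final Iterator<Long> fileRowIndexes;

    RowIndexGeneratorSketch(Iterator<Long> fileRowIndexes) {
        this.fileRowIndexes = fileRowIndexes;
    }

    // Fill the first `num` slots of the batch column with the next row indexes.
    void populateRowIndex(long[] rowIndexColumn, int num) {
        for (int i = 0; i < num; i++) {
            rowIndexColumn[i] = fileRowIndexes.next();
        }
    }

    public static void main(String[] args) {
        // Suppose the row group holding rows 0-2 was skipped; only rows 3-5 survive.
        RowIndexGeneratorSketch gen =
            new RowIndexGeneratorSketch(List.of(3L, 4L, 5L).iterator());
        long[] batch = new long[3];
        gen.populateRowIndex(batch, 3);
        System.out.println(java.util.Arrays.toString(batch)); // [3, 4, 5]
    }
}
```

Because the generator only ever sees surviving row indexes, the populated batch can never contain values from skipped row groups, which is what the ParquetRowIndexSuite tests mentioned above verify.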
// A name for a temporary column that holds row indexes computed by the file format reader
// until they can be placed in the _metadata struct.
val ROW_INDEX_TEMPORARY_COLUMN_NAME = s"_tmp_metadata_$ROW_INDEX"
we add this name because row_index is more likely to conflict with data columns?
val fileFormatReaderGeneratedMetadataColumns: Seq[Attribute] =
  metadataColumns.map(_.name).flatMap {
    case FileFormat.ROW_INDEX =>
      if ((readDataColumns ++ partitionColumns).map(_.name)
shall we consider case sensitivity?
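To illustrate the concern, a small sketch (illustrative only; Spark's own name resolution is governed by the spark.sql.caseSensitive configuration) of how a plain name comparison and a case-insensitive one can disagree about a conflict:

```java
// Sketch: comparing column names with configurable case sensitivity,
// illustrating why a plain `.map(_.name).contains(...)` check may miss
// conflicts under case-insensitive resolution.
public class NameResolverSketch {
    static boolean namesConflict(String a, String b, boolean caseSensitive) {
        return caseSensitive ? a.equals(b) : a.equalsIgnoreCase(b);
    }

    public static void main(String[] args) {
        String user = "_TMP_Metadata_Row_Index";
        String internal = "_tmp_metadata_row_index";
        System.out.println(namesConflict(user, internal, true));  // false
        System.out.println(namesConflict(user, internal, false)); // true
    }
}
```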
    case _ => None
  }

val outputSchema = (readDataColumns ++ fileFormatReaderGeneratedMetadataColumns).toStructType
shall we rename it to outputDataSchema? otherwise it's a bit confusing that this is inconsistent with outputAttributes
case FileFormat.ROW_INDEX =>
  fileFormatReaderGeneratedMetadataColumns
    .filter(_.name == FileFormat.ROW_INDEX_TEMPORARY_COLUMN_NAME)
    .head.withName(FileFormat.ROW_INDEX)
nit: filter(...).head -> find(...).get
object RowIndexUtil {
  def findRowIndexColumnIndexInSchema(sparkSchema: StructType): Int = {
    sparkSchema.fields.zipWithIndex.find { case (field: StructField, _: Int) =>
      field.name == FileFormat.ROW_INDEX_TEMPORARY_COLUMN_NAME
isn't it always the last column?
This is the last "data" column in the outputAttributes, but there are also partition and metadata columns.
I am not sure whether the order of metadata columns in the schema is kept even if users add additional metadata columns.
override def getCurrentValue: InternalRow = {
  val row = parent.getCurrentValue
  row.setLong(rowIndexColumnIdx, parent.getCurrentRowIndex)
ah, so parquet-mr already provides APIs to get row index?
Yes, this was recently introduced into parquet-mr.
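The pattern in the quoted code, a wrapper that injects the current row index into each returned row, can be sketched in isolation (hypothetical interfaces and a long[] row buffer for simplicity; not parquet-mr's actual API):

```java
// Sketch: decorate a reader so each returned row carries its file-level row
// index in a designated column slot, mirroring row.setLong(rowIndexColumnIdx, ...).
public class RowIndexInjectorSketch {
    interface IndexedReader {
        long[] getCurrentValue();  // mutable row buffer
        long getCurrentRowIndex(); // file-level index of the current row
    }

    static long[] getCurrentValueWithIndex(IndexedReader parent, int rowIndexColumnIdx) {
        long[] row = parent.getCurrentValue();
        row[rowIndexColumnIdx] = parent.getCurrentRowIndex();
        return row;
    }

    public static void main(String[] args) {
        // Fake reader: one data column (value 42) plus an empty row-index slot.
        IndexedReader fake = new IndexedReader() {
            public long[] getCurrentValue() { return new long[]{42L, 0L}; }
            public long getCurrentRowIndex() { return 7L; }
        };
        long[] row = getCurrentValueWithIndex(fake, 1);
        System.out.println(row[0] + "," + row[1]); // 42,7
    }
}
```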
It seems the Python linter is broken in GA, cc @HyukjinKwon
Thanks, merging to master!
What changes were proposed in this pull request?
This change adds a row_index column to the _metadata struct. This column allows us to uniquely identify rows read from a given file with an index number. The n-th row in a given file will be assigned row index n-1 in every scan of the file, irrespective of file splitting and data skipping in use.

The new column requires file-format-specific support. This change introduces Parquet support; other formats can follow later.
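The splitting-invariance property can be sketched as follows (a hypothetical illustration, not Spark code): two scan tasks reading disjoint splits of the same file each assign row indexes offset by the split's first row, so the n-th row of the file always gets index n-1 no matter how the file was split.

```java
// Sketch: row indexes for a split are just the split's starting row offset
// plus a local counter, so every split of the same file agrees on indexes.
public class SplitRowIndexSketch {
    static long[] indexesForSplit(long splitStartRow, int numRows) {
        long[] idx = new long[numRows];
        for (int i = 0; i < numRows; i++) {
            idx[i] = splitStartRow + i;  // file-level index, split-independent
        }
        return idx;
    }

    public static void main(String[] args) {
        long[] split1 = indexesForSplit(0L, 3);  // rows 0-2 of the file
        long[] split2 = indexesForSplit(3L, 2);  // rows 3-4 of the file
        System.out.println(java.util.Arrays.toString(split1)); // [0, 1, 2]
        System.out.println(java.util.Arrays.toString(split2)); // [3, 4]
    }
}
```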
Why are the changes needed?
Row indexes can be used in a variety of ways. A (fileName, rowIndex) tuple uniquely identifies a row in a table. This information can be used to mark rows, e.g. to create an indexer.
Does this PR introduce any user-facing change?
Yes. With this change, customers will be able to access the _metadata.row_index metadata column when reading Parquet data. The schema of the _metadata column remains unchanged for file formats without row index support.

How was this patch tested?
- FileMetadataStructSuite.scala to make sure the feature works correctly in different scenarios (supported/unsupported file format, batch/record reads, on/off heap memory...).
- ParquetRowIndexSuite.scala to make sure the row indexes are generated correctly for Parquet files in conjunction with any combination of data skipping features.
- FileMetadataStructRowIndexSuite to account for the new column in the _metadata struct.