Skip to content

[GLUTEN-8307][VL] Support Int64 Timestamp in parquet reader#8308

Merged
rui-mo merged 1 commit intoapache:mainfrom
zml1206:8307
Jan 7, 2025
Merged

[GLUTEN-8307][VL] Support Int64 Timestamp in parquet reader#8308
rui-mo merged 1 commit intoapache:mainfrom
zml1206:8307

Conversation

@zml1206
Copy link
Copy Markdown
Contributor

@zml1206 zml1206 commented Dec 23, 2024

What changes were proposed in this pull request?

facebookincubator/velox#11530 add Int64 Timestamp in parquet reader, remove config VELOX_SCAN_FILE_SCHEME_VALIDATION_ENABLED and enable related ut.
(Fixes: #8307)

How was this patch tested?

@github-actions github-actions bot added CORE works for Gluten Core VELOX labels Dec 23, 2024
@github-actions
Copy link
Copy Markdown

#8307

@github-actions
Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

@zhouyuan
Copy link
Copy Markdown
Member

CC @rui-mo

@zml1206
Copy link
Copy Markdown
Contributor Author

zml1206 commented Dec 23, 2024

Abnormal UT:
1.read dictionary and plain encoded timestamp_millis written as INT64

2024-12-23T07:32:06.0886442Z Error Source: RUNTIME
2024-12-23T07:32:06.0886573Z Error Code: INVALID_STATE
2024-12-23T07:32:06.0886706Z Retriable: False
2024-12-23T07:32:06.0887032Z Expression: logicalType
2024-12-23T07:32:06.0888011Z Context: Split [Hive: file:///opt/shims/spark34/spark_home/sql/core/src/test/resources/test-data/timemillis-in-i64.parquet 0 - 517] Task Gluten_Stage_874_TID_1128_VTID_54162
2024-12-23T07:32:06.0888212Z Additional Context: Operator: TableScan[0] 0
2024-12-23T07:32:06.0888364Z Function: TimestampColumnReader
2024-12-23T07:32:06.0888797Z File: /work/ep/build-velox/build/velox_ep/./velox/dwio/parquet/reader/TimestampColumnReader.h
2024-12-23T07:32:06.0888909Z Line: 99
2024-12-23T07:32:06.0889025Z Stack trace:
2024-12-23T07:32:06.0889215Z # 0  _ZN8facebook5velox7process10StackTraceC1Ei
2024-12-23T07:32:06.0889799Z # 1  _ZN8facebook5velox14VeloxExceptionC1EPKcmS3_St17basic_string_viewIcSt11char_traitsIcEES7_S7_S7_bNS1_4TypeES7_
2024-12-23T07:32:06.0890514Z # 2  _ZN8facebook5velox6detail14veloxCheckFailINS0_17VeloxRuntimeErrorENS1_22CompileTimeEmptyStringEEEvRKNS1_18VeloxCheckFailArgsET0_
2024-12-23T07:32:06.0891701Z # 3  _ZN8facebook5velox7parquet19ParquetColumnReader5buildERKSt10shared_ptrIKNS0_4TypeEERKS3_IKNS0_4dwio6common10TypeWithIdEERNS1_13ParquetParamsERNS0_6common8ScanSpecERNS0_6memory10MemoryPoolE
2024-12-23T07:32:06.0892802Z # 4  _ZN8facebook5velox7parquet18StructColumnReaderC1ERKSt10shared_ptrIKNS0_4TypeEERKS3_IKNS0_4dwio6common10TypeWithIdEERNS1_13ParquetParamsERNS0_6common8ScanSpecERNS0_6memory10MemoryPoolE
2024-12-23T07:32:06.0893975Z # 5  _ZN8facebook5velox7parquet19ParquetColumnReader5buildERKSt10shared_ptrIKNS0_4TypeEERKS3_IKNS0_4dwio6common10TypeWithIdEERNS1_13ParquetParamsERNS0_6common8ScanSpecERNS0_6memory10MemoryPoolE
2024-12-23T07:32:06.0894872Z # 6  _ZN8facebook5velox7parquet16ParquetRowReader4ImplC2ERKSt10shared_ptrINS1_10ReaderBaseEERKNS0_4dwio6common16RowReaderOptionsE
2024-12-23T07:32:06.0895565Z # 7  _ZN8facebook5velox7parquet16ParquetRowReaderC2ERKSt10shared_ptrINS1_10ReaderBaseEERKNS0_4dwio6common16RowReaderOptionsE
2024-12-23T07:32:06.0896148Z # 8  _ZNK8facebook5velox7parquet13ParquetReader15createRowReaderERKNS0_4dwio6common16RowReaderOptionsE
2024-12-23T07:32:06.0896952Z # 9  _ZN8facebook5velox9connector4hive11SplitReader15createRowReaderESt10shared_ptrINS0_6common14MetadataFilterEES4_IKNS0_7RowTypeEE
2024-12-23T07:32:06.0897887Z # 10 _ZN8facebook5velox9connector4hive11SplitReader12prepareSplitESt10shared_ptrINS0_6common14MetadataFilterEERNS0_4dwio6common17RuntimeStatisticsE
2024-12-23T07:32:06.0898459Z # 11 _ZN8facebook5velox9connector4hive14HiveDataSource8addSplitESt10shared_ptrINS1_14ConnectorSplitEE
2024-12-23T07:32:06.0898660Z # 12 _ZN8facebook5velox4exec9TableScan9getOutputEv
2024-12-23T07:32:06.0899380Z # 13 _ZZN8facebook5velox4exec6Driver11runInternalERSt10shared_ptrIS2_ERS3_INS1_13BlockingStateEERS3_INS0_9RowVectorEEENKUlvE8_clEv
2024-12-23T07:32:06.0899994Z # 14 _ZN8facebook5velox4exec6Driver11runInternalERSt10shared_ptrIS2_ERS3_INS1_13BlockingStateEERS3_INS0_9RowVectorEE
2024-12-23T07:32:06.0900340Z # 15 _ZN8facebook5velox4exec6Driver4nextEPN5folly10SemiFutureINS3_4UnitEEE
2024-12-23T07:32:06.0900674Z # 16 _ZN8facebook5velox4exec4Task4nextEPN5folly10SemiFutureINS3_4UnitEEE
2024-12-23T07:32:06.0900864Z # 17 _ZN6gluten24WholeStageResultIterator4nextEv
2024-12-23T07:32:06.0901165Z # 18 Java_org_apache_gluten_vectorized_ColumnarBatchOutIterator_nativeHasNext
2024-12-23T07:32:06.0901283Z # 19 0x00007fe65ea55bb0
  1. SPARK-40128 read DELTA_LENGTH_BYTE_ARRAY encoded strings
2024-12-23T07:32:06.1093694Z Error Source: RUNTIME
2024-12-23T07:32:06.1093825Z Error Code: INVALID_STATE
2024-12-23T07:32:06.1093935Z Reason: (0 vs. 0)
2024-12-23T07:32:06.1094049Z Retriable: False
2024-12-23T07:32:06.1094189Z Expression: fileOffset > 0
2024-12-23T07:32:06.1095113Z Context: Split [Hive: file:///opt/shims/spark34/spark_home/sql/core/src/test/resources/test-data/delta_length_byte_array.parquet 0 - 3072] Task Gluten_Stage_876_TID_1130_VTID_54163
2024-12-23T07:32:06.1095304Z Additional Context: Operator: TableScan[0] 0
2024-12-23T07:32:06.1095429Z Function: filterRowGroups
2024-12-23T07:32:06.1095778Z File: /work/ep/build-velox/build/velox_ep/velox/dwio/parquet/reader/ParquetReader.cpp
2024-12-23T07:32:06.1095882Z Line: 1005
2024-12-23T07:32:06.1095981Z Stack trace:
2024-12-23T07:32:06.1096154Z # 0  _ZN8facebook5velox7process10StackTraceC1Ei
2024-12-23T07:32:06.1096692Z # 1  _ZN8facebook5velox14VeloxExceptionC1EPKcmS3_St17basic_string_viewIcSt11char_traitsIcEES7_S7_S7_bNS1_4TypeES7_
2024-12-23T07:32:06.1097213Z # 2  _ZN8facebook5velox6detail14veloxCheckFailINS0_17VeloxRuntimeErrorERKSsEEvRKNS1_18VeloxCheckFailArgsET0_
2024-12-23T07:32:06.1097545Z # 3  _ZN8facebook5velox7parquet16ParquetRowReader4Impl15filterRowGroupsEv
2024-12-23T07:32:06.1098254Z # 4  _ZN8facebook5velox7parquet16ParquetRowReader4ImplC2ERKSt10shared_ptrINS1_10ReaderBaseEERKNS0_4dwio6common16RowReaderOptionsE
2024-12-23T07:32:06.1098934Z # 5  _ZN8facebook5velox7parquet16ParquetRowReaderC2ERKSt10shared_ptrINS1_10ReaderBaseEERKNS0_4dwio6common16RowReaderOptionsE
2024-12-23T07:32:06.1099686Z # 6  _ZNK8facebook5velox7parquet13ParquetReader15createRowReaderERKNS0_4dwio6common16RowReaderOptionsE
2024-12-23T07:32:06.1100461Z # 7  _ZN8facebook5velox9connector4hive11SplitReader15createRowReaderESt10shared_ptrINS0_6common14MetadataFilterEES4_IKNS0_7RowTypeEE
2024-12-23T07:32:06.1101476Z # 8  _ZN8facebook5velox9connector4hive11SplitReader12prepareSplitESt10shared_ptrINS0_6common14MetadataFilterEERNS0_4dwio6common17RuntimeStatisticsE
2024-12-23T07:32:06.1102000Z # 9  _ZN8facebook5velox9connector4hive14HiveDataSource8addSplitESt10shared_ptrINS1_14ConnectorSplitEE
2024-12-23T07:32:06.1102191Z # 10 _ZN8facebook5velox4exec9TableScan9getOutputEv
2024-12-23T07:32:06.1102819Z # 11 _ZZN8facebook5velox4exec6Driver11runInternalERSt10shared_ptrIS2_ERS3_INS1_13BlockingStateEERS3_INS0_9RowVectorEEENKUlvE8_clEv
2024-12-23T07:32:06.1103386Z # 12 _ZN8facebook5velox4exec6Driver11runInternalERSt10shared_ptrIS2_ERS3_INS1_13BlockingStateEERS3_INS0_9RowVectorEE
2024-12-23T07:32:06.1103766Z # 13 _ZN8facebook5velox4exec6Driver4nextEPN5folly10SemiFutureINS3_4UnitEEE
2024-12-23T07:32:06.1104094Z # 14 _ZN8facebook5velox4exec4Task4nextEPN5folly10SemiFutureINS3_4UnitEEE
2024-12-23T07:32:06.1104278Z # 15 _ZN6gluten24WholeStageResultIterator4nextEv
2024-12-23T07:32:06.1104609Z # 16 Java_org_apache_gluten_vectorized_ColumnarBatchOutIterator_nativeHasNext
2024-12-23T07:32:06.1104738Z # 17 0x00007fe65ea55bb0

cc @rui-mo do you know the reason?

@rui-mo
Copy link
Copy Markdown
Contributor

rui-mo commented Dec 23, 2024

@zml1206 I'm glad to investigate them this week. Do you think it makes sense to hold on this PR for a while?

@zml1206
Copy link
Copy Markdown
Contributor Author

zml1206 commented Dec 23, 2024

@zml1206 I'm glad to investigate them this week. Do you think it makes sense to hold on this PR for a while?

Thank you @rui-mo

@zml1206 zml1206 marked this pull request as draft December 25, 2024 03:54
@rui-mo
Copy link
Copy Markdown
Contributor

rui-mo commented Dec 26, 2024

1.read dictionary and plain encoded timestamp_millis written as INT64

Hi @zml1206, I opened facebookincubator/velox#11964 to address above failed case. And I assume we can enable this unit test after the PR being merged.

2.SPARK-40128 read DELTA_LENGTH_BYTE_ARRAY encoded strings

This case doesn't seem to be relevant to the timestamp reader. I suggest we not enabling it in this PR. Thanks.

@zml1206
Copy link
Copy Markdown
Contributor Author

zml1206 commented Dec 26, 2024

This case doesn't seem to be relevant to the timestamp reader. I suggest we not enabling it in this PR. Thanks.

facebookincubator/velox#11651, Is it the same problem? @rui-mo

@rui-mo
Copy link
Copy Markdown
Contributor

rui-mo commented Dec 26, 2024

@zml1206 Let me pick that PR to check if it is solved. Thanks for the pointer.

@zml1206
Copy link
Copy Markdown
Contributor Author

zml1206 commented Jan 7, 2025

@zml1206 Let me pick that PR to check if it is solved. Thanks for the pointer.

@rui-mo How is the progress on this issue?

@rui-mo
Copy link
Copy Markdown
Contributor

rui-mo commented Jan 7, 2025

@zml1206 Let me pick that PR to check if it is solved. Thanks for the pointer.

@rui-mo How is the progress on this issue?

@zml1206 I picked that PR but strangely the test SPARK-40128 read DELTA_LENGTH_BYTE_ARRAY encoded strings was not tested. Maybe it is related to my test environment.

I suggest we do not enable this test in this PR for the reasons: 1) it is not timestamp related. 2) it depends on another in-progress PR. facebookincubator/velox#11964 has been merged, and the other timestamp tests could be enabled.

@zml1206
Copy link
Copy Markdown
Contributor Author

zml1206 commented Jan 7, 2025

I suggest we do not enable this test in this PR for the reasons: 1) it is not timestamp related. 2) it depends on another in-progress PR. facebookincubator/velox#11964 has been merged, and the other timestamp tests could be enabled.

OK, I'll do it, thank you.

@github-actions
Copy link
Copy Markdown

github-actions bot commented Jan 7, 2025

Run Gluten Clickhouse CI on x86

@zml1206 zml1206 marked this pull request as ready for review January 7, 2025 05:34
@zml1206
Copy link
Copy Markdown
Contributor Author

zml1206 commented Jan 7, 2025

@rui-mo Updated, can you help take.a look? thank you.

Copy link
Copy Markdown
Contributor

@rui-mo rui-mo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Thanks.

@rui-mo rui-mo merged commit 1b36edc into apache:main Jan 7, 2025
@zml1206 zml1206 deleted the 8307 branch December 9, 2025 08:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CORE works for Gluten Core VELOX

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[VL] Support Int64 Timestamp in parquet reader

3 participants