feat(sql): add column projection pushdown for read_parquet() #6551

bluestreak01 merged 29 commits into master from …

Conversation
Important: Review skipped. Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI. You can disable this status message by setting the …

Walkthrough: This PR introduces column projection support for Parquet reads by creating a new …

Changes
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Possibly related PRs
Suggested labels
Suggested reviewers
Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
read_parquet support projection push down
@coderabbitai review
✅ Actions performed: Review triggered.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
core/src/main/java/io/questdb/cairo/sql/PageFrameMemoryPool.java (1)
261-278: Add validation to ensure projected column indexes are within parquet column count.

At line 276 in PageFrameMemoryPool.java, the code uses `parquetColumnIndex` to index into `fromParquetColumnIndexes` (sized to `parquetMetadata.getColumnCount()`). The current validation at line 261 only checks column count, not individual index bounds. With projection pushdown, selected column indexes can be non-consecutive. Add bounds checking:

```diff
 fromParquetColumnIndexes.setAll(parquetMetadata.getColumnCount(), -1);
 for (int i = 0, n = addressCache.getColumnCount(); i < n; i++) {
     final int parquetColumnIndex = addressCache.getColumnIndexes().getQuick(i);
+    if (parquetColumnIndex < 0 || parquetColumnIndex >= parquetMetadata.getColumnCount()) {
+        throw CairoException.nonCritical()
+                .put("parquet column index out of range [index=")
+                .put(parquetColumnIndex)
+                .put(", parquetColumnCount=")
+                .put(parquetMetadata.getColumnCount())
+                .put(']');
+    }
     final int columnType = addressCache.getColumnTypes().getQuick(i);
     parquetColumns.add(parquetColumnIndex);
     fromParquetColumnIndexes.setQuick(parquetColumnIndex, i);
     parquetColumns.add(columnType);
 }
```
🧹 Nitpick comments (6)
core/src/main/java/io/questdb/cairo/ProjectableRecordCursorFactory.java (3)
34-36: Consider adding null validation for the metadata parameter.

The constructor accepts `metadata` without validation. Adding a null check would prevent potential NPEs and make the contract clearer. 🔎 Apply this diff to add null validation:

```diff
 public ProjectableRecordCursorFactory(RecordMetadata metadata) {
+    if (metadata == null) {
+        throw new IllegalArgumentException("metadata cannot be null");
+    }
     this.metadata = metadata;
 }
```
30-58: Add javadoc for this public API class.

This abstract class serves as a public API but lacks documentation explaining:
- Its purpose (enabling column projection for cursor factories)
- When to extend it vs. implementing `RecordCursorFactory` directly
- The lifecycle and usage contract of `setQueryProjectedMetadata()`
- Threading and concurrency guarantees

Adding comprehensive javadoc would significantly improve maintainability and help future contributors understand the projection mechanism.
43-53: Document the lifecycle and contract of `setQueryProjectedMetadata()`.

The method has no documentation and accepts null without validation. It can also be called multiple times, silently overwriting previous values. While the field is only written once per factory instance (in `SqlCodeGenerator` during planning), the lack of documentation makes the intended usage unclear:
- Should this be called exactly once, or multiple times?
- When should it be called relative to `getCursor()`?
- Is null a valid value, or should it be rejected?

Consider adding javadoc explaining the expected lifecycle and, if appropriate, adding validation (e.g., rejecting null or preventing multiple calls).
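The validation the comment suggests can be sketched as a guarded, one-shot setter. This is a hypothetical illustration with made-up names, not the QuestDB API: `RecordMetadata` is stubbed as `Object`, and the guard rejects null and refuses silent overwrites.

```java
// Hypothetical sketch of a guarded, one-shot projection setter: rejects
// null and refuses to be called twice. Names are illustrative only.
public class ProjectedMetadataHolder {
    private Object queryProjectedMetadata; // stands in for RecordMetadata

    public void setQueryProjectedMetadata(Object metadata) {
        if (metadata == null) {
            throw new IllegalArgumentException("projected metadata cannot be null");
        }
        if (this.queryProjectedMetadata != null) {
            throw new IllegalStateException("projected metadata already set");
        }
        this.queryProjectedMetadata = metadata;
    }

    public Object getMetadata() {
        return queryProjectedMetadata;
    }

    public static void main(String[] args) {
        ProjectedMetadataHolder holder = new ProjectedMetadataHolder();
        holder.setQueryProjectedMetadata("projected-columns");
        System.out.println(holder.getMetadata());
        try {
            holder.setQueryProjectedMetadata("again"); // second call must fail
        } catch (IllegalStateException e) {
            System.out.println("second call rejected");
        }
    }
}
```

Whether a one-shot guard is appropriate depends on the answer to the lifecycle questions above; if re-planning legitimately re-projects the same factory, the overwrite check would be wrong.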
core/src/main/java/io/questdb/griffin/SqlCodeGenerator.java (2)
1196-1221: Verify that projection covers all columns used in filters/ordering for function tables.

In `generateFunctionQuery`, when the table function factory is a `ProjectableRecordCursorFactory`, you now:
- Build `queryMeta` via `buildQueryMetadata(...)` using `model.getTopDownColumns()`.
- Call `factory.setQueryProjectedMetadata(queryMeta)` so subsequent `getMetadata()` calls see only projected columns.

This assumes that `model.getTopDownColumns()` always contains every column referenced anywhere in the query against this function (SELECT list, WHERE, GROUP BY, ORDER BY, JOIN conditions, etc.). If any column appears only in predicates or ordering and is not present in `topDownColumns`, then after setting the projected metadata:
- Later compilation using `factory.getMetadata()` (e.g., in `generateFilter`, group-by, order-by) could fail with "invalid column" or mis-plan the query, since those columns would no longer exist in the metadata view.
- The underlying function implementation might also not project all columns it needs to evaluate such expressions.

Please double-check that `QueryModel.getTopDownColumns()` is guaranteed to include all such referenced columns for cursor functions like `read_parquet()`. If not, you likely need to extend the projection input set here (or reuse an existing "all referenced columns" collection) before pushing it into `ProjectableRecordCursorFactory`.
6274-6291: Refactoring table-query metadata construction into `buildQueryMetadata` looks correct.

The new use of `buildQueryMetadata(...)` in `generateTableQuery0`:
- Replaces the previous inline construction of `queryMeta`, `columnIndexes`, and `columnSizeShifts` with a shared helper.
- Preserves timestamp handling via `readerTimestampIndex = getTimestampIndex(model, metadata)` and `requiresTimestamp = joinsRequiringTimestamp[model.getJoinType()]`.
- Continues to feed the resulting `columnIndexes`/`columnSizeShifts` into all downstream factories (latest-by, filter/on-values, page-frame, etc.) in the same way.

Once the `buildQueryMetadata` clearing issue is fixed as suggested above, this refactor improves cohesion without changing behavior.

core/src/main/java/io/questdb/griffin/engine/functions/table/ReadParquetPageFrameRecordCursorFactory.java (1)
86-91: Duplicated lazy initialization for `pageFrameCursor`.

The lazy initialization logic for `pageFrameCursor` is duplicated between `getCursor` (lines 70-72) and `getPageFrameCursor` (lines 86-88). This is acceptable since both entry points need the cursor, but consider extracting a helper method if this pattern expands. 🔎 Optional: extract a helper method:

```java
private ReadParquetPageFrameCursor getOrCreatePageFrameCursor(SqlExecutionContext executionContext) {
    if (this.pageFrameCursor == null) {
        this.pageFrameCursor = new ReadParquetPageFrameCursor(
                executionContext.getCairoEngine().getConfiguration().getFilesFacade(),
                getMetadata()
        );
    }
    return this.pageFrameCursor;
}
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
- core/src/main/java/io/questdb/cairo/ProjectableRecordCursorFactory.java (1 hunks)
- core/src/main/java/io/questdb/cairo/sql/PageFrameMemoryPool.java (1 hunks)
- core/src/main/java/io/questdb/griffin/SqlCodeGenerator.java (5 hunks)
- core/src/main/java/io/questdb/griffin/engine/functions/table/ReadParquetFunctionFactory.java (1 hunks)
- core/src/main/java/io/questdb/griffin/engine/functions/table/ReadParquetPageFrameCursor.java (2 hunks)
- core/src/main/java/io/questdb/griffin/engine/functions/table/ReadParquetPageFrameRecordCursorFactory.java (3 hunks)
- core/src/main/java/io/questdb/griffin/engine/functions/table/ReadParquetRecordCursor.java (3 hunks)
- core/src/main/java/io/questdb/griffin/engine/functions/table/ReadParquetRecordCursorFactory.java (2 hunks)
- core/src/main/java/io/questdb/griffin/engine/table/parquet/PartitionDecoder.java (1 hunks)
- core/src/test/java/io/questdb/test/griffin/ExplainPlanTest.java (3 hunks)
- core/src/test/java/io/questdb/test/griffin/engine/table/parquet/ReadParquetFunctionTest.java (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (7)
core/src/main/java/io/questdb/griffin/engine/table/parquet/PartitionDecoder.java (1)
- core/src/main/java/io/questdb/std/Chars.java (1): Chars (43-1646)

core/src/main/java/io/questdb/griffin/engine/functions/table/ReadParquetPageFrameCursor.java (1)
- core/src/main/java/io/questdb/griffin/engine/functions/table/ReadParquetRecordCursor.java (1): ReadParquetRecordCursor (68-528)

core/src/main/java/io/questdb/griffin/engine/functions/table/ReadParquetPageFrameRecordCursorFactory.java (3)
- core/src/main/java/io/questdb/cairo/ProjectableRecordCursorFactory.java (1): ProjectableRecordCursorFactory (30-58)
- core/src/main/java/io/questdb/griffin/engine/table/PageFrameRecordCursorImpl.java (1): PageFrameRecordCursorImpl (43-218)
- core/src/main/java/io/questdb/griffin/engine/table/PageFrameRowCursorFactory.java (1): PageFrameRowCursorFactory (34-75)

core/src/main/java/io/questdb/griffin/engine/functions/table/ReadParquetRecordCursor.java (2)
- core/src/main/java/io/questdb/std/IntList.java (1): IntList (34-410)
- core/src/main/java/io/questdb/cairo/CairoException.java (1): CairoException (39-429)

core/src/test/java/io/questdb/test/griffin/engine/table/parquet/ReadParquetFunctionTest.java (1)
- core/src/main/java/io/questdb/griffin/engine/table/parquet/PartitionEncoder.java (1): PartitionEncoder (39-213)

core/src/main/java/io/questdb/griffin/engine/functions/table/ReadParquetRecordCursorFactory.java (2)
- core/src/main/java/io/questdb/cairo/ProjectableRecordCursorFactory.java (1): ProjectableRecordCursorFactory (30-58)
- core/src/main/java/io/questdb/std/str/Path.java (1): Path (51-533)

core/src/main/java/io/questdb/griffin/SqlCodeGenerator.java (1)
- core/src/main/java/io/questdb/cairo/ProjectableRecordCursorFactory.java (1): ProjectableRecordCursorFactory (30-58)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (34)
- GitHub Check: New pull request (SelfHosted Running tests with cover on linux-other)
- GitHub Check: New pull request (SelfHosted Running tests with cover on linux-pgwire)
- GitHub Check: New pull request (SelfHosted Running tests with cover on linux-cairo-sub)
- GitHub Check: New pull request (SelfHosted Running tests with cover on linux-cairo-root)
- GitHub Check: New pull request (SelfHosted Running tests with cover on linux-fuzz2)
- GitHub Check: New pull request (SelfHosted Running tests with cover on linux-fuzz1)
- GitHub Check: New pull request (SelfHosted Running tests with cover on linux-griffin-sub)
- GitHub Check: New pull request (SelfHosted Running tests with cover on linux-griffin-root)
- GitHub Check: New pull request (Rust Test and Lint on linux-jdk17)
- GitHub Check: New pull request (SelfHosted Other tests on linux-x64-zfs)
- GitHub Check: New pull request (Trigger Enterprise CI Trigger Enterprise Pipeline)
- GitHub Check: New pull request (SelfHosted Other tests on linux-x86-graal)
- GitHub Check: New pull request (SelfHosted Other tests on linux-arm64)
- GitHub Check: New pull request (Hosted Running tests on windows-other-2)
- GitHub Check: New pull request (Hosted Running tests on windows-other-1)
- GitHub Check: New pull request (Hosted Running tests on windows-pgwire)
- GitHub Check: New pull request (Hosted Running tests on windows-cairo-2)
- GitHub Check: New pull request (Hosted Running tests on windows-cairo-1)
- GitHub Check: New pull request (Hosted Running tests on windows-fuzz2)
- GitHub Check: New pull request (Hosted Running tests on windows-fuzz1)
- GitHub Check: New pull request (Hosted Running tests on windows-griffin-sub)
- GitHub Check: New pull request (Hosted Running tests on windows-griffin-base)
- GitHub Check: New pull request (Hosted Running tests on mac-other)
- GitHub Check: New pull request (Hosted Running tests on mac-pgwire)
- GitHub Check: New pull request (Hosted Running tests on mac-cairo-fuzz)
- GitHub Check: New pull request (Hosted Running tests on mac-cairo)
- GitHub Check: New pull request (SelfHosted Griffin tests on linux-x64-zfs)
- GitHub Check: New pull request (SelfHosted Griffin tests on linux-arm64)
- GitHub Check: New pull request (SelfHosted Griffin tests on linux-x86-graal)
- GitHub Check: New pull request (Hosted Running tests on mac-griffin)
- GitHub Check: New pull request (SelfHosted Cairo tests on linux-x64-zfs)
- GitHub Check: New pull request (SelfHosted Cairo tests on linux-arm64)
- GitHub Check: New pull request (SelfHosted Cairo tests on linux-x86-graal)
- GitHub Check: New pull request (Check Changes Check changes)
🔇 Additional comments (17)
core/src/main/java/io/questdb/cairo/sql/PageFrameMemoryPool.java (1)

392-418: LGTM — decode correctly handles [parquet_index, column_type] pairs.

The iteration at line 406 correctly divides by 2 to account for the paired structure, and the remapping logic using `fromParquetColumnIndexes` properly translates parquet column order back to query column order. The conditional aux pointer handling for variable-size types is also correct.

core/src/main/java/io/questdb/cairo/ProjectableRecordCursorFactory.java (1)
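The paired [parquet_index, column_type] layout can be illustrated with a self-contained sketch. This uses plain `java.util` collections and illustrative names, not the QuestDB `IntList` API: two consecutive slots per column, iterated with a stride of 2, as in the decode loop the comment describes.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: (parquetIndex, columnType) pairs stored in one flat
// int list, so the logical column count is size() / 2.
public class PairedColumnList {
    public static int[] parquetIndexes(List<Integer> parquetColumns) {
        int columnCount = parquetColumns.size() / 2; // two slots per column
        int[] indexes = new int[columnCount];
        for (int i = 0; i < columnCount; i++) {
            indexes[i] = parquetColumns.get(2 * i); // slot 0: parquet index
            // slot 2 * i + 1 holds the column type, not needed here
        }
        return indexes;
    }

    public static void main(String[] args) {
        List<Integer> cols = new ArrayList<>();
        // column 0 -> parquet index 3, type 5; column 1 -> parquet index 0, type 11
        cols.add(3); cols.add(5);
        cols.add(0); cols.add(11);
        for (int idx : parquetIndexes(cols)) {
            System.out.println(idx);
        }
    }
}
```

With projection, the parquet indexes can be non-consecutive (3 then 0 above), which is why the reverse mapping back to query column order matters.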
55-57: No resource cleanup needed for `queryProjectMetadata`.

The `_close()` hook doesn't clean up `queryProjectMetadata` or the base `metadata` field because `RecordMetadata` is a lightweight metadata interface without resources requiring explicit cleanup. It holds only metadata information (column names, types, indexing) and is safe to leave unreleased.

core/src/main/java/io/questdb/griffin/SqlCodeGenerator.java (1)
42-42: Import of `ProjectableRecordCursorFactory` is appropriate.

The new import is consistent with the use of `ProjectableRecordCursorFactory` in `generateFunctionQuery` and looks correct.

core/src/main/java/io/questdb/griffin/engine/table/parquet/PartitionDecoder.java (1)
289-297: LGTM! Clean reverse-lookup implementation.

The linear search through column names is appropriate for this use case. Column projection typically happens once during query planning, and Parquet column counts are manageable, making O(n) lookup acceptable here.
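The linear name-to-index reverse lookup described above amounts to the following minimal sketch (illustrative names, not the actual `PartitionDecoder` code):

```java
// Illustrative O(n) reverse lookup from a column name to its positional
// index. Acceptable because it runs once per column at planning time,
// not per row at read time.
public class ColumnNameLookup {
    public static int columnIndexOf(String[] columnNames, String name) {
        for (int i = 0; i < columnNames.length; i++) {
            if (columnNames[i].equals(name)) {
                return i;
            }
        }
        return -1; // column not present in the parquet schema
    }

    public static void main(String[] args) {
        String[] schema = {"a_long", "a_str", "ts"};
        System.out.println(columnIndexOf(schema, "a_str"));  // found at index 1
        System.out.println(columnIndexOf(schema, "absent")); // -1: not found
    }
}
```

A hash map would make each lookup O(1), but building it costs O(n) anyway, so for a handful of planning-time lookups the linear scan is the simpler choice.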
core/src/test/java/io/questdb/test/griffin/ExplainPlanTest.java (1)
7043-7071: Parquet explain plan expectations correctly reflect projected columns.

The new `columns:` lines under `parquet page frame scan` in the three `read_parquet` cases match the query's actual column usage (all columns for `SELECT *`, only `a_long` for `avg(a_long)`, and `a_str`, `a_long` for `a_str, max(a_long)`). Formatting is consistent with other plan expectations, so these updates look good.

core/src/main/java/io/questdb/griffin/engine/functions/table/ReadParquetPageFrameCursor.java (1)
135-139: LGTM — clean integration of `canProjectMetadata`.

The switch from `metadataHasChanged` to `canProjectMetadata` centralizes the projection validation logic. Clearing `columnIndexes` before the call and letting `canProjectMetadata` populate it as an output parameter is a sound approach.

core/src/main/java/io/questdb/griffin/engine/functions/table/ReadParquetRecordCursorFactory.java (2)
51-53: LGTM — lazy initialization pattern is well-implemented.

The lazy cursor creation defers `FilesFacade` acquisition to execution time, correctly retrieving it from the execution context. This aligns with the removal of `FilesFacade` from the constructor signature and is consistent with the parallel factory's approach.
70-73: LGTM — resource cleanup is correct.

`Misc.free` returns `null` on success, properly nulling the fields to prevent double-free scenarios.

core/src/main/java/io/questdb/griffin/engine/functions/table/ReadParquetFunctionFactory.java (1)
94-97: LGTM — simplified factory construction.

The removal of `CairoConfiguration` and `FilesFacade` parameters from the call sites is consistent with the lazy initialization pattern introduced in the cursor factories. These dependencies are now obtained from `SqlExecutionContext` at cursor creation time.

core/src/main/java/io/questdb/griffin/engine/functions/table/ReadParquetPageFrameRecordCursorFactory.java (2)
61-72: LGTM — lazy initialization correctly implemented.

The cursor and `pageFrameCursor` are created on-demand using the execution context. The `PageFrameRecordCursorImpl` is correctly configured with `entityCursor=true` (appropriate for full parquet reads) and `filter=null`.
110-114: LGTM — resource cleanup is complete.

All three resources (`cursor`, `pageFrameCursor`, `path`) are properly freed. Note that `cursor` and `pageFrameCursor` don't need the assignment pattern since they're not accessed after close, but consistency with the `path` pattern is fine.

core/src/test/java/io/questdb/test/griffin/engine/table/parquet/ReadParquetFunctionTest.java (4)
100-114: Excellent test documentation and validation.

The comment clearly explains that without projection pushdown, a `SelectedRecord` operator would appear in the plan. The plan assertions correctly verify the projection is pushed down for both parallel and non-parallel modes.
119-154: Good coverage for projection with column reordering.

This test validates that columns can be projected in a different order than declared in the Parquet file, which is important for the pushdown optimization.
193-229: Key test for expression handling.

This test correctly verifies that when expressions (like `a_long + 1`) are used, a `VirtualRecord` layer is still present in the plan while the underlying parquet scan only reads the required column (`a_long`). This confirms the projection pushdown works correctly with expressions.
413-413: Minor schema change for test clarity.

Renaming `ts` to `ts1` in table `y` makes the schema difference between tables `x` and `y` more explicit, which better tests the `TableReferenceOutOfDateException` scenario.

core/src/main/java/io/questdb/griffin/engine/functions/table/ReadParquetRecordCursor.java (2)
47-47: LGTM!

The `IntList` import is correctly added to support the new `canProjectMetadata` method signature.
194-202: LGTM!

The column projection initialization logic is correct:
- Clears and repopulates the columns mapping based on current metadata
- Properly sets capacity for pairs (parquetIndex, actualType)
- Throws `TableReferenceOutOfDateException` when projection fails, triggering query recompilation
…rements

- Add assertions to verify parquet decoder is initialized in all partition frame cursors (Fwd/Bwd, Full/Interval)
- Document PartitionDecoder.of(other) lifetime requirements: the source decoder must remain valid while the copy is in use

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <[email protected]>
[PR Coverage check] 😍 pass: 277 / 289 (95.85%)
Part of #6369.
Push column projection down to parquet decoder, reading only required columns instead of all columns.
Also optimized multi-threaded reads by sharing Parquet metadata across threads (parsed once per file).
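One common way to get "parsed once per file, shared across threads" is a concurrent memoizing map. The sketch below is a hedged illustration of that idea under made-up names, not the actual QuestDB mechanism: `computeIfAbsent` guarantees the expensive parse (stubbed here as string construction) runs at most once per key even under concurrent callers.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hedged sketch (not the actual QuestDB implementation): memoize per-file
// metadata in a concurrent map so parsing happens once per file even when
// many reader threads request it.
public class SharedParquetMetadata {
    static final AtomicInteger parseCount = new AtomicInteger();
    static final Map<String, String> cache = new ConcurrentHashMap<>();

    static String metadataFor(String file) {
        return cache.computeIfAbsent(file, f -> {
            parseCount.incrementAndGet(); // the "parse", done once per file
            return "metadata-of-" + f;    // stands in for parsed metadata
        });
    }

    public static void main(String[] args) {
        String first = metadataFor("hits.parquet");
        String second = metadataFor("hits.parquet"); // served from the cache
        System.out.println(first.equals(second));
    }
}
```

Whatever the real mechanism, the payoff is the same: per-thread decoders no longer each pay the metadata-parsing cost for the same file.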
Q1 Clickbench and hits.parquet on my dev environment (Mac M4)
Other `CursorFunction` cursors don't need this; they read in-memory data, where the existing optimizer's `SelectedRecord` approach is already efficient.