perf(sql): avoid temp table materialization for Parquet export with computed columns #6795
bluestreak01 merged 65 commits into master
Conversation
Replace the fragile null-check and materializer-based inference of the Parquet export mode with explicit enum dispatch in both the HTTP and SQL export paths.

- Add a TABLE_READER enum value for direct table reference exports (COPY tablename TO), eliminating null as a sentinel
- Store ParquetExportMode on HTTPSerialParquetExporter, set via setExportMode() before process(), and switch on it explicitly
- Replace the if/else chain in SQLSerialParquetExporter.process() with an exhaustive switch
- Remove the isPageFrameBacked field/method from RecordToColumnBuffers (it was only used for the fragile inference)
- Remove @Nullable from CopyExportRequestTask.exportMode since all paths now set a concrete enum value

Co-Authored-By: Claude Opus 4.6 <[email protected]>
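The shape of that change can be sketched as follows. This is a minimal, hypothetical illustration of the enum-dispatch pattern, not QuestDB's actual API: an explicit mode enum replaces a null sentinel, and an exhaustive switch replaces if/else inference.

```java
// Hypothetical sketch: names are illustrative, not QuestDB's real classes.
public class ExportDispatchSketch {
    public enum Mode { TABLE_READER, TEMP_TABLE, DIRECT_PAGE_FRAME, PAGE_FRAME_BACKED, CURSOR_BASED }

    // every caller sets a concrete mode before dispatch; null never occurs
    public static String describe(Mode mode) {
        switch (mode) {
            case TABLE_READER:      return "direct table reference export";
            case TEMP_TABLE:        return "re-partition via temp table";
            case DIRECT_PAGE_FRAME: return "stream page frames directly";
            case PAGE_FRAME_BACKED: return "hybrid page-frame export";
            case CURSOR_BASED:      return "row-by-row cursor export";
            default: throw new IllegalStateException("unhandled mode: " + mode);
        }
    }

    public static void main(String[] args) {
        for (Mode m : Mode.values()) {
            System.out.println(m + " -> " + describe(m));
        }
    }
}
```

The default branch throws rather than silently falling through, so a newly added mode fails fast instead of being mis-inferred.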
Move ownership transfer of pageFrameCursor, materializer, and materializerColumnData from state to exporter to after setUp() and setupXxxExport() succeed. Previously, state fields were nulled before these calls, so if setUp() threw (e.g., from the native createStreamingParquetWriter call), the resources would be orphaned — state couldn't free them (fields already null) and the exporter hadn't received them yet. Also add a comment in CopyExportRequestTask.clear() documenting why pageFrameCursor is nulled without Misc.free(): ownership belongs to HTTPSerialParquetExporter or ExportQueryProcessorState. Co-Authored-By: Claude Opus 4.6 <[email protected]>
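The ordering rule here can be illustrated with a small sketch (all names hypothetical): the previous owner keeps its references until the fallible setup call succeeds, so a throw leaves cleanup responsibility unambiguous.

```java
// Sketch of transfer-after-success ownership: if setUp() throws, the state
// still holds (and can free) the resource; it is nulled only once the
// exporter has actually taken it over.
public class OwnershipTransferSketch {
    static class Resource {
        boolean freed;
        void free() { freed = true; }
    }

    static class State {
        Resource cursor = new Resource();
        void clear() { if (cursor != null) { cursor.free(); cursor = null; } }
    }

    static class Exporter {
        Resource cursor;
        void setUp(boolean fail) {
            if (fail) throw new RuntimeException("native writer failed");
        }
    }

    // returns true on success; on failure the state frees its own resource
    public static boolean export(State state, Exporter exporter, boolean fail) {
        try {
            exporter.setUp(fail);           // may throw
            exporter.cursor = state.cursor; // ownership moves only on success
            state.cursor = null;
        } catch (RuntimeException e) {
            state.clear();                  // state still owns the resource
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        State s = new State();
        Resource r = s.cursor;
        export(s, new Exporter(), true);
        System.out.println("freed on failure: " + r.freed);
    }
}
```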
For streaming export modes (DIRECT_PAGE_FRAME, PAGE_FRAME_BACKED, CURSOR_BASED), the SELECT query was compiled twice: once in CopyExportFactory.getCursor() to call determineExportMode(), and again in SQLSerialParquetExporter.processStreamingExport(). Pass the RecordCursorFactory from the first compilation through the ring queue task to the second location, removing the redundant SqlCompiler.compile() call. The factory is transferred through the ring queue using a null-transfer pattern to prevent double-free on task.clear(). Also fix a latent bug in the partition-mismatch path: when COPY specifies a different PARTITION BY than the table, determineExportMode() was skipped, leaving exportMode at its initial TABLE_READER value. The else branch now explicitly sets TEMP_TABLE for re-partitioning. Rename the two RecordCursorFactory fields in CopyExportRequestTask from "factory"/"compiledFactory" to "tempTableFactory"/"selectFactory" to clarify which export path each serves. Co-Authored-By: Claude Opus 4.6 <[email protected]>
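The null-transfer pattern mentioned above can be sketched in a few lines (hypothetical names): the consumer takes the reference out of the task and nulls the slot, so a later task.clear() cannot free the same object twice.

```java
// Sketch of null-transfer through a queue task to prevent double-free.
public class NullTransferSketch {
    static class Factory {
        int closeCount;
        void close() { closeCount++; }
    }

    static class Task {
        Factory selectFactory;
        void clear() {
            if (selectFactory != null) {
                selectFactory.close();
                selectFactory = null;
            }
        }
    }

    // consumer side: take ownership, then clear the slot in the task
    public static Factory takeOwnership(Task task) {
        Factory f = task.selectFactory;
        task.selectFactory = null; // task no longer owns it
        return f;
    }

    public static void main(String[] args) {
        Task t = new Task();
        t.selectFactory = new Factory();
        Factory f = takeOwnership(t);
        t.clear();   // no-op for the factory: the slot is already null
        f.close();   // exactly one close, by the new owner
        System.out.println("closeCount = " + f.closeCount);
    }
}
```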
The two setUp() overloads duplicated column name/metadata serialization and the createStreamingParquetWriter() call. The 1-arg overload now delegates to the 3-arg version with null pfc and baseColumnMap, and the 3-arg version guards the symbol table handling with a null check on baseColumnMap. Co-Authored-By: Claude Opus 4.6 <[email protected]>
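A minimal sketch of that de-duplication (signatures are hypothetical, not the real setUp() API): the short overload delegates with nulls, and the long overload guards the optional work with a null check.

```java
// Sketch of collapsing duplicated overloads by delegation.
public class OverloadDelegationSketch {
    public static String setUp(String name) {
        return setUp(name, null); // delegate instead of duplicating logic
    }

    public static String setUp(String name, int[] baseColumnMap) {
        StringBuilder sb = new StringBuilder("writer(").append(name).append(")");
        if (baseColumnMap != null) { // optional symbol-table style handling
            sb.append(" columns=").append(baseColumnMap.length);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(setUp("t"));
        System.out.println(setUp("t", new int[]{0, 1, 2}));
    }
}
```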
RecordToColumnBuffers.resetBuffers() used to truncate native memory buffers in place between page frames. Rust's pending_partitions still held raw pointers to the old data, so when the next frame overwrote the same addresses, Rust read corrupted data and panicked with "range end index N out of range for slice of length M". The fix replaces truncate-in-place with pin-and-replace: when buffers contain data, they are moved to pinned lists and fresh buffers are allocated. After a row group flush, an inFlightRows check determines when all pinned partitions have been drained by Rust, and the pinned buffers are freed. This applies to all four export loops (HTTP and SQL, PAGE_FRAME_BACKED and CURSOR_BASED). Tests are extended to cover var-size computed columns (STRING, VARCHAR, ARRAY) in multi-row-group scenarios across all export paths, and descending ORDER BY is forced to CURSOR_BASED mode since page frame cursors cannot reverse row order within partitions. Co-Authored-By: Claude Opus 4.6 <[email protected]>
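The pin-and-replace idea can be sketched as follows (illustrative types, not the real native-memory buffers): instead of truncating a buffer in place while the consumer may still hold pointers into it, the filled buffer is moved to a pinned list and a fresh one is used for the next frame; pinned buffers are released only once the in-flight counter shows they have been drained.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of pin-and-replace buffer management between frames.
public class PinAndReplaceSketch {
    static class Buffer { final byte[] data = new byte[16]; }

    Buffer current = new Buffer();
    final List<Buffer> pinned = new ArrayList<>();
    long inFlightRows;

    // called between frames instead of truncate-in-place
    public void resetBuffers() {
        pinned.add(current);    // keep old addresses alive for the consumer
        current = new Buffer(); // next frame writes to fresh memory
    }

    // called after a row group flush
    public void onRowGroupFlushed() {
        if (inFlightRows == 0) {
            pinned.clear();     // consumer has drained; safe to release
        }
    }

    public static void main(String[] args) {
        PinAndReplaceSketch s = new PinAndReplaceSketch();
        s.resetBuffers();
        s.resetBuffers();
        System.out.println("pinned=" + s.pinned.size());
        s.onRowGroupFlushed();
        System.out.println("pinned after drain=" + s.pinned.size());
    }
}
```

In the real fix the "consumer" is the Rust side holding raw pointers; the sketch only shows the lifetime discipline, not the native interop.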
- Fix createOp use-after-free in CopyExportRequestJob: transfer ownership out of the queue task before clear(), mirroring the existing selectFactory pattern.
- Fix createOp leak in CopyExportFactory when the export queue is full: hoist the variable above the try block and free it in the outer finally.
- Fix RecordToColumnBuffers leak in ExportQueryProcessor CURSOR_BASED setup: wrap in try-catch matching the PAGE_FRAME_BACKED path.
- Fix createParquetOp not freed in ExportQueryProcessorState.close().
- Fix createOp, tempTableFactory, and pageFrameCursor not freed in CopyExportRequestTask.close().
- Fix factory not freed on PeerIsSlowToReadException in HTTPSerialParquetExporter.process().

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Add a dense int[] of computed column indices to avoid scanning all outputColumnCount columns in hot loops that only need to visit computed columns. For wide tables with few computed columns (e.g., 100 columns, 2 computed), the inner loop in materializeComputedColumns() was wasting ~100M branch iterations per frame doing no work. The computedColumnIndices array is built once during setUp() and setUpPageFrameBacked(), then used in materializeComputedColumns() and both loops in resetBuffers() (buffer pinning/reallocation and STRING offset initialization). Co-Authored-By: Claude Opus 4.6 <[email protected]>
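The technique generalizes: instead of testing every output column per row, collect the few computed column indices once at setup and iterate only those in the hot loop. A minimal sketch with illustrative names:

```java
// Sketch of the dense-index optimization for sparse computed columns.
public class DenseIndexSketch {
    // built once at setup time
    public static int[] buildComputedIndices(boolean[] isComputed) {
        int n = 0;
        for (boolean b : isComputed) if (b) n++;
        int[] indices = new int[n];
        int j = 0;
        for (int i = 0; i < isComputed.length; i++) {
            if (isComputed[i]) indices[j++] = i;
        }
        return indices;
    }

    // hot loop visits only computed columns, not all outputColumnCount
    public static long materialize(int[] computedIndices, long[] work, long rowCount) {
        long visits = 0;
        for (long row = 0; row < rowCount; row++) {
            for (int col : computedIndices) {
                work[col]++; // stand-in for evaluating the computed column
                visits++;
            }
        }
        return visits;
    }

    public static void main(String[] args) {
        boolean[] isComputed = new boolean[100];
        isComputed[3] = isComputed[97] = true; // 100 columns, 2 computed
        int[] idx = buildComputedIndices(isComputed);
        System.out.println(materialize(idx, new long[100], 1000)); // 2 columns x 1000 rows
    }
}
```

With 100 columns and 2 computed, the inner loop does 2 iterations per row instead of 100 branch tests, which is where the claimed savings come from.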
Replace per-row metadata and function lookups in materializeComputedColumns() and buildColumnDataFromCursor() with flat arrays populated once at setup time:

- computedOutputTypes[]: column types from adjustedMetadata
- computedFunctions[]: Function instances (page-frame path)
- computedIsSymbolToString[]: symbol-to-string conversion flags

This eliminates three repeated indirections per row per column: adjustedMetadata.getColumnType() (3 dereferences through ObjList and TableColumnMetadata), functions.getQuick() (2 dereferences), and func.getType() (virtual dispatch). The same optimization is also applied to resetBuffers() where it re-allocates after pinning.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
The three HTTP streaming export methods were logging at INFO level on every frame/batch. For tables with many partitions this produces hundreds of log lines per export. Demote the six per-frame and resume progress messages to DEBUG, keeping INFO only for start/completion messages. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Instead of freeing MemoryCARWImpl instances when pinned buffers are released after a row group flush, return them to a reuse pool. Subsequent frames draw from the pool via obtainBuffer(), which truncates and reuses an existing buffer or allocates a new one only when the pool is empty. This eliminates short-lived Java object allocation in steady state. For a 100-partition export with 5 computed columns, it avoids ~1000 MemoryCARWImpl allocations. The pool is naturally bounded by peak usage (framesPerRowGroup * buffersPerFrame) and properly freed in clear()/close(). Co-Authored-By: Claude Opus 4.6 <[email protected]>
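A pool of this shape can be sketched in a few lines (illustrative types, not QuestDB's MemoryCARWImpl API): released buffers go back onto a deque, and obtainBuffer() prefers a pooled instance over allocating.

```java
import java.util.ArrayDeque;

// Sketch of a buffer reuse pool bounded by peak concurrent usage.
public class BufferPoolSketch {
    static class Buffer {
        final StringBuilder mem = new StringBuilder();
        void truncate() { mem.setLength(0); } // reset, keep backing memory
    }

    private final ArrayDeque<Buffer> pool = new ArrayDeque<>();
    int allocations;

    public Buffer obtainBuffer() {
        Buffer b = pool.pollLast();
        if (b == null) {
            allocations++;   // allocate only when the pool is empty
            return new Buffer();
        }
        b.truncate();        // reuse an existing buffer
        return b;
    }

    public void release(Buffer b) {
        pool.addLast(b);     // instead of freeing after a row group flush
    }

    public static void main(String[] args) {
        BufferPoolSketch p = new BufferPoolSketch();
        Buffer b = p.obtainBuffer();
        p.release(b);
        p.obtainBuffer();    // draws b back out of the pool
        System.out.println("allocations=" + p.allocations);
    }
}
```

Because buffers are only added on release, the pool never grows past the peak number of buffers in flight at once, which matches the "naturally bounded" claim above.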
Pre-allocate RecordToColumnBuffers and DirectLongList as final fields instead of creating them fresh on every export request. HTTP path (ExportQueryProcessorState): materializer and materializerColumnData are now final fields initialized once. clear() calls .clear() on them; close() calls Misc.free(). ExportQueryProcessor no longer wraps setUp() in try-catch with fresh allocation. copyQueryToParquetFile() no longer nulls out the materializer fields since ownership stays with the state. HTTPSerialParquetExporter now borrows these references instead of owning them, so clearExportResources() nulls rather than freeing them. COPY path (SQLSerialParquetExporter): streamBuffers and streamColumnData are new final fields reused across exports. processStreamingExport() calls clear() instead of using try-with-resources, and processCursorExport() uses the fields directly instead of receiving them as parameters. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Replace the per-export lambda and long[1] array with a reusable static nested FileWriteCallback class. A single instance is held in a final field and re-initialized via of(ff, fd) for each export, eliminating the heap allocation and lambda capture per row group. Co-Authored-By: Claude Opus 4.6 <[email protected]>
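The re-initializable callback pattern looks roughly like this (names are hypothetical, modeled on the description): one instance lives in a final field and of(...) resets its state per export, so no lambda or long[1] holder is allocated.

```java
// Sketch of a reusable nested callback replacing a per-export lambda.
public class ReusableCallbackSketch {
    interface WriteCallback { void onWrite(long bytes); }

    static final class FileWriteCallback implements WriteCallback {
        private long fd;
        private long totalBytes;

        FileWriteCallback of(long fd) { // re-initialize instead of reallocating
            this.fd = fd;
            this.totalBytes = 0;
            return this;
        }

        @Override
        public void onWrite(long bytes) { totalBytes += bytes; }

        long totalBytes() { return totalBytes; }
    }

    // single instance held for the lifetime of the exporter
    private final FileWriteCallback writeCallback = new FileWriteCallback();

    public long export(long fd, long[] rowGroupSizes) {
        WriteCallback cb = writeCallback.of(fd);
        for (long size : rowGroupSizes) {
            cb.onWrite(size); // no capture, no boxing, no fresh object
        }
        return writeCallback.totalBytes();
    }

    public static void main(String[] args) {
        ReusableCallbackSketch s = new ReusableCallbackSketch();
        System.out.println(s.export(42, new long[]{10, 20, 30}));
    }
}
```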
Replace six per-request array allocations in RecordToColumnBuffers with pre-allocated final collection fields (IntList, BoolList, ObjList). The setUp() methods now use setPos()/setQuick() to resize and populate, and clear() resets size without freeing backing arrays. Also convert the local int[] identityMap in SQLSerialParquetExporter to a reusable IntList field, and update CopyExportRequestTask.setUp() to accept IntList instead of int[]. Co-Authored-By: Claude Opus 4.6 <[email protected]>
The clear() method nulled pageFrameCursor unconditionally before the tempTableFactory check, so the temp-table path (which owns the cursor) would try to free an already-null reference, leaking the native resource. Move the ownership check first so the cursor is properly freed when tempTableFactory is non-null, and only null-without-freeing in the non-owning paths. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Add comments at both PAGE_FRAME_BACKED cast sites in ExportQueryProcessor explaining why the unchecked cast to VirtualRecordCursorFactory is safe: determineExportMode() only returns PAGE_FRAME_BACKED when the unwrapped factory is already verified to be an instanceof VirtualRecordCursorFactory. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Add a class-level javadoc and per-value comments explaining when each export mode is used and what it implies for how data flows into the Parquet writer. Co-Authored-By: Claude Opus 4.6 <[email protected]>
processCursorStreamExport() and processPageFrameStreamExport() in HTTPSerialParquetExporter shared nearly identical loop bodies (writeHybridFrame, in-flight tracking, buffer pinning). Merge them into processHybridStreamExport() with a boolean branch for the data-source difference and partition release. Apply the same merge in SQLSerialParquetExporter: the inline PAGE_FRAME_BACKED block and processCursorExport() become a single processHybridExport() method. Net -43 lines with no behavioral change. Co-Authored-By: Claude Opus 4.6 <[email protected]>
The old name described the mechanism (records → column buffers) rather than the purpose. The new name reflects the class's key design: hybrid export where pass-through columns use zero-copy page frame pointers while computed columns are materialized into buffers. Updated all 7 referencing files. Pure rename, no logic changes. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Extract the inner frame-drain loop that was duplicated between HTTPSerialParquetExporter.processHybridStreamExport() and SQLSerialParquetExporter.processHybridExport() into a shared protected method drainHybridFrames() on the base class. The two loops were identical in structure — fetch frame, materialize columns, write hybrid frame, track in-flight rows, release pinned buffers after row group flush — but differed only in which fields they operated on. The new method takes the varying pieces (exporter, materializer, column data list, pfc/cursor) as parameters. The pfc != null check replaces the former isPageFrameBacked boolean with no change in hot-path cost. Co-Authored-By: Claude Opus 4.6 <[email protected]>
When setUpPageFrameBacked() throws SqlException (via Function.init()), the exception escapes the inner try block and is caught by the outer SqlException handler. That handler sends the error response but does not free the page frame cursor already assigned to state.pageFrameCursor, leaking it until the HTTP connection resets. Add state.clear() after syntaxError() in the outer SqlException catch block to immediately release the cursor, materializer, and factory resources. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Move the createOp assignment before the method calls on the impl object in CopyExportFactory so the catch blocks can free it if setTableKind(), setBatchSize(), or validateAndUpdateMetadataFromSelect() throws. Remove the redundant selectFactory assign-back pattern in the same catch blocks, since we throw immediately after. Remove the no-op Misc.free(factory) from the PeerIsSlowToReadException catch in HTTPSerialParquetExporter: factory is provably null at that point because ownership was already transferred to the task. Add @Nullable to the setWriteCallback() parameter in CopyExportRequestTask to match the @Nullable field it sets. Remove the redundant throws SqlException from processStreamingExport() in SQLSerialParquetExporter, since all SqlExceptions are caught inside the method and wrapped in CopyExportException.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
The test expected only "timeout, query aborted" but there are two racing code paths when the circuit breaker trips during PAGE_FRAME_BACKED parquet export: the page-frame factory throws "timeout, query aborted" while the HTTP exporter throws "cancelled by user". Accept either message as a valid outcome. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Eliminate the intermediate `impl` variable for CreateTableOperationImpl by assigning directly to `createOp` (narrowed from the interface type). This resolves the AutoCloseableResource warning about a resource not being in a try-with-resources block. Remove redundant Misc.free() calls from the inner catch blocks, since the outer finally block already handles freeing both `selectFactory` and `createOp` on all exception paths. This also resolves the ConstantValue warning where `selectFactory` was always null at the point of the inner catch. Co-Authored-By: Claude Opus 4.6 <[email protected]>
The test expected only "peer disconnect" when the server aborts mid-stream due to timeout, but the client can also receive a "malformed chunk" error when chunked transfer encoding is interrupted. Accept both error messages as valid outcomes. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Apply the same fix from aa0850a to the three remaining test methods that assert on HttpClientException messages during server-side abort. When chunked transfer encoding is interrupted, the client may receive either "peer disconnect" or "malformed chunk", so both must be accepted. Affected methods: - testParquetExportCancel - testParquetExportTimeout - testParquetExportTimeoutNodelay Co-Authored-By: Claude Opus 4.6 <[email protected]>
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
core/src/test/java/io/questdb/test/griffin/CopyExportTest.java (2)
462-503: ⚠️ Potential issue | 🟡 Minor
Test name no longer matches behavior. The body now asserts a successful streaming export and “no temp table,” so the “FailsWithMmapError” name is misleading. Either reintroduce the mmap-failure simulation or rename the test. Also, switch query assertions to assertQueryNoLeakCheck.
✏️ Possible rename
- public void testCopyParquetFailsWithMmapErrorCleansTempTable() throws Exception {
+ public void testCopyParquetStreamingExportDoesNotCreateTempTable() throws Exception {
As per coding guidelines: Use assertQueryNoLeakCheck() to assert the results of queries in tests.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@core/src/test/java/io/questdb/test/griffin/CopyExportTest.java` around lines 462 - 503, The test method in CopyExportTest now verifies a successful streaming export and absence of temp tables, so rename the test (the method currently named with "FailsWithMmapError") to reflect success (e.g., testStreamingExportSucceedsWithoutTempTable) and update its internal assertions to use assertQueryNoLeakCheck(...) instead of assertSql(...) for the parquet read verification; leave the CopyExportRunnable, runAndFetchCopyExportID, exportRoot and engine.getTableTokens(...) checks unchanged. Ensure the new test name is used wherever referenced.
2441-2483: ⚠️ Potential issue | 🟡 Minor
Use underscore separators in SQL numeric literals and assertQueryNoLeakCheck. The SQL text uses 10000; please format 5+ digit numbers with underscores.
📝 Suggested fix
- String insertQuery = "INSERT BATCH 10000 INTO symbol_test SELECT " +
+ String insertQuery = "INSERT BATCH 10_000 INTO symbol_test SELECT " +
As per coding guidelines: Use assertQueryNoLeakCheck() to assert the results of queries in tests; Always use underscores as thousands separators in numbers of 5 digits or more in QuestDB SQL and implementation code.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@core/src/test/java/io/questdb/test/griffin/CopyExportTest.java` around lines 2441 - 2483, Replace any plain five-plus digit numeric literals with underscore separators (e.g., change 10000 to 10_000) in the INSERT BATCH size and long_sequence call and any other occurrences; and replace assertions that use assertSql(...) for result verification with assertQueryNoLeakCheck(...) so the test uses the leak-check-free assertion method. Specifically, update the INSERT statement construction where "INSERT BATCH 10000" and "long_sequence(10000)" are built, and change the three assertSql(...) calls that validate sys.copy_export_log, count_distinct(sym) from read_parquet(...), and adjusted_value from read_parquet(...) to use assertQueryNoLeakCheck(...) while keeping the same expected/result SQL and references to runAndFetchCopyExportID, testCopyExport, CopyExportRunnable, and sqlExecutionContext.
🧹 Nitpick comments (7)
core/src/main/java/io/questdb/cutlass/parquet/ReusablePageFrameMemory.java (1)
41-48: Alphabetize private instance fields. The new class should keep private instance fields in strict alphabetical order.
♻️ Suggested reorder
- private final DirectLongList pageAddresses = new DirectLongList(32, MemoryTag.NATIVE_PARQUET_EXPORTER);
- private final DirectLongList pageSizes = new DirectLongList(32, MemoryTag.NATIVE_PARQUET_EXPORTER);
- private int columnCount;
- private boolean hasColumnTops;
- private long rowIdOffset;
+ private int columnCount;
+ private boolean hasColumnTops;
+ private final DirectLongList pageAddresses = new DirectLongList(32, MemoryTag.NATIVE_PARQUET_EXPORTER);
+ private final DirectLongList pageSizes = new DirectLongList(32, MemoryTag.NATIVE_PARQUET_EXPORTER);
+ private long rowIdOffset;
As per coding guidelines: Java class members must be grouped by kind (static vs. instance) and visibility, and sorted alphabetically.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@core/src/main/java/io/questdb/cutlass/parquet/ReusablePageFrameMemory.java` around lines 41 - 48, The private instance fields in ReusablePageFrameMemory are not in alphabetical order; reorder them alphabetically to comply with the coding guideline. Specifically, within class ReusablePageFrameMemory place the private fields in this alphabetical order: auxPageAddresses, auxPageSizes, columnCount, hasColumnTops, pageAddresses, pageSizes, rowIdOffset (keep their types and MemoryTag/NATIVE_PARQUET_EXPORTER initializers unchanged). Ensure grouping by visibility remains the same (private instance fields together) and update any constructor or initialization references if reordering affects readability.
core/src/main/java/io/questdb/griffin/PriorityMetadata.java (1)
68-70: Keep public methods alphabetically ordered. Please place getVirtualColumnReservedSlots() after the other get* methods to preserve alphabetical ordering.
As per coding guidelines: Java class members must be grouped by kind (static vs. instance) and visibility, and sorted alphabetically.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@core/src/main/java/io/questdb/griffin/PriorityMetadata.java` around lines 68 - 70, The method getVirtualColumnReservedSlots() is out of alphabetical order among the public getters; move its declaration so it sits alphabetically after the other public get* methods in the class (i.e., relocate getVirtualColumnReservedSlots() to the block containing the other public getters), ensuring member grouping and visibility remain unchanged and the class follows the alphabetical ordering guideline.
core/src/main/java/io/questdb/cutlass/parquet/HTTPSerialParquetExporter.java (1)
240-254: Keep new public methods sorted alphabetically. The added public APIs (setExportMode/setupCursorBasedExport/setupPageFrameBackedExport) should be ordered alphabetically within the public-method block.
As per coding guidelines: **/*.java: Java class members must be grouped by kind (static vs. instance) and visibility, and sorted alphabetically.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@core/src/main/java/io/questdb/cutlass/parquet/HTTPSerialParquetExporter.java` around lines 240 - 254, Public instance methods are out of alphabetical order: reorder the new public APIs so they are alphabetically sorted within the public-method block. Specifically, place setExportMode, setupCursorBasedExport, and setupPageFrameBackedExport in alphabetical order (setExportMode, setupCursorBasedExport, setupPageFrameBackedExport is already alphabetical by name—ensure they are placed in the correct public-method section and, if not, move them so the block's public instance methods follow alphabetical sorting), update any nearby public methods to maintain the overall alphabetical ordering required by the class member grouping rules.
core/src/main/java/io/questdb/cutlass/parquet/HybridColumnMaterializer.java (1)
76-109: Reorder fields to satisfy member ordering rules. This new class doesn’t follow the required alphabetical ordering of instance fields. Please reorder the field block accordingly.
As per coding guidelines: **/*.java: Java class members must be grouped by kind (static vs. instance) and visibility, and sorted alphabetically.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@core/src/main/java/io/questdb/cutlass/parquet/HybridColumnMaterializer.java` around lines 76 - 109, The instance fields in HybridColumnMaterializer are not ordered per project rules; reorder the field declarations so all static fields are first, then instance fields grouped by visibility (private, protected, public), and within each group sorted alphabetically; specifically adjust the ordering of DEFAULT_PAGE_SIZE, auxBuffers, baseColumnMap, bufferPool, computedBufferIdx, computedColumnIndices, computedFunctions, computedIsSymbolToString, computedOutputTypes, computedCount, dataBuffers, decimal128A, decimal256A, functionRecord, functions, hybridSymbolTableSource, outputColumnCount, pageFrameRecord, pfMemory, pinnedAuxBuffers, pinnedDataBuffers, adjustedMetadata (and any other members) so private instance fields appear alphabetically and after any static fields in the class HybridColumnMaterializer.
core/src/main/java/io/questdb/cutlass/parquet/SQLSerialParquetExporter.java (1)
69-80: Keep new instance fields alphabetically ordered. The added fields (identityColumnMap/streamBuffers/streamColumnData/writeCallback) are not placed in alphabetical order within the instance-field block. Please reorder to match the project standard.
As per coding guidelines: **/*.java: Java class members must be grouped by kind (static vs. instance) and visibility, and sorted alphabetically.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@core/src/main/java/io/questdb/cutlass/parquet/SQLSerialParquetExporter.java` around lines 69 - 80, The instance fields in SQLSerialParquetExporter are out of alphabetical order; reorder the instance-field block so members are alphabetized by name (e.g., configuration, copyExportRoot, exportPath, ff, fromParquet, identityColumnMap, nameSink, streamBuffers, streamColumnData, tempPath, toParquet, writeCallback) and keep their visibility/grouping intact—update the declaration order for the fields identityColumnMap, nameSink, streamBuffers, streamColumnData, writeCallback (and copyExportRoot) to match this alphabetical ordering while leaving types and initializers unchanged.
core/src/main/java/io/questdb/cutlass/http/processors/ExportQueryProcessorState.java (1)
56-95: Reorder newly added members to preserve alphabetical grouping. The new fields and helper methods are not placed in alphabetical order within their visibility groups. Please reorder for consistency.
As per coding guidelines: **/*.java: Java class members must be grouped by kind (static vs. instance) and visibility, and sorted alphabetically.
Also applies to: 212-226
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@core/src/main/java/io/questdb/cutlass/http/processors/ExportQueryProcessorState.java` around lines 56 - 95, The class member declarations are out of alphabetical order within their visibility groups; reorder the instance fields and helper members so they are grouped by visibility (private vs package-private) and sorted alphabetically. Specifically, move and reorder symbols such as arrayState, columnIndex, columnValueFullySent, copyExportContext, copyID, count, countRows, createParquetOp, cursor, delimiter, descending, errorMessage, errorPosition, firstParquetWriteCall, hasNext, metadata, noMeta, pageFrameCursor, parquetExportMode, parquetExportTableName, parquetFileOffset, pausedQuery, queryCacheable, queryState, record, recordCursorFactory, recordCursor, rnd, serialExporterInit, serialParquetExporter, setup fields like sqlText, skip, stop, task, timeout, writeCallback, materializer, materializerColumnData, fileName, exportModel so each visibility block is alphabetized; apply the same reordering to the other block referenced (lines ~212-226) to satisfy the coding guideline.
core/src/main/java/io/questdb/cutlass/parquet/CopyExportRequestTask.java (1)
56-77: Sort new fields/getters/setters alphabetically within visibility groups. The additions (exportMode/selectFactory/selectText/tempTableFactory and accessors) should be ordered alphabetically within their member groups.
As per coding guidelines: **/*.java: Java class members must be grouped by kind (static vs. instance) and visibility, and sorted alphabetically.
Also applies to: 150-193, 260-289
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@core/src/main/java/io/questdb/cutlass/parquet/CopyExportRequestTask.java` around lines 56 - 77, Reorder the newly added instance fields and their accessors in CopyExportRequestTask so they follow alphabetical order within the same visibility group: move exportMode, fileName, metadata, now, nowTimestampType, pageFrameCursor, parquetVersion, rawArrayEncoding, rowGroupSize, selectFactory, selectText, statisticsEnabled, tableName, tempTableFactory, writeCallback into the correct alphabetical positions among the other private instance fields; likewise, sort the corresponding public/private getters and setters (the methods introduced around the 150-193 and 260-289 ranges) alphabetically within their visibility block so method names like getExportMode/setExportMode, getSelectFactory/setSelectFactory, getSelectText/setSelectText, getTempTableFactory/setTempTableFactory appear in proper alphabetical order relative to the other accessors in CopyExportRequestTask.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@core/src/test/java/io/questdb/test/cutlass/http/ExpParquetExportTest.java`:
- Around line 792-821: In testParquetExportCursorBasedMultipleRowGroups replace
direct engine.execute(...) DDL calls with the test helper execute(...) and
convert long SQL strings to Java text blocks; specifically, change the two
CREATE TABLE calls that currently use engine.execute(...) to use execute(sql,
sqlExecutionContext) and rewrite their SQL arguments as multiline text blocks
("""...""") instead of concatenated strings, and ensure remaining long query
strings in the queries array are also expressed as text blocks; keep the call to
assertParquetExportDataCorrectness(...) unchanged.
In `@core/src/test/java/io/questdb/test/griffin/CopyExportTest.java`:
- Around line 1374-1410: The testCopyQueryWithCursorBasedMultipleRowGroups
contains a long inline COPY SQL and uses assertSql/assertEventually for result
checks; refactor the COPY statement passed to runAndFetchCopyExportID into a
Java text block (multiline string) and replace the assertSql/assertEventually
result assertions with assertQueryNoLeakCheck to follow the test conventions.
Specifically, update the CopyExportRunnable lambda that calls
runAndFetchCopyExportID to use a """...""" text block for the COPY (including
the WITH FORMAT parquet row_group_size 100 clause) and change the two
assertSql(...) calls inside the test lambda (which currently validate
sys.copy_export_log and read_parquet(...)) to equivalent
assertQueryNoLeakCheck(...) calls while keeping the same SQL and expected
output; leave identifiers like testCopyExport, runAndFetchCopyExportID,
CopyExportRunnable, and assertEventually in place where needed.
---
Outside diff comments:
In `@core/src/test/java/io/questdb/test/griffin/CopyExportTest.java`:
- Around line 462-503: The test method in CopyExportTest now verifies a
successful streaming export and absence of temp tables, so rename the test (the
method currently named with "FailsWithMmapError") to reflect success (e.g.,
testStreamingExportSucceedsWithoutTempTable) and update its internal assertions
to use assertQueryNoLeakCheck(...) instead of assertSql(...) for the parquet
read verification; leave the CopyExportRunnable, runAndFetchCopyExportID,
exportRoot and engine.getTableTokens(...) checks unchanged. Ensure the new test
name is used wherever referenced.
- Around line 2441-2483: Replace any plain five-plus digit numeric literals with
underscore separators (e.g., change 10000 to 10_000) in the INSERT BATCH size
and long_sequence call and any other occurrences; and replace assertions that
use assertSql(...) for result verification with assertQueryNoLeakCheck(...) so
the test uses the leak-check-free assertion method. Specifically, update the
INSERT statement construction where "INSERT BATCH 10000" and
"long_sequence(10000)" are built, and change the three assertSql(...) calls that
validate sys.copy_export_log, count_distinct(sym) from read_parquet(...), and
adjusted_value from read_parquet(...) to use assertQueryNoLeakCheck(...) while
keeping the same expected/result SQL and references to runAndFetchCopyExportID,
testCopyExport, CopyExportRunnable, and sqlExecutionContext.
---
Duplicate comments:
In `@core/src/test/java/io/questdb/test/griffin/CopyExportTest.java`:
- Around line 1479-1511: The test uses assertSql to verify query results inside
testCopyQueryWithPassthroughSymbol; replace those assertSql calls with
assertQueryNoLeakCheck(...) to follow the test guideline and avoid duplicate
leak checks — specifically update the two assertions that check "SELECT
export_path, num_exported_files, status FROM \"sys.copy_export_log\" LIMIT -1"
and "SELECT * FROM read_parquet(...)" within the CopyExportRunnable/test lambda
to call assertQueryNoLeakCheck with the same expected strings and queries so the
assertions use the no-leak-check variant.
- Around line 1150-1181: Replace the duplicate assertSql query assertions inside
testCopyQueryWithArithmeticExpression with assertQueryNoLeakCheck to follow the
test guideline: update the two calls to assertSql that verify
"sys.copy_export_log" and the read_parquet output (inside the assertEventually
lambda) to use assertQueryNoLeakCheck with the same expected strings and SQL
statements so the test uses the no-leak-check variant; leave the surrounding
CopyExportRunnable, assertEventually, and testCopyExport usage unchanged.
- Around line 1306-1337: In testCopyQueryWithComputedNullValues, replace the
duplicate assertSql calls used for validating query results with
assertQueryNoLeakCheck to follow test guidelines: update the two assertSql
invocations inside the assertEventually lambda (the one asserting "SELECT
export_path, num_exported_files, status FROM \"sys.copy_export_log\" ..." and
the one asserting "SELECT * FROM read_parquet(...)" ) to use
assertQueryNoLeakCheck instead so the test uses assertQueryNoLeakCheck for query
assertions while keeping the surrounding structure (CopyExportRunnable,
runAndFetchCopyExportID, assertEventually, testCopyExport) intact.
- Around line 1412-1441: The test testCopyQueryWithEmptyComputedResult uses
assertSql for query result checks but per guidelines should use
assertQueryNoLeakCheck; replace the two assertSql(...) calls (the one asserting
"export_path\tnum_exported_files\tstatus\n..." against SELECT export_path...
FROM \"sys.copy_export_log\" and the one asserting the count from
read_parquet(...)) with assertQueryNoLeakCheck calls using the same expected
string and query; ensure you call the overload of assertQueryNoLeakCheck used
elsewhere in tests and import/adjust parameters to match the test harness.
- Around line 1259-1304: In testCopyQueryWithComputedColumnsMultipleRowGroups,
replace direct assertSql(...) calls used to validate query results inside
assertEventually with assertQueryNoLeakCheck(...) to follow the test guideline
and avoid duplicate leak checks; locate the three assertSql invocations inside
the lambda passed to assertEventually (the SELECT from "sys.copy_export_log",
the count(*) read_parquet, and the two read_parquet result checks) and call
assertQueryNoLeakCheck with the same expected string and SQL string for each
instead, leaving CopyExportRunnable, runAndFetchCopyExportID, testCopyExport,
and assertEventually usage unchanged.
- Around line 1542-1597: The test method
testCopyStreamingExportFailureCleansUpTempDir uses assertSql to assert query
results inside the assertEventually block; replace that call with
assertQueryNoLeakCheck to follow the test guideline. Locate the assertSql(...)
invocation inside the assertEventually lambda in
CopyExportTest.testCopyStreamingExportFailureCleansUpTempDir and call
assertQueryNoLeakCheck with the same expected-result string and SQL query
("SELECT status FROM \"sys.copy_export_log\" LIMIT -1"), leaving the rest of the
cleanup assertions unchanged.
- Around line 1183-1219: The test uses assertSql to validate query results
inside the assertion closure; replace those assertSql calls with
assertQueryNoLeakCheck to follow the test guideline and avoid duplicate leak
checks — specifically update the two calls to assertSql within the
assertEventually block (the "SELECT export_path, num_exported_files, status FROM
\"sys.copy_export_log\" LIMIT -1" assertion and the "SELECT * FROM
read_parquet(...)" assertion) to use assertQueryNoLeakCheck while keeping the
same expected result strings and SQL queries; no other logic changes to
testCopyExport, CopyExportRunnable, assertEventually, or read_parquet invocation
are needed.
- Around line 1221-1257: The test method
testCopyQueryWithComputedAndPassthroughColumns uses assertSql inside the
CopyExportRunnable test runnable which performs query assertions; replace those
assertSql calls with assertQueryNoLeakCheck to follow the guideline and avoid
duplicate leak checks — specifically update the two assertSql invocations (one
checking sys.copy_export_log and one reading the parquet file) inside the lambda
assigned to CopyExportRunnable test so they call assertQueryNoLeakCheck with the
same expected string and SQL query arguments.
- Around line 1443-1477: In testCopyQueryWithFullMaterialization, the query
result assertions should use assertQueryNoLeakCheck instead of assertSql
(duplicate of previous guideline); update the assertions inside the
assertEventually lambda (the two calls to assertSql that check
"sys.copy_export_log" and the read_parquet output) to call
assertQueryNoLeakCheck with the same expected string and SQL query, leaving the
surrounding test structure (CopyExportRunnable stmt/test and testCopyExport
invocation) unchanged.
- Around line 1339-1372: The test testCopyQueryWithComputedSymbolColumn uses
assertSql for asserting query results inside the assertEventually block; replace
those assertSql(...) calls with assertQueryNoLeakCheck(...) to follow the test
guideline (update both the export log assertion and the read_parquet assertion),
keeping the same SQL strings and expected results and leaving the surrounding
assertEventually, CopyExportRunnable stmt, and testCopyExport(...) usage
unchanged.
- Around line 1599-1644: In testCopyStreamingExportMoveFailureCleansUpTempDir
replace the use of assertSql(...) inside the assertEventually block with
assertQueryNoLeakCheck(...) to follow the test guideline about using
assertQueryNoLeakCheck for query assertions; locate the assertion call in the
CopyExportTest method (the assertSql that checks "SELECT status FROM
\"sys.copy_export_log\" LIMIT -1") and swap it to assertQueryNoLeakCheck with
the same expected result string so the query assertion does not trigger the
standard memory-leak check twice.
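The assertSql-vs-assertQueryNoLeakCheck swaps requested above all have the same rationale: the plain helper wraps its own leak check, so calling it inside an outer leak-checked block runs the check twice. A toy model (not QuestDB code — leakCheck, assertSql, and assertQueryNoLeakCheck here are simplified stand-ins for the real test harness) shows the double-counting:

```java
// Toy model of why nesting assertSql inside an outer leak-checked block
// double-runs the leak check, while the *NoLeakCheck variant asserts
// results only. All names are stand-ins, not QuestDB's actual helpers.
public class LeakCheckSketch {
    static int leakChecks = 0;

    // runs the body, then performs one (simulated) memory-leak check
    static void leakCheck(Runnable body) {
        body.run();
        leakChecks++;
    }

    // assertSql analogue: wraps its own leak check around the assertion
    static void assertSql(String expected, String actual) {
        leakCheck(() -> {
            if (!expected.equals(actual)) throw new AssertionError();
        });
    }

    // assertQueryNoLeakCheck analogue: result assertion only
    static void assertQueryNoLeakCheck(String expected, String actual) {
        if (!expected.equals(actual)) throw new AssertionError();
    }

    public static void main(String[] args) {
        leakChecks = 0;
        leakCheck(() -> assertSql("ok", "ok"));              // inner + outer check
        if (leakChecks != 2) throw new AssertionError("expected double check");

        leakChecks = 0;
        leakCheck(() -> assertQueryNoLeakCheck("ok", "ok")); // outer check only
        if (leakChecks != 1) throw new AssertionError("expected single check");
        System.out.println("ok");
    }
}
```

The swap keeps the expected string and SQL unchanged; only the duplicated leak check disappears.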
---
Nitpick comments:
In
`@core/src/main/java/io/questdb/cutlass/http/processors/ExportQueryProcessorState.java`:
- Around line 56-95: The class member declarations are out of alphabetical order
within their visibility groups; reorder the instance fields and helper members
so they are grouped by visibility (private vs package-private) and sorted
alphabetically. Specifically, move and reorder symbols such as arrayState,
columnIndex, columnValueFullySent, copyExportContext, copyID, count, countRows,
createParquetOp, cursor, delimiter, descending, errorMessage, errorPosition,
firstParquetWriteCall, hasNext, metadata, noMeta, pageFrameCursor,
parquetExportMode, parquetExportTableName, parquetFileOffset, pausedQuery,
queryCacheable, queryState, record, recordCursorFactory, recordCursor, rnd,
serialExporterInit, serialParquetExporter, setup fields like sqlText, skip,
stop, task, timeout, writeCallback, materializer, materializerColumnData,
fileName, exportModel so each visibility block is alphabetized; apply the same
reordering to the other block referenced (lines ~212-226) to satisfy the coding
guideline.
In `@core/src/main/java/io/questdb/cutlass/parquet/CopyExportRequestTask.java`:
- Around line 56-77: Reorder the newly added instance fields and their accessors
in CopyExportRequestTask so they follow alphabetical order within the same
visibility group: move exportMode, fileName, metadata, now, nowTimestampType,
pageFrameCursor, parquetVersion, rawArrayEncoding, rowGroupSize, rowGroupSize,
selectFactory, selectText, statisticsEnabled, tableName, tempTableFactory,
writeCallback into the correct alphabetical positions among the other private
instance fields; likewise, sort the corresponding public/private getters and
setters (the methods introduced around the 150-193 and 260-289 ranges)
alphabetically within their visibility block so method names like
getExportMode/setExportMode, getSelectFactory/setSelectFactory,
getSelectText/setSelectText, getTempTableFactory/setTempTableFactory appear in
proper alphabetical order relative to the other accessors in
CopyExportRequestTask.
In
`@core/src/main/java/io/questdb/cutlass/parquet/HTTPSerialParquetExporter.java`:
- Around line 240-254: Public instance methods are out of alphabetical order:
reorder the new public APIs so they are alphabetically sorted within the
public-method block. Specifically, place setExportMode, setupCursorBasedExport,
and setupPageFrameBackedExport in alphabetical order (setExportMode,
setupCursorBasedExport, setupPageFrameBackedExport is already alphabetical by
name—ensure they are placed in the correct public-method section and, if not,
move them so the block's public instance methods follow alphabetical sorting),
update any nearby public methods to maintain the overall alphabetical ordering
required by the class member grouping rules.
In `@core/src/main/java/io/questdb/cutlass/parquet/HybridColumnMaterializer.java`:
- Around line 76-109: The instance fields in HybridColumnMaterializer are not
ordered per project rules; reorder the field declarations so all static fields
are first, then instance fields grouped by visibility (private, protected,
public), and within each group sorted alphabetically; specifically adjust the
ordering of DEFAULT_PAGE_SIZE, auxBuffers, baseColumnMap, bufferPool,
computedBufferIdx, computedColumnIndices, computedFunctions,
computedIsSymbolToString, computedOutputTypes, computedCount, dataBuffers,
decimal128A, decimal256A, functionRecord, functions, hybridSymbolTableSource,
outputColumnCount, pageFrameRecord, pfMemory, pinnedAuxBuffers,
pinnedDataBuffers, adjustedMetadata (and any other members) so private instance
fields appear alphabetically and after any static fields in the class
HybridColumnMaterializer.
In `@core/src/main/java/io/questdb/cutlass/parquet/ReusablePageFrameMemory.java`:
- Around line 41-48: The private instance fields in ReusablePageFrameMemory are
not in alphabetical order; reorder them alphabetically to comply with the coding
guideline. Specifically, within class ReusablePageFrameMemory place the private
fields in this alphabetical order: auxPageAddresses, auxPageSizes, columnCount,
hasColumnTops, pageAddresses, pageSizes, rowIdOffset (keep their types and
MemoryTag/NATIVE_PARQUET_EXPORTER initializers unchanged). Ensure grouping by
visibility remains the same (private instance fields together) and update any
constructor or initialization references if reordering affects readability.
In `@core/src/main/java/io/questdb/cutlass/parquet/SQLSerialParquetExporter.java`:
- Around line 69-80: The instance fields in SQLSerialParquetExporter are out of
alphabetical order; reorder the instance-field block so members are alphabetized
by name (e.g., configuration, copyExportRoot, exportPath, ff, fromParquet,
identityColumnMap, nameSink, streamBuffers, streamColumnData, tempPath,
toParquet, writeCallback) and keep their visibility/grouping intact—update the
declaration order for the fields identityColumnMap, nameSink, streamBuffers,
streamColumnData, writeCallback (and copyExportRoot) to match this alphabetical
ordering while leaving types and initializers unchanged.
In `@core/src/main/java/io/questdb/griffin/PriorityMetadata.java`:
- Around line 68-70: The method getVirtualColumnReservedSlots() is out of
alphabetical order among the public getters; move its declaration so it sits
alphabetically after the other public get* methods in the class (i.e., relocate
getVirtualColumnReservedSlots() to the block containing the other public
getters), ensuring member grouping and visibility remain unchanged and the class
follows the alphabetical ordering guideline.
@Test
public void testParquetExportCursorBasedMultipleRowGroups() throws Exception {
    getExportTester()
            .run((engine, sqlExecutionContext) -> {
                // CROSS JOIN produces a CURSOR_BASED factory (no page frame support).
                // 10 x 500 = 5000 rows with computed columns and small row groups
                // exercises the buffer pinning fix in the cursor-based path.
                engine.execute("""
                        CREATE TABLE cb_t1 AS (
                            SELECT x AS a FROM long_sequence(10)
                        )""", sqlExecutionContext);
                engine.execute("""
                        CREATE TABLE cb_t2 AS (
                            SELECT x AS b,
                                   rnd_str(5, 10, 0) AS s,
                                   rnd_varchar(5, 10, 0) AS vc,
                                   rnd_double_array(1, 5) AS arr
                            FROM long_sequence(500)
                        )""", sqlExecutionContext);

                String[] queries = {
                        "SELECT cb_t1.a + cb_t2.b AS sum_ab, cb_t2.s FROM cb_t1 CROSS JOIN cb_t2",
                        "SELECT cb_t1.a * cb_t2.b AS product, cb_t2.s FROM cb_t1 CROSS JOIN cb_t2",
                        // VARCHAR and ARRAY through cursor-based buffers
                        "SELECT cb_t1.a + cb_t2.b AS sum_ab, cb_t2.vc FROM cb_t1 CROSS JOIN cb_t2",
                        "SELECT cb_t1.a + cb_t2.b AS sum_ab, cb_t2.arr FROM cb_t1 CROSS JOIN cb_t2",
                };

                assertParquetExportDataCorrectness(engine, sqlExecutionContext, queries, queries.length * 3, 200);
            });
Use test helpers for DDL and prefer text blocks for long SQL.
New tests still use engine.execute(...) for DDL and build long SQL via string concatenation (e.g., Line 1489 and Line 1597). Please switch to execute() for non-queries and use multiline string literals for longer SQL statements.
As per coding guidelines: **/src/test/java/**/*.java: Use execute() to run non-queries (DDL) in tests; Use multiline strings for longer SQL statements (multiple INSERT rows, complex queries) and to assert multiline query results in tests.
Also applies to: 1405-1657
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@core/src/test/java/io/questdb/test/cutlass/http/ExpParquetExportTest.java`
around lines 792 - 821, In testParquetExportCursorBasedMultipleRowGroups replace
direct engine.execute(...) DDL calls with the test helper execute(...) and
convert long SQL strings to Java text blocks; specifically, change the two
CREATE TABLE calls that currently use engine.execute(...) to use execute(sql,
sqlExecutionContext) and rewrite their SQL arguments as multiline text blocks
("""...""") instead of concatenated strings, and ensure remaining long query
strings in the queries array are also expressed as text blocks; keep the call to
assertParquetExportDataCorrectness(...) unchanged.
@Test
public void testCopyQueryWithCursorBasedMultipleRowGroups() throws Exception {
    // CURSOR_BASED: CROSS JOIN produces a non-page-frame factory.
    // Small row_group_size forces multiple batches and exercises buffer pinning.
    assertMemoryLeak(() -> {
        execute("CREATE TABLE cb_t1 AS (SELECT x AS a FROM long_sequence(10))");
        execute("""
                CREATE TABLE cb_t2 AS (
                    SELECT x AS b,
                           rnd_str(5, 10, 0) AS s,
                           rnd_varchar(5, 10, 0) AS vc,
                           rnd_double_array(1, 5) AS arr
                    FROM long_sequence(500)
                )""");

        // All columns materialized through buffers in cursor mode:
        // STRING (s), VARCHAR (vc), and ARRAY (arr) exercise var-size buffer pinning
        CopyExportRunnable stmt = () ->
                runAndFetchCopyExportID(
                        "COPY (SELECT cb_t1.a + cb_t2.b AS sum_ab, cb_t2.s, cb_t2.vc, cb_t2.arr FROM cb_t1 CROSS JOIN cb_t2) TO 'cb_rg_output' WITH FORMAT parquet row_group_size 100",
                        sqlExecutionContext
                );

        CopyExportRunnable test = () ->
                assertEventually(() -> {
                    assertSql(
                            "export_path\tnum_exported_files\tstatus\n" +
                                    exportRoot + File.separator + "cb_rg_output.parquet\t1\tfinished\n",
                            "SELECT export_path, num_exported_files, status FROM \"sys.copy_export_log\" LIMIT -1"
                    );
                    assertSql("count\n5000\n",
                            "SELECT count(*) FROM read_parquet('" + exportRoot + File.separator + "cb_rg_output.parquet')");
                });

        testCopyExport(stmt, test);
    });
}
Use a text block for the complex COPY query; align assertions with assertQueryNoLeakCheck. The CROSS JOIN COPY statement is long and would be clearer as a multiline string.
♻️ Example refactor
- CopyExportRunnable stmt = () ->
- runAndFetchCopyExportID(
- "COPY (SELECT cb_t1.a + cb_t2.b AS sum_ab, cb_t2.s, cb_t2.vc, cb_t2.arr FROM cb_t1 CROSS JOIN cb_t2) TO 'cb_rg_output' WITH FORMAT parquet row_group_size 100",
- sqlExecutionContext
- );
+ CopyExportRunnable stmt = () ->
+ runAndFetchCopyExportID(
+ """
+ COPY (
+ SELECT cb_t1.a + cb_t2.b AS sum_ab,
+ cb_t2.s,
+ cb_t2.vc,
+ cb_t2.arr
+ FROM cb_t1 CROSS JOIN cb_t2
+ ) TO 'cb_rg_output' WITH FORMAT PARQUET row_group_size 100
+ """,
+ sqlExecutionContext
+            );

As per coding guidelines: Use assertQueryNoLeakCheck() to assert the results of queries in tests; use multiline strings for longer statements (multiple INSERT rows, complex queries).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@core/src/test/java/io/questdb/test/griffin/CopyExportTest.java` around lines
1374 - 1410, The testCopyQueryWithCursorBasedMultipleRowGroups contains a long
inline COPY SQL and uses assertSql/assertEventually for result checks; refactor
the COPY statement passed to runAndFetchCopyExportID into a Java text block
(multiline string) and replace the assertSql/assertEventually result assertions
with assertQueryNoLeakCheck to follow the test conventions. Specifically, update
the CopyExportRunnable lambda that calls runAndFetchCopyExportID to use a
"""...""" text block for the COPY (including the WITH FORMAT parquet
row_group_size 100 clause) and change the two assertSql(...) calls inside the
test lambda (which currently validate sys.copy_export_log and read_parquet(...))
to equivalent assertQueryNoLeakCheck(...) calls while keeping the same SQL and
expected output; leave identifiers like testCopyExport, runAndFetchCopyExportID,
CopyExportRunnable, and assertEventually in place where needed.
In CURSOR_BASED mode, buildColumnDataFromCursor() dispatched writeColumnValue() using the adjusted output type (STRING for SYMBOL columns). This caused record.getStrA(col) to be called instead of record.getSymA(col), producing incorrect data or errors. Track the original (pre-adjustment) column types in a separate computedSourceTypes list and use them for writeColumnValue() dispatch, so SYMBOL columns correctly go through the getSymA() path. The adjusted output types are still used for buffer management and addColumnData(). Co-Authored-By: Claude Opus 4.6 <[email protected]>
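The dispatch fix in the commit above can be sketched with hypothetical types (Rec, ColType, and readValue are illustrative stand-ins, not QuestDB APIs): record accessors are dispatched on the original source type, so a SYMBOL column goes through getSymA() even though its adjusted output type is STRING.

```java
// Toy model of the fix: dispatch record accessors on the *source* type,
// while the adjusted output type is used only for buffers/metadata.
public class SymbolDispatchSketch {
    enum ColType { SYMBOL, STRING, LONG }

    interface Rec {
        String getSymA(int col);
        String getStrA(int col);
        long getLong(int col);
    }

    static String readValue(Rec rec, int col, ColType sourceType) {
        // SYMBOL must go through getSymA(); calling getStrA() on a SYMBOL
        // column (the pre-fix behavior) produced incorrect data or errors.
        return switch (sourceType) {
            case SYMBOL -> rec.getSymA(col);
            case STRING -> rec.getStrA(col);
            case LONG -> Long.toString(rec.getLong(col));
        };
    }

    public static void main(String[] args) {
        Rec rec = new Rec() {
            public String getSymA(int col) { return "sym"; }
            public String getStrA(int col) { throw new IllegalStateException("wrong accessor for SYMBOL"); }
            public long getLong(int col) { return 42L; }
        };
        ColType sourceType = ColType.SYMBOL; // pre-adjustment type drives dispatch
        ColType outputType = ColType.STRING; // adjusted type drives buffer setup only
        if (!"sym".equals(readValue(rec, 0, sourceType))) throw new AssertionError();
        System.out.println("dispatched via " + sourceType + ", output as " + outputType);
    }
}
```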
The test was not exercising the temp table export path because the streaming export path (introduced on this branch) handled the query without creating a temp table. Restore the original FilesFacade mmap-failure injection from master, and change the table schema to include a BINARY column with a WHERE clause to force the TEMP_TABLE export mode. This ensures the test actually triggers temp table creation, mmap failure, and cleanup. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Move determineExportMode(), hasComputedBinaryColumn(), and unwrapFactory() from HybridColumnMaterializer to ParquetExportMode. These are general-purpose dispatch methods that determine which export strategy to use — they don't belong on one specific strategy implementation. Also fold the descending-order fallback (PAGE_FRAME_BACKED to CURSOR_BASED) into determineExportMode() via a new isDescending parameter, eliminating duplicated post-hoc fixups in ExportQueryProcessor. Co-Authored-By: Claude Opus 4.6 <[email protected]>
The three setUp() overloads in StreamPartitionParquetExporter shared ~80% of their structure (buffer init, column iteration, SYMBOL handling, createStreamingParquetWriter call). The parameterless setUp() was a full copy of the three-arg version with slightly different SYMBOL resolution and writer index logic. Collapse into a single implementation that the other two delegate to. The key insight is that meta.getWriterIndex(i) already returns the correct value for all paths: the physical writer column index for table metadata, and i for GenericRecordMetadata (hybrid/cursor paths). The baseColumnMap nullity distinguishes "all columns are real table columns" from "only mapped columns are pass-through". Co-Authored-By: Claude Opus 4.6 <[email protected]>
Both HTTPSerialParquetExporter.process() and SQLSerialParquetExporter.process() had nearly identical finally blocks for dropping the temporary table used in TEMP_TABLE export mode: set phase, update status, resolve table token, drop table, and handle CairoException/Throwable with identical log-and-update logic. Extract this into a protected dropTempTable() method on the base class. The SQL subclass previously used its fromParquet Path field for the drop call; the shared method uses Path.getThreadLocal() instead, which is equivalent since dropTableOrViewOrMatView consumes the path immediately. The HTTP-specific freeOwnedPageFrameCursor() call remains at the call site since it is not part of the drop logic. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Extract shared state and methods from HTTPSerialParquetExporter into a new abstract BaseParquetExporter class. Both HTTPSerialParquetExporter and SQLSerialParquetExporter now extend this base directly. The base class holds fields (copyExportContext, insertSelectReporter, sqlExecutionContext, circuitBreaker, task) and methods (of(), dropTempTable(), drainHybridFrames()) that both export paths need. HTTP-specific streaming state (exportMode, fullCursor, materializer, materializerColumnData, streamingPfc) stays in HTTPSerialParquetExporter where it belongs. This removes the awkward dependency where SQLSerialParquetExporter extended HTTPSerialParquetExporter despite not using any of its HTTP-specific fields or methods. Co-Authored-By: Claude Opus 4.6 <[email protected]>
The identity column map [0, 1, 2, ..., n-1] was rebuilt on every DIRECT_PAGE_FRAME export invocation. Since the field is an instance variable and its contents are deterministic for a given size, guard the rebuild with a size check so it is only constructed once per distinct column count. Co-Authored-By: Claude Opus 4.6 <[email protected]>
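A minimal sketch of that guard (class and field names are illustrative, not the actual QuestDB code) — since the map's contents are fully determined by its size, a size comparison is a sufficient cache key:

```java
// Rebuild the identity column map [0..n-1] only when the requested
// column count changes; otherwise reuse the cached array.
public class IdentityMapSketch {
    private int[] identityColumnMap = new int[0];
    int rebuilds = 0; // instrumentation for the sketch only

    int[] identityMap(int columnCount) {
        if (identityColumnMap.length != columnCount) { // size check guards the rebuild
            identityColumnMap = new int[columnCount];
            for (int i = 0; i < columnCount; i++) {
                identityColumnMap[i] = i;
            }
            rebuilds++;
        }
        return identityColumnMap;
    }

    public static void main(String[] args) {
        IdentityMapSketch s = new IdentityMapSketch();
        s.identityMap(4);
        s.identityMap(4);
        s.identityMap(4);
        if (s.rebuilds != 1) throw new AssertionError("expected one rebuild");
        s.identityMap(6); // distinct column count forces a rebuild
        if (s.rebuilds != 2) throw new AssertionError("expected second rebuild");
        System.out.println("rebuilds=" + s.rebuilds);
    }
}
```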
Replace three separate catch blocks (SqlException, CairoException, Throwable) in HTTPSerialParquetExporter.process() with a single catch (Throwable) that uses pattern variables to extract the error message and errno per type. All three blocks performed identical cleanup: log, free factory, clear export resources, update status, and rethrow — differing only in how they obtained the message and errno. Co-Authored-By: Claude Opus 4.6 <[email protected]>
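A hedged sketch of the collapsed error handling (the exception type and errno field are stand-ins for SqlException/CairoException, and the cleanup body is a placeholder): one catch (Throwable) performs the shared cleanup, with instanceof pattern variables extracting the per-type message and errno.

```java
// Collapse per-type catch blocks into a single catch (Throwable) whose
// only type-specific logic is message/errno extraction.
public class CatchCollapseSketch {
    // stand-in for CairoException, which carries an errno
    static class CairoLikeException extends RuntimeException {
        final int errno;
        CairoLikeException(String msg, int errno) { super(msg); this.errno = errno; }
    }

    static String handle(Runnable body) {
        try {
            body.run();
            return "ok";
        } catch (Throwable th) {
            // shared cleanup path would go here (log, free factory, update status)
            int errno = th instanceof CairoLikeException ce ? ce.errno : -1;
            String message = th.getMessage() != null ? th.getMessage() : th.getClass().getSimpleName();
            return message + ":" + errno;
        }
    }

    public static void main(String[] args) {
        if (!"ok".equals(handle(() -> {}))) throw new AssertionError();
        if (!"disk full:22".equals(handle(() -> { throw new CairoLikeException("disk full", 22); })))
            throw new AssertionError();
        if (!"boom:-1".equals(handle(() -> { throw new IllegalStateException("boom"); })))
            throw new AssertionError();
        System.out.println("ok");
    }
}
```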
| assert exportMode != null; | ||
| switch (exportMode) { | ||
| case PAGE_FRAME_BACKED, CURSOR_BASED -> processHybridStreamExport(); | ||
| case DIRECT_PAGE_FRAME, TABLE_READER, TEMP_TABLE -> processStreamExport(); |
We should handle the default case as a failsafe.
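A sketch of what this suggestion amounts to (method names mirror the PR, but the bodies are placeholders returning strings): even when the switch covers every current enum constant, a default arm makes a future enum addition fail loudly instead of silently falling through.

```java
// Exhaustive switch over the export mode enum, plus a default arm as a
// failsafe against enum constants added later.
public class ExportDispatchSketch {
    enum ParquetExportMode { DIRECT_PAGE_FRAME, PAGE_FRAME_BACKED, CURSOR_BASED, TEMP_TABLE, TABLE_READER }

    static String dispatch(ParquetExportMode exportMode) {
        switch (exportMode) {
            case PAGE_FRAME_BACKED, CURSOR_BASED -> {
                return "processHybridStreamExport";
            }
            case DIRECT_PAGE_FRAME, TABLE_READER, TEMP_TABLE -> {
                return "processStreamExport";
            }
            // failsafe: a newly added constant must fail loudly here
            default -> throw new UnsupportedOperationException("unhandled export mode: " + exportMode);
        }
    }

    public static void main(String[] args) {
        if (!"processHybridStreamExport".equals(dispatch(ParquetExportMode.CURSOR_BASED)))
            throw new AssertionError();
        if (!"processStreamExport".equals(dispatch(ParquetExportMode.TEMP_TABLE)))
            throw new AssertionError();
        System.out.println("ok");
    }
}
```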
bluestreak01 left a comment:
Code Review
Design & Architecture
The mode detection in ParquetExportMode.determineExportMode() is clean and well-reasoned:
DIRECT_PAGE_FRAME (zero-copy) → PAGE_FRAME_BACKED (hybrid) → CURSOR_BASED (full materialization) → TEMP_TABLE (fallback)
The BINARY column exclusion from the hybrid/cursor paths is the right call given the unbounded data size. The BaseParquetExporter extraction cleanly separates HTTP vs SQL concerns while sharing drainHybridFrames() and dropTempTable().
The zero-copy claim is valid for pass-through columns in buildColumnDataFromPageFrame() — page frame addresses are read directly with no data copying. Buffer pinning/pooling prevents use-after-free while minimizing allocation. Resource cleanup ordering is correct (Rust writer closed before native buffers freed). The ownership model (state owns, exporter borrows) is clear.
Issues
1. Field ordering violation in HybridColumnMaterializer
HybridColumnMaterializer.java:103-104 — adjustedMetadata and hybridSymbolTableSource are declared after pinnedDataBuffers/pinnedAuxBuffers, but alphabetically they should come first among the instance fields (before auxBuffers).
2. Pinned buffer release heuristic lacks invariant documentation
BaseParquetExporter.java:89-93:
if (inFlightRows <= rowCount) {
    mat.releasePinnedBuffers();
}

This relies on a non-obvious invariant: that when the in-flight row count drops to at most one frame's worth, the Rust writer has consumed all data from the previously pinned buffers. A comment documenting why this is safe (e.g., "Rust processes frames FIFO and rowsWrittenToRowGroups covers all frames up to the written row group boundary") would be valuable.
3. writeColumnValue() and writeComputedValue() are near-duplicates
HybridColumnMaterializer.java:522-575 and 577-625 are two large switch statements covering every column type — one operates on Record + column index, the other on Function + Record. Any new column type must be added to both. A comment noting they must stay in sync would help prevent future drift.
4. SYMBOL handling asymmetry between the two writeXxx methods
In writeColumnValue() (line 536-540), SYMBOL is explicitly handled — calling record.getSymA(col) and converting to STRING. But in writeComputedValue(), there is no SYMBOL case. The materializeComputedColumns() method (line 447) handles the symbol-to-string case separately via computedIsSymbolToString, calling func.getSymbol() outside of writeComputedValue(). This works, but the asymmetry makes it easy to overlook that writeComputedValue() is intentionally missing the SYMBOL branch. If someone were to call writeComputedValue() with a SYMBOL type directly, it would hit the default branch and throw. Worth a comment.
5. PR description — active voice
The CLAUDE.md says to use active voice. A few passive constructions in the PR description:
- "Per-column metadata are pre-computed into flat arrays at setup time" → "setUp() pre-computes per-column metadata into flat arrays"
- "Computed SYMBOL columns are converted to STRING" → "The materializer converts computed SYMBOL columns to STRING"
6. computedCount reset in resetBuffers()
HybridColumnMaterializer.java:494-500 — computedCount is temporarily zeroed and then re-incremented through allocateBuffer() during the pin-and-reallocate path. This is correct but subtle. A brief comment explaining why computedCount is temporarily zeroed would help.
Resource Management
Resource cleanup ordering is correct: ExportQueryProcessorState.clear() closes the Rust streaming writer (via task.clear()) before freeing materializer buffers (via materializer.clear()), preventing use-after-free.
Test Coverage
Well covered:
- Computed columns (CAST, arithmetic, string functions)
- NULL values, SYMBOL-to-STRING conversion
- Multiple row groups, cursor-based mode, column tops
Missing or weak:
- PeerIsSlowToReadException resume in hybrid mode — the HTTP resume codepath for processHybridStreamExport()/processCursorBasedExport() is untested.
- NULL values spanning row group boundaries — existing NULL tests use small data fitting one row group.
- Cursor-based mode with NULLs across row groups — not tested.
I've pushed tests covering the latter two gaps on this branch (5 new tests, all green):
- CopyExportTest#testCopyQueryWithComputedNullValuesMultipleRowGroups
- CopyExportTest#testCopyQueryWithCursorBasedNullsMultipleRowGroups
- CopyExportTest#testCopyQueryWithComputedColumnsDescending
- ExpParquetExportTest#testParquetExportPageFrameComputedColumnsNulls
- ExpParquetExportTest#testParquetExportPageFrameCursorBasedNulls
Add 5 new tests covering gaps identified during code review: - CopyExportTest#testCopyQueryWithComputedNullValuesMultipleRowGroups: PAGE_FRAME_BACKED with periodic NULLs in computed columns spanning multiple row group flushes (row_group_size 50, 1000 rows). - CopyExportTest#testCopyQueryWithCursorBasedNullsMultipleRowGroups: CURSOR_BASED (CROSS JOIN) with NULL sentinels for both fixed-size and var-size columns across row group boundaries. - CopyExportTest#testCopyQueryWithComputedColumnsDescending: PAGE_FRAME_BACKED with computed columns and small row_group_size forcing multiple row group flushes, verifying data correctness. - ExpParquetExportTest#testParquetExportPageFrameComputedColumnsNulls: HTTP export with periodic NULLs in computed columns, randomized row_group_size via assertParquetExportDataCorrectness(). - ExpParquetExportTest#testParquetExportPageFrameCursorBasedNulls: HTTP cursor-based export with NULLs in CROSS JOIN results. Co-Authored-By: Claude Opus 4.6 <[email protected]>
[PR Coverage check] 😍 pass: 10/11 (90.91%)
…omputed columns (#6795)
Fixes #6766
Summary
- Parquet export (COPY ... TO and HTTP /exp) no longer falls back to a temporary table when the projection contains computed expressions (e.g., CAST, arithmetic, string functions). Instead, a new hybrid export mode passes through raw page-frame-backed columns zero-copy and materializes only the computed columns into native buffers, row by row per frame.
- Exports that re-partition (PARTITION BY override) or that contain a computed BINARY column still use the temp table path.
- Also fixed: moveFile() failure handling, use-after-free on error path cleanup, Rust streaming writer close ordering relative to pinned native buffers, and SYMBOL column dispatch in cursor-based mode (was using the adjusted output type instead of the source type for record accessor dispatch).
- BaseParquetExporter holds shared state and methods (dropTempTable(), drainHybridFrames()), while HTTPSerialParquetExporter and SQLSerialParquetExporter extend it independently. This removes the prior dependency where the SQL exporter extended the HTTP exporter despite not using any HTTP-specific fields.

Design
ParquetExportMode.determineExportMode() inspects the compiled RecordCursorFactory before constructing any temp table:

- DIRECT_PAGE_FRAME — exports the PageFrameCursor directly
- PAGE_FRAME_BACKED — VirtualRecordCursorFactory whose base supports page frames, no computed BINARY columns
- CURSOR_BASED — full materialization through a RecordCursor
- TEMP_TABLE — PARTITION BY override or computed BINARY columns

The ParquetExportMode enum owns the mode selection logic (determineExportMode()), including the descending-order fallback from PAGE_FRAME_BACKED to CURSOR_BASED. This avoids duplicated post-hoc fixups and re-applies the fallback correctly after TableReferenceOutOfDateException recompilation.

HybridColumnMaterializer handles both PAGE_FRAME_BACKED and CURSOR_BASED modes. For each output column it either copies page-frame pointers directly into the 7-longs-per-column layout that the Rust Parquet writer expects, or evaluates the column's Function against a record and writes results into MemoryCARWImpl data/aux buffers. The materializer converts computed SYMBOL columns to STRING in the adjusted metadata but dispatches them via source type for correct record accessor usage.

Because Rust's Parquet writer holds references to native memory across multiple frame writes (pending row-group data), the materializer pins compute buffers that Rust may still reference and allocates fresh ones, releasing pinned buffers only after the Rust side confirms a row-group flush. A buffer pool avoids reallocation overhead.
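The selection ladder can be sketched as follows. The boolean parameters are hypothetical stand-ins for the real RecordCursorFactory inspection; only the ordering of the checks mirrors the description above.

```java
// Hedged sketch of the export-mode selection ladder, including the
// descending-order fallback from PAGE_FRAME_BACKED to CURSOR_BASED.
public class ModeSelectionSketch {
    enum ParquetExportMode { DIRECT_PAGE_FRAME, PAGE_FRAME_BACKED, CURSOR_BASED, TEMP_TABLE }

    static ParquetExportMode determineExportMode(
            boolean partitionByOverride,      // COPY specifies a different PARTITION BY
            boolean computedBinaryColumn,     // projection computes a BINARY column
            boolean factorySupportsPageFrames, // pure SELECT over a page-frame-backed table
            boolean virtualOverPageFrames,    // VirtualRecordCursorFactory with page-frame base
            boolean isDescending
    ) {
        if (partitionByOverride || computedBinaryColumn) return ParquetExportMode.TEMP_TABLE;
        if (factorySupportsPageFrames) return ParquetExportMode.DIRECT_PAGE_FRAME;
        if (virtualOverPageFrames && !isDescending) return ParquetExportMode.PAGE_FRAME_BACKED;
        return ParquetExportMode.CURSOR_BASED; // descending order also lands here
    }

    public static void main(String[] args) {
        // computed columns over a page-frame base -> hybrid mode
        if (determineExportMode(false, false, false, true, false) != ParquetExportMode.PAGE_FRAME_BACKED)
            throw new AssertionError();
        // same query in descending order falls back to cursor-based
        if (determineExportMode(false, false, false, true, true) != ParquetExportMode.CURSOR_BASED)
            throw new AssertionError();
        // PARTITION BY override always re-partitions through a temp table
        if (determineExportMode(true, false, false, true, false) != ParquetExportMode.TEMP_TABLE)
            throw new AssertionError();
        System.out.println("ok");
    }
}
```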
setUp() pre-computes per-column metadata (types, functions, symbol-to-string flags) and computed-column indices once into flat arrays to eliminate repeated indirection in hot loops.

Mode selection happens before constructing CreateTableOperation, which eliminates an extra query compilation pass for all non-temp-table cases. The ring queue passes the compiled RecordCursorFactory to the exporter, avoiding redundant SqlCompiler.compile() calls.

Exporter hierarchy
A single StreamPartitionParquetExporter.setUp() implementation serves all export modes. The baseColumnMap nullity distinguishes pass-through vs. all-real-column paths, and meta.getWriterIndex(i) returns the correct value for both table metadata and GenericRecordMetadata.

Test plan

- CopyExportTest and ExpParquetExportTest suites pass