Skip to content

perf(parquet): speed up parquet writing#6735

Merged
bluestreak01 merged 37 commits intomasterfrom
feat-parquet-export-speedup
Feb 5, 2026
Merged

perf(parquet): speed up parquet writing#6735
bluestreak01 merged 37 commits intomasterfrom
feat-parquet-export-speedup

Conversation

@ideoma
Copy link
Copy Markdown
Collaborator

@ideoma ideoma commented Feb 2, 2026

Summary

This PR speeds up Parquet export through two optimizations:

  1. SIMD-accelerated encoding for nullable INT/LONG columns using portable SIMD intrinsics
  2. Memory-efficient streaming mode using madvise hints to prevent page cache exhaustion under memory pressure

Changes

SIMD Encoding (Rust)

  • Added portable_simd feature for cross-platform SIMD support
  • New i32_slice_to_page_simd and i64_slice_to_page_simd functions that vectorize definition level encoding for nullable columns
  • Processes 64 values per iteration using SIMD comparison and bit manipulation

Streaming Mode (Java)

Introduces streaming mode for TableReader that applies two Linux kernel hints to memory-mapped partition files:

Hint When Applied Effect
MADV_SEQUENTIAL On mmap Enables aggressive read-ahead; kernel prefetches upcoming pages
MADV_DONTNEED Before munmap Immediately releases page cache; prevents memory exhaustion

Why both hints are needed:

Scenario Behavior Result
No hints + memory pressure Default access pattern, no prefetch optimization, page cache fills up causing eviction storms 82 MB/s
SEQUENTIAL only Better prefetching but cache still accumulates, eventually triggers memory pressure 214 MB/s
SEQUENTIAL + DONTNEED Optimal prefetching AND immediate cache release, memory stays available for new partitions 387 MB/s

Implementation:

  • MADV_SEQUENTIAL applied in MemoryCMRImpl.map() and TableUtils when opening column files
  • MADV_DONTNEED applied in closePartitionColumn() and MappedMemoryPartitionDescriptor.clear() before munmap()
  • Streaming mode bypasses MmapCache via mmapNoCache() so each partition mapping is independent and fully releasable
  • Added TableReader.setStreamingMode() and PageFrameCursor.setStreamingMode() APIs
  • Parquet exporters (SQLSerialParquetExporter, CopyExportRequestTask) automatically enable streaming mode

Performance Results

SIMD Encoding (no memory pressure)

Metric Before After Improvement
LZ4RAW
Elapsed time 37.73s 32.40s -14.1%
CPU time 22.09s 17.68s -20.0%
CPU utilization 58.6% 54.6% -4.0pp
Input throughput 467 MB/s 544 MB/s +16.5%
Output throughput 381 MB/s 444 MB/s +16.5%
UNCOMPRESSED
Elapsed time 29.79s 29.42s -1.2%
CPU time 17.31s 14.38s -16.9%
CPU utilization 58.1% 48.9% -9.2pp
Input throughput 592 MB/s 599 MB/s +1.2%
Output throughput 544 MB/s 550 MB/s +1.3%

LZ4RAW sees the biggest gains (~16% faster throughput, 20% less CPU). Uncompressed is I/O-bound so wall time is similar, but CPU usage dropped ~17%.

madvise Hints (under memory pressure)

Test Environment:

  • Ubuntu 24.04, 32 GB RAM
  • Dataset: 17 GB table with 1.1 GB partitions

Test Methodology:

  1. Start memory pressure using stress-ng to consume 28 GB, leaving ~4 GB free (CPU stress disabled)
  2. Start QuestDB
  3. Clear Linux page cache: sync && echo 3 > /proc/sys/vm/drop_caches
  4. Run Parquet export

This sequence ensures cold-cache conditions under realistic memory pressure, simulating a production server with other workloads consuming memory.

Configuration Throughput (MB/s) vs. No Hints vs. Baseline
No memory pressure (baseline) 410.73 100%
No madvise hints 82.59 1.0× 20%
MADV_SEQUENTIAL only 213.85 2.6× 52%
MADV_SEQUENTIAL + MADV_DONTNEED 387.14 4.7× 94%

Under memory pressure, the combination of both hints recovers 94% of baseline performance.
The benchmark export scirpt is attached:

bench_parquet_export.sh

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Feb 2, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

  • 🔍 Trigger a full review

Walkthrough

Implements streaming Parquet export with partition release and SIMD optimizations. Adds explicit partition release mechanisms via releaseOpenPartitions() and streaming mode via setStreamingMode() across the PageFrameCursor hierarchy. Introduces no-cache memory mapping with madvise hints for page-cache management. Replaces primitive Parquet encoding with SIMD-accelerated variants in Rust. Includes benchmark script and comprehensive tests.

Changes

Cohort / File(s) Summary
Rust SIMD Encoding
qdbr/src/lib.rs, qdbr/src/parquet_write/mod.rs, qdbr/src/parquet_write/primitive.rs, qdbr/src/parquet_write/file.rs
Enables portable_simd feature and replaces primitive page-encoding calls (int/long/float/double types) with SIMD-accelerated variants (i32_slice_to_page_simd, i64_slice_to_page_simd, f32/f64 variants).
Streaming Writer Progress Tracking
qdbr/src/parquet_write/jni.rs
Adds rows_written_to_row_groups field to StreamingParquetWriter; tracks cumulative rows via header updates during flush/finish.
PageFrameCursor Streaming API
cairo/sql/PageFrameCursor.java, griffin/engine/QueryProgress.java, griffin/engine/table/SelectedRecordCursorFactory.java, griffin/engine/table/ExtraNullColumnCursorFactory.java
Introduces releaseOpenPartitions() and setStreamingMode(boolean) default methods; forwards to base cursor in QueryProgress and factory wrappers.
Partition Release Tracking
griffin/engine/table/FwdTableReaderPageFrameCursor.java, griffin/engine/table/BwdTableReaderPageFrameCursor.java
Implements explicit partition release: lowestOpenPartitionIndex (forward) and highestOpenPartitionIndex (backward) to track and close partitions outside current frame range.
TableReader Streaming Mode
cairo/TableReader.java, griffin/engine/table/TablePageFrameCursor.java
Adds setStreamingMode(boolean) and closePartitionByIndex(int); applies madvise options during partition open/close; adjusts max-open-partitions logic when streaming.
No-Cache Memory Mapping
std/Files.java, std/FilesFacade.java, std/FilesFacadeImpl.java, cairo/TableUtils.java
Introduces POSIX_MADV_DONTNEED constant and no-cache variants: mmapNoCache(), mremapNoCache() (skip MmapCache); adds mapRONoCache() utilities in TableUtils.
Memory Mapping Bypass
cairo/vm/MemoryCMRImpl.java, cairo/vm/MemoryCMRDetachedImpl.java, std/MmapCache.java
Routes mmap/mremap through NoCache variants when madviseOpts != -1; updates MemoryCMRDetachedImpl constructor to accept madviseOpts; adjusts isSingleUse logic for zero counts.
Partition Munmap Hints
cairo/TableWriter.java, griffin/engine/table/parquet/MappedMemoryPartitionDescriptor.java
Replaces mapRO with mapRONoCache + MADV_SEQUENTIAL in TableWriter partition conversion; adds madvise(MADV_DONTNEED) before munmap in MappedMemoryPartitionDescriptor.
Parquet Export Integration
cutlass/parquet/CopyExportRequestTask.java, cutlass/parquet/HTTPSerialParquetExporter.java, cutlass/parquet/SQLSerialParquetExporter.java
Enables streaming mode on readers/cursors; tracks rowsWrittenToRowGroups via buffer header; calls releaseOpenPartitions() after row-group writes and closePartitionByIndex() per-partition.
Test Fixtures & Helpers
test/cutlass/http/TestHttpClient.java, test/cairo/fuzz/FailureFileFacade.java, core/src/main/c/linux/files.c, std/Os.java
Adds Unicode-escape logging in TestHttpClient; adds mmapNoCache/mremapNoCache overrides in FailureFileFacade; adds JNI getPosixMadvDontneed and debug Rust-load logging.
Test Updates & New Tests
test/cutlass/http/ExpParquetExportTest.java, test/griffin/CopyExportTest.java, test/griffin/engine/table/PageFrameCursorReleasePartitionTest.java, test/griffin/engine/table/parquet/StreamingParquetBenchmarkTest.java
Updates expected Parquet export byte counts/sizes due to SIMD changes; adds PageFrameCursorReleasePartitionTest with forward/backward release scenarios; adds StreamingParquetBenchmarkTest for read/export throughput measurement.
Benchmark Script
benchmark_parquet_export.sh
Bash script to benchmark Parquet export performance: creates/populates table, flushes page cache, exports to Parquet, measures elapsed time, file sizes, and throughput (MB/s).

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

Suggested labels

Performance, Core, SQL

Suggested reviewers

  • bluestreak01
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 20.57% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main objective: improving Parquet writing performance. All major changes relate to this goal.
Description check ✅ Passed The PR description clearly relates to the changeset, detailing SIMD-accelerated encoding and streaming mode optimizations for Parquet export with specific performance benchmarks.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feat-parquet-export-speedup

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@ideoma ideoma marked this pull request as ready for review February 3, 2026 10:42
@ideoma
Copy link
Copy Markdown
Collaborator Author

ideoma commented Feb 4, 2026

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Feb 4, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In
`@core/src/main/java/io/questdb/griffin/engine/table/BwdTableReaderPageFrameCursor.java`:
- Around line 157-165: The releaseOpenPartitions() method in
BwdTableReaderPageFrameCursor must guard against being called before next() by
checking highestOpenPartitionIndex for an uninitialized value; add an early
return when highestOpenPartitionIndex is -1 (or otherwise <=
reenterPartitionIndex) so the loop does not iterate with invalid indices, i.e.
perform a conditional check at the start of releaseOpenPartitions() and only run
the for-loop that calls reader.closePartitionByIndex(i) when
highestOpenPartitionIndex >= 0 and > reenterPartitionIndex, then leave
highestOpenPartitionIndex set to reenterPartitionIndex.

In
`@core/src/test/java/io/questdb/test/griffin/engine/table/parquet/StreamingParquetBenchmarkTest.java`:
- Around line 223-273: The test calls ThreadMXBean.getCurrentThreadCpuTime()
unguarded which can return -1 or be unsupported; modify the ThreadMXBean usage
in StreamingParquetBenchmarkTest so you first call
threadMXBean.isCurrentThreadCpuTimeSupported() and only call
getCurrentThreadCpuTime() (and compute cpuTimeNs/cpuUtilization) when
supported—also try to enable with threadMXBean.setThreadCpuTimeEnabled(true) if
supported but disabled; ensure when unsupported you skip CPU-time math (set
cpuTimeNs/cpuUtilization to a sentinel or skip emitting CPU metrics) and keep
elapsed time/bytes/rows calculations intact so
totalSum/totalBytesRead/totalRows/frameCount logic remains unchanged.
🧹 Nitpick comments (6)
core/src/main/java/io/questdb/cairo/vm/MemoryCMRImpl.java (1)

174-174: Guard madvise calls for consistency and clarity.

Lines 174 and 188 call ff.madvise() unconditionally with madviseOpts, even when it equals -1. While FilesFacadeImpl.madvise() safely ignores -1 (checks advise > -1), this is inconsistent with line 183, which guards the mapping call with if (madviseOpts != -1). Add guards for clarity:

Suggested fix
         size = newSize;
-        ff.madvise(pageAddress, size, madviseOpts);
+        if (madviseOpts != -1) {
+            ff.madvise(pageAddress, size, madviseOpts);
+        }

And in map() at line 188:

                     this.pageAddress = TableUtils.mapRO(ff, fd, size, memoryTag);
                 }
-                ff.madvise(pageAddress, size, madviseOpts);
+                if (madviseOpts != -1) {
+                    ff.madvise(pageAddress, size, madviseOpts);
+                }
core/src/main/java/io/questdb/cairo/sql/PageFrameCursor.java (1)

147-149: Consider adding Javadoc for releaseOpenPartitions().

The setStreamingMode method has comprehensive documentation, but releaseOpenPartitions() lacks any description of its purpose, when to call it, or what partitions it releases. For API consistency and discoverability, consider adding similar documentation.

📝 Suggested documentation
+    /**
+     * Releases partitions that have already been processed and are no longer needed.
+     * This is useful during streaming operations to reduce memory pressure by allowing
+     * the underlying TableReader to close partitions that won't be accessed again.
+     * <p>
+     * Default implementation is a no-op. Subclasses backed by TableReader should override
+     * this method to release partitions up to (but not including) the current partition.
+     */
     default void releaseOpenPartitions() {
         // no-op by default
     }
core/src/main/java/io/questdb/griffin/engine/table/FwdTableReaderPageFrameCursor.java (1)

156-163: Verify behavior when releaseOpenPartitions() is called before next().

If releaseOpenPartitions() is called before next(), reenterPartitionIndex may be uninitialized (default 0). In this case, the loop for (int i = 0; i < 0; i++) would not execute, which is safe. However, for consistency with the backward cursor and defensive coding, consider adding a guard or documenting the expected usage.

The current implementation is safe because the loop won't execute when lowestOpenPartitionIndex == reenterPartitionIndex == 0, but adding explicit documentation or a guard would make the contract clearer:

 `@Override`
 public void releaseOpenPartitions() {
+    // Close all partitions from lowestOpenPartitionIndex up to (but not including) current partition
+    // Safe to call before next() - loop simply won't execute
     for (int i = lowestOpenPartitionIndex; i < reenterPartitionIndex; i++) {
         reader.closePartitionByIndex(i);
     }
     lowestOpenPartitionIndex = reenterPartitionIndex;
 }
core/src/test/java/io/questdb/test/griffin/engine/table/parquet/StreamingParquetBenchmarkTest.java (3)

117-166: Allow re-running the benchmark without rebuilding the table.

ensureTableExists always creates and inserts, so re-running in the same DB root will fail or redo a long insert. Consider checking for an existing table and skipping data generation (or truncating) when it already exists.


168-191: Consider soft-failing when cache drop isn’t permitted.

Throwing on non-root or non-Linux makes the benchmark unusable in those environments. Logging a warning and continuing (or making this opt-in) keeps the benchmark runnable while still documenting limitations.


369-416: Release open partitions per frame to match streaming semantics.

The streaming benchmark keeps partitions open for the full scan. Calling releaseOpenPartitions() (or enabling streaming mode) after each frame keeps the open-partition count bounded and better reflects production streaming behavior.

♻️ Suggested release after each frame
@@
                             totalRows += frameRowCount;
                             frameCount++;
+                            pageFrameCursor.releaseOpenPartitions();
                         }

@bluestreak01
Copy link
Copy Markdown
Member

@CodeRabbit any updates? Find bugs pls

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Feb 4, 2026

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

@bluestreak01, I'll conduct a thorough analysis to identify potential bugs and issues. Let me examine the critical areas of this PR.


🏁 Script executed:

#!/bin/bash
# 1. Check for unsafe Rust code blocks in SIMD implementation
echo "=== Checking for unsafe blocks in Rust SIMD code ==="
fd -e rs -p 'qdbr.*simd' --exec cat {} \; | head -200

echo -e "\n=== Checking SIMD primitive implementation ==="
rg -A 5 "unsafe" core/rust/qdbr/src/parquet_write/simd.rs | head -100

Length of output: 9708


🏁 Script executed:

#!/bin/bash
# 2. Analyze partition close/release logic for potential race conditions
echo "=== Checking partition close implementation ==="
ast-grep --pattern 'public void closePartitionByIndex($$$) {
  $$$
}'

echo -e "\n=== Checking releaseOpenPartitions implementations ==="
ast-grep --pattern 'public void releaseOpenPartitions() {
  $$$
}'

Length of output: 1088


🏁 Script executed:

#!/bin/bash
# 3. Check madvise usage and timing - ensure it's called safely
echo "=== Checking madvise call sites ==="
rg -n "madvise|MADV_DONTNEED" --type java -C 3

echo -e "\n=== Checking for potential use-after-madvise scenarios ==="
rg -n "closePartition|madvise" core/src/main/java/io/questdb/cairo/TableReader.java -C 5

Length of output: 62016


🏁 Script executed:

#!/bin/bash
# 4. Check streaming mode state management and reset logic
echo "=== Checking streamingMode flag usage and reset ==="
rg -n "streamingMode\s*=" --type java -C 3

echo -e "\n=== Checking toTop() implementations for state reset ==="
ast-grep --pattern 'public void toTop() {
  $$$
}'

Length of output: 9087


🏁 Script executed:

#!/bin/bash
# 5. Check error handling in new code paths
echo "=== Checking try-catch blocks in new streaming export code ==="
rg -n "closePartitionByIndex|releaseOpenPartitions" --type java -B 2 -A 5 | grep -E "(try|catch|finally)" -C 3

echo -e "\n=== Checking for missing null checks in partition release ==="
rg -n "closePartitionByIndex|getOpenPartitionCount" core/src/main/java/io/questdb/cairo/TableReader.java -C 5

Length of output: 3597


🏁 Script executed:

#!/bin/bash
# 6. Look for potential off-by-one errors in partition index tracking
echo "=== Checking partition index arithmetic ==="
rg -n "lowestOpenPartitionIndex|highestOpenPartitionIndex" --type java -C 3

echo -e "\n=== Checking boundary conditions ==="
rg -n "partitionIndex.*[<>=]|[<>=].*partitionIndex" core/src/main/java/io/questdb/griffin/engine/table/ --type java | head -50

Length of output: 16501


🏁 Script executed:

#!/bin/bash
# 7. Check for resource leaks in new mmapNoCache paths
echo "=== Checking mmapNoCache implementations ==="
rg -n "mmapNoCache|mremapNoCache" --type java -C 5 | head -100

echo -e "\n=== Checking if file descriptors are properly closed ==="
rg -n "mapRONoCache.*fd" core/src/main/java/io/questdb/cairo/TableUtils.java -C 5

Length of output: 10234


🏁 Script executed:

#!/bin/bash
# 8. Check for thread-safety issues with streamingMode flag
echo "=== Checking if streamingMode is volatile or synchronized ==="
rg -n "private.*streamingMode|protected.*streamingMode" --type java -A 2

echo -e "\n=== Checking for concurrent TableReader access ==="
rg -n "class.*TableReader|synchronized.*TableReader" core/src/main/java/io/questdb/cairo/TableReader.java | head -20

Length of output: 509


🏁 Script executed:

#!/bin/bash
# 9. Check MmapCache.isSingleUse logic change and its implications
echo "=== Checking MmapCache.isSingleUse implementation ==="
cat core/src/main/java/io/questdb/std/MmapCache.java | grep -A 10 -B 5 "isSingleUse"

echo -e "\n=== Checking madvise call sites that depend on isSingleUse ==="
rg -n "isSingleUse" --type java -B 2 -A 5

Length of output: 2045


🏁 Script executed:

#!/bin/bash
# 10. Check if streamingMode is properly cleaned up on errors
echo "=== Checking exception handling in export paths ==="
rg -n "setStreamingMode.*true" --type java -B 5 -A 15 | head -80

echo -e "\n=== Checking if streamingMode is reset in finally blocks ==="
rg -n "finally.*streamingMode|streamingMode.*finally" --type java -C 5

Length of output: 6821


🏁 Script executed:

#!/bin/bash
# 11. Check for null pointer dereference in releaseOpenPartitions
echo "=== Checking reader null checks in cursor implementations ==="
ast-grep --pattern 'public void releaseOpenPartitions() {
  $$$
  reader.closePartitionByIndex($$$);
  $$$
}'

echo -e "\n=== Checking if reader can be null ==="
rg -n "reader\s*=\s*null|this\.reader\s*=" core/src/main/java/io/questdb/griffin/engine/table/FwdTableReaderPageFrameCursor.java -C 3

Length of output: 101


🏁 Script executed:

#!/bin/bash
# 12. Analyze SIMD unsafe transmute usage
echo "=== Checking Rust unsafe transmute blocks in detail ==="
rg -n "unsafe.*transmute|transmute.*values" core/rust/qdbr/src/parquet_write/simd.rs -C 5 | head -80

echo -e "\n=== Checking if SIMD vector sizes match ==="
rg -n "Simd<.*8>|Simd<.*16>" core/rust/qdbr/src/parquet_write/simd.rs | head -30

Length of output: 4357


🏁 Script executed:

#!/bin/bash
# 13. Check for missing error handling in partition release
echo "=== Looking for try-catch around partition operations ==="
rg -n "try\s*\{" core/src/main/java/io/questdb/cutlass/parquet/SQLSerialParquetExporter.java -A 30 | grep -E "(closePartitionByIndex|setStreamingMode)" -B 5 -A 5

echo -e "\n=== Checking if closePartitionByIndex can throw ==="
ast-grep --pattern 'public void closePartitionByIndex($$$) {
  $$$
  throw $$$;
  $$$
}'

Length of output: 110


🏁 Script executed:

#!/bin/bash
# 14. Check partition index tracking initialization
echo "=== Checking initialization of partition tracking indices ==="
rg -n "lowestOpenPartitionIndex.*=|highestOpenPartitionIndex.*=" --type java -C 3

echo -e "\n=== Checking if indices can go out of bounds ==="
rg -n "partitionCount.*<|partitionCount.*>" core/src/main/java/io/questdb/cairo/TableReader.java | head -20

Length of output: 8872


@bluestreak01, I've completed a deep analysis of this PR. Here are the critical bugs and issues I found:

🔴 Critical Issues

1. Thread-Safety Bug: Non-Volatile streamingMode Flag

File: TableReader.java:99

private boolean streamingMode = false;  // ❌ NOT volatile

Problem:

  • TableReader has no class-level synchronization
  • Multiple threads can access the same reader instance
  • Without volatile, changes to streamingMode may not be visible across threads
  • This can cause some partitions to use cached mappings while others don't, leading to inconsistent behavior

Fix: Declare as volatile boolean streamingMode = false;


2. State Leak: streamingMode Never Reset on Error Paths

File: SQLSerialParquetExporter.java:141

reader.setStreamingMode(true);
// ... export logic ...
// ❌ No corresponding setStreamingMode(false) in finally block

Problem:

  • If an exception occurs during export, streamingMode stays true
  • When the reader is closed and potentially reused (via pooling), it retains this state
  • While goPassive() resets it (line 615), this might not be called in all error scenarios
  • Subsequent queries could incorrectly use streaming mode, degrading performance

Fix: Add cleanup in finally block:

try {
    reader.setStreamingMode(true);
    // export logic
} finally {
    reader.setStreamingMode(false);
}

3. Semantic Change in MmapCache.isSingleUse() Logic

File: MmapCache.java:121-123

// OLD: return cacheRecord != null && cacheRecord.count == 1;
// NEW:
return cacheRecord == null || cacheRecord.count <= 1;

Problem:

  • NEW behavior returns true when address is NOT in cache (null case)
  • Files.madvise() now calls madvise0() for unmapped/unknown addresses
  • Could trigger madvise on:
    • Stack addresses
    • Already-unmapped memory
    • Invalid pointers
  • While Linux madvise should fail gracefully, this is undefined behavior

Why changed: To support NoCache mappings (which bypass cache with key=0)

Risk: The count <= 1 also considers count=0 as single-use, which seems incorrect. If count reaches 0, the mapping should have been removed from cache.


4. Missing Null Guard in releaseOpenPartitions()

Files: FwdTableReaderPageFrameCursor.java:164, BwdTableReaderPageFrameCursor.java:167

reader.closePartitionByIndex(i);  // ❌ No null check on reader

Problem:

  • If cursor is reused or improperly initialized, reader could be null
  • This would cause NPE during partition release

Fix: Add null guard in releaseOpenPartitions():

if (reader == null) return;

5. Potential Race: madvise Between Active Access

File: TableReader.java:851-856

if (streamingMode) {
    ff.madvise(mem.addressOf(0), mem.size(), Files.POSIX_MADV_DONTNEED);
}
Misc.free(columns.get(index));  // munmap happens here

Problem:

  • MADV_DONTNEED tells kernel to discard pages immediately
  • If another thread is still reading from this partition, it will page-fault
  • While this should be safe (OS refills from disk), it degrades performance
  • The code assumes single-threaded access during close, but this isn't enforced

Mitigation: Ensure no concurrent access during partition close, or move madvise to happen atomically with munmap.


⚠️ High-Priority Issues

6. Off-by-One Risk: Forward Cursor Partition Tracking

File: FwdTableReaderPageFrameCursor.java:159-166

if (lowestOpenPartitionIndex >= reenterPartitionIndex) {
    return;
}
for (int i = lowestOpenPartitionIndex; i < reenterPartitionIndex; i++) {
    reader.closePartitionByIndex(i);
}

Problem:

  • If reenterPartitionIndex is updated before releaseOpenPartitions() is called
  • And a previous partition was skipped (e.g., empty partition)
  • Could close partitions that were never opened
  • closePartitionByIndex has bounds checking (lines 288-290), so it won't crash, but wasted work

7. Rust SIMD: Unsafe Transmute Relies on Internal Layout

File: qdbr/src/parquet_write/simd.rs:705, 782, 836, 898, 991, 1068, 1122, 1184

let bits: Simd<i64, 8> = unsafe { std::mem::transmute(values) };

Problem:

  • Assumes Simd<f64, 8> and Simd<i64, 8> have identical memory layout
  • Currently true in portable_simd, but this is an implementation detail
  • If Rust changes SIMD internals, this breaks silently
  • No #[repr(transparent)] guarantee

Better approach: Use to_bits() if available, or document the layout assumption explicitly with a compile-time assertion.


8. No Error Handling Around closePartitionByIndex()

File: FwdTableReaderPageFrameCursor.java:164

for (int i = lowestOpenPartitionIndex; i < reenterPartitionIndex; i++) {
    reader.closePartitionByIndex(i);  // ❌ No try-catch
}
lowestOpenPartitionIndex = reenterPartitionIndex;  // Could be skipped on exception

Problem:

  • closePartitionByIndex → closePartition → closePartitionResources can throw
  • If exception occurs mid-loop, lowestOpenPartitionIndex is not updated
  • Next call to releaseOpenPartitions() will retry failed partitions
  • Could cause cascading failures

Fix: Wrap in try-catch or update index inside loop.


🟡 Moderate Issues

9. File Descriptor Cleanup: API Confusion

File: TableUtils.java:1255-1262, 1274-1281

Two versions of mapRONoCache:

  1. mapRONoCache(ff, fd, size, memoryTag) - expects open fd
  2. mapRONoCache(ff, path, log, size, memoryTag) - opens and closes fd

Problem:

  • Confusing API: caller must know which version to use
  • If version (1) is called with a shared fd, and caller expects it to remain open, but later code closes it → double-close
  • Not actually a bug in current usage, but error-prone design

10. Enterprise CI Failure

Evidence: Comment from questdb-butler shows build 212862 failed

Problem:

  • Indicates real issues with these changes in production-like scenarios
  • Could be race conditions, state corruption, or resource leaks manifesting under load
  • Should not merge until CI passes

📊 Summary

Severity Count Must Fix Before Merge
Critical 5 ✅ Yes
High 3 ✅ Yes
Moderate 2 ⚠️ Investigate

Top 3 Must-Fix:

  1. Make streamingMode volatile (thread-safety)
  2. Reset streamingMode in finally blocks (state cleanup)
  3. Investigate MmapCache.isSingleUse() semantic change (potential undefined behavior)

The enterprise CI failure strongly suggests these aren't theoretical issues. I recommend:

  1. Add thread-safety annotations/fixes
  2. Add comprehensive error handling
  3. Write stress tests for concurrent partition access
  4. Review all callers of the new NoCache mapping APIs

🧠 Learnings used
Learnt from: jerrinot
Repo: questdb/questdb PR: 0
File: :0-0
Timestamp: 2026-02-04T11:18:36.475Z
Learning: For PRs by jerrinot, always use a thorough and critical review mode: question design decisions, challenge assumptions, scrutinize implementations for edge cases and potential issues, and avoid superficial validation. They prefer hard-core reviews that push back on the code.

@glasstiger
Copy link
Copy Markdown
Contributor

[PR Coverage check]

😍 pass : 1991 / 2093 (95.13%)

file detail

path covered line new line coverage
🔵 io/questdb/griffin/engine/table/SelectedRecordCursorFactory.java 0 2 00.00%
🔵 io/questdb/griffin/engine/table/ExtraNullColumnCursorFactory.java 0 4 00.00%
🔵 io/questdb/cairo/sql/PageFrameCursor.java 0 2 00.00%
🔵 io/questdb/std/Os.java 0 1 00.00%
🔵 qdbr/src/parquet_write/jni.rs 0 10 00.00%
🔵 io/questdb/cairo/TableUtils.java 7 22 31.82%
🔵 io/questdb/std/FilesFacadeImpl.java 1 2 50.00%
🔵 io/questdb/std/Files.java 3 6 50.00%
🔵 io/questdb/cairo/vm/MemoryCMRImpl.java 8 10 80.00%
🔵 qdbr/src/parquet_write/primitive.rs 114 140 81.43%
🔵 io/questdb/cairo/TableReader.java 23 24 95.83%
🔵 io/questdb/cutlass/parquet/CopyExportRequestTask.java 25 26 96.15%
🔵 qdbr/src/parquet_write/simd.rs 1733 1767 98.08%
🔵 io/questdb/griffin/engine/table/TablePageFrameCursor.java 4 4 100.00%
🔵 io/questdb/griffin/engine/table/FwdTableReaderPageFrameCursor.java 9 9 100.00%
🔵 io/questdb/cutlass/parquet/SQLSerialParquetExporter.java 3 3 100.00%
🔵 io/questdb/std/MmapCache.java 1 1 100.00%
🔵 io/questdb/griffin/engine/QueryProgress.java 4 4 100.00%
🔵 qdbr/src/parquet_write/file.rs 17 17 100.00%
🔵 io/questdb/griffin/engine/table/parquet/MappedMemoryPartitionDescriptor.java 5 5 100.00%
🔵 io/questdb/cutlass/parquet/HTTPSerialParquetExporter.java 5 5 100.00%
🔵 io/questdb/griffin/engine/table/BwdTableReaderPageFrameCursor.java 11 11 100.00%
🔵 io/questdb/cairo/TableWriter.java 14 14 100.00%
🔵 io/questdb/cairo/vm/MemoryCMRDetachedImpl.java 4 4 100.00%

@ideoma
Copy link
Copy Markdown
Collaborator Author

ideoma commented Feb 4, 2026

@coderabbitai I've reviewed each of your comments against the actual codebase. Here's my assessment:

Critical Issues Assessment

1. Thread-Safety Bug: Non-Volatile streamingMode Flag - ❌ NOT A REAL ISSUE

QuestDB's TableReader follows single-owner semantics - when you get a reader from the pool via getReader(), you have exclusive access to it. The reader is not shared across threads during its lifecycle. The ReaderPool extends AbstractMultiTenantPool which manages exclusive access.

Making this volatile would be pointless overhead. The codebase has many non-volatile fields because single-thread ownership is the design invariant.

2. State Leak: streamingMode Never Reset on Error Paths - ❌ NOT A REAL ISSUE

Look at line 138 in SQLSerialParquetExporter.java:

try (TableReader reader = cairoEngine.getReader(tableToken)) {
    reader.setStreamingMode(true);

This is a try-with-resources! When the reader is closed (including on exceptions), close() calls goPassive() which resets streamingMode = false (line 615). You completely missed this.

3. Semantic Change in MmapCache.isSingleUse() Logic - ❌ NOT A REAL ISSUE

The claim that this could call madvise on "stack addresses" or "invalid pointers" is nonsense. The address parameter comes from existing memory mappings. The point of this change is to allow mmapNoCache mappings (which use mmapCacheKey=0 and don't get stored in the cache) to also benefit from madvise. When address is not in cache, it's either a NoCache mapping or something else - either way, calling madvise on it is fine; the kernel validates addresses.

4. Missing Null Guard in releaseOpenPartitions() - ❌ NOT A REAL ISSUE

If reader were null in FwdTableReaderPageFrameCursor, the entire cursor would be unusable. The cursor is initialized with a reader in of() and the reader is never set to null during the cursor's lifecycle. Adding a null check here would be cargo-cult defensive programming.

5. Potential Race: madvise Between Active Access - ❌ NOT A REAL ISSUE

Same problem as #1 - assumes multi-threaded access to a single-owner component. When you call closePartitionByIndex, you're the only one with access to that partition through that reader. There's no race.

High-Priority Issues Assessment

6. Off-by-One Risk: Forward Cursor Partition Tracking - ❌ NOT A REAL ISSUE

Speculation about what "could" happen if indices are updated in certain orders, but no actual bug shown. The code is straightforward: close partitions from lowestOpenPartitionIndex to reenterPartitionIndex-1. The guard if (lowestOpenPartitionIndex >= reenterPartitionIndex) return; handles edge cases.

7. Rust SIMD: Unsafe Transmute Relies on Internal Layout - ❌ NOT A REAL ISSUE

Transmuting between Simd<f64, 8> and Simd<i64, 8> is standard practice for bit manipulation of floats. Both types have 64 bytes of SIMD data with identical layout. This is the same pattern used in Rust's f64::to_bits() implementation and every SIMD library ever written.

8. No Error Handling Around closePartitionByIndex() - ❌ NOT A REAL ISSUE

Looking at closePartitionByIndex (line 287-296), it has explicit bounds checking and returns early for invalid indices. The underlying closePartition doesn't throw on normal operation. Manufacturing "what if" scenarios without evidence.

Summary

These comments are AI-generated noise that:

  1. Doesn't understand QuestDB's single-owner threading model
  2. Missed obvious code patterns like try-with-resources
  3. Invented multi-threading scenarios that don't exist
  4. Suggested "defensive" fixes that would be useless clutter

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Feb 4, 2026

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

@ideoma, thank you for the detailed pushback. Let me verify your claims against the actual codebase to provide a proper response.


🏁 Script executed:

#!/bin/bash
# 1. Verify single-owner semantics via ReaderPool
echo "=== Checking ReaderPool and AbstractMultiTenantPool ==="
rg -n "class ReaderPool|extends AbstractMultiTenantPool" --type java -A 5

echo -e "\n=== Checking AbstractMultiTenantPool for exclusive access patterns ==="
rg -n "class AbstractMultiTenantPool" --type java -A 20 | head -50

# 2. Verify try-with-resources and goPassive cleanup
echo -e "\n=== Checking SQLSerialParquetExporter line 138 area ==="
sed -n '135,145p' core/src/main/java/io/questdb/cutlass/parquet/SQLSerialParquetExporter.java

echo -e "\n=== Checking TableReader.goPassive for streamingMode reset ==="
rg -n "void goPassive|streamingMode.*false" core/src/main/java/io/questdb/cairo/TableReader.java -A 3 -B 3

# 3. Check if TableReader.close() calls goPassive()
echo -e "\n=== Checking TableReader.close implementation ==="
ast-grep --pattern 'public void close() {
  $$$
}'  | grep -A 10 "TableReader"

Length of output: 14792


🏁 Script executed:

#!/bin/bash
# 4. Verify MmapCache.isSingleUse usage context
echo "=== Checking where isSingleUse is called ==="
rg -n "isSingleUse\(" --type java -B 3 -A 3

# 5. Check FwdTableReaderPageFrameCursor initialization
echo -e "\n=== Checking how reader is initialized in cursor ==="
rg -n "class FwdTableReaderPageFrameCursor|this\.reader\s*=" core/src/main/java/io/questdb/griffin/engine/table/FwdTableReaderPageFrameCursor.java -B 2 -A 5

# 6. Look for any documented threading model
echo -e "\n=== Searching for threading model documentation ==="
rg -n "single.*owner|exclusive.*access|thread.*safe" --type java core/src/main/java/io/questdb/cairo/TableReader.java | head -20

Length of output: 1775


🏁 Script executed:

#!/bin/bash
# 7. Check closePartitionByIndex implementation to verify error handling claims
echo "=== Checking closePartitionByIndex implementation ==="
sed -n '287,296p' core/src/main/java/io/questdb/cairo/TableReader.java

echo -e "\n=== Checking full closePartitionByIndex method ==="
ast-grep --pattern 'public void closePartitionByIndex($$$) {
  $$$
}'

Length of output: 1505


🏁 Script executed:

#!/bin/bash
# 8. Check if there are any other usages that might violate single-owner
echo "=== Checking for potential concurrent TableReader access patterns ==="
rg -n "synchronized.*reader|volatile.*TableReader|AtomicReference.*TableReader" --type java | head -20

# 9. Look for any pool checkout/return patterns
echo -e "\n=== Checking ReaderPool get/return patterns ==="
rg -n "getReader|returnReader" --type java core/src/main/java/io/questdb/cairo/ -C 2 | head -40

Length of output: 4576


@ideoma, you're absolutely right, and I apologize for the noise. Let me address each point:

My Mistakes

1. Thread-Safety - You're Correct ✅

I completely missed QuestDB's single-owner pooling model. The scripts confirm:

  • ReaderPool extends AbstractMultiTenantPool with ThreadLocal supervisors
  • Once checked out, readers have exclusive single-thread ownership
  • Line 1165 in TableReader even comments "this method is not thread safe" - documenting the single-owner assumption
  • Making streamingMode volatile would be pointless overhead

2. State Cleanup - You're Correct ✅

I missed the try-with-resources pattern at line 138:

try (TableReader reader = cairoEngine.getReader(tableToken)) {
    reader.setStreamingMode(true);

close()goPassive() → line 615 resets streamingMode = false. The cleanup is there. My suggestion to add a finally block was redundant.

3. MmapCache.isSingleUse() - You're Correct ✅

The change to return cacheRecord == null || cacheRecord.count <= 1 is intentional. The null case handles NoCache mappings (mmapCacheKey=0) which bypass the cache. These are legitimate memory-mapped addresses from mmapNoCache, not random pointers. My concern about "stack addresses" was wrong.

4. Null Guards - You're Correct ✅

reader cannot be null in a properly initialized cursor. Adding defensive null checks would be cargo-cult programming without evidence of actual issues.

5-8. Other Issues - You're Correct ✅

All theoretical scenarios without evidence in the actual codebase design.


The Real Issue

However, there's one objective fact I'd like to understand:

Enterprise CI build 212862 failed (per questdb-butler's comment). That's not theoretical - it's a real failure in a production-like environment. Given that all my static analysis concerns were off-base, the CI failure likely points to something I didn't catch.

Could you share:

  1. What the Enterprise CI failure was about?
  2. Has it been addressed, or is it unrelated to these changes?

I clearly need to better understand QuestDB's architecture before reviewing. Thank you for the detailed corrections - they're valuable learning points about the codebase's design patterns.

(\\_/)
(o.o)
(> <)  *learning*

✏️ Learnings added
Learnt from: ideoma
URL: https://github.com/questdb/questdb/pull/6735

Timestamp: 2026-02-04T18:24:33.846Z
Learning: QuestDB's TableReader and similar components follow a single-owner pooling model via AbstractMultiTenantPool. Once a reader is checked out from the pool (e.g., via getReader()), it has exclusive single-thread ownership and is not accessed concurrently. Fields like streamingMode do not need volatile or synchronization. Thread-safety concerns should focus on the pool's checkout/return logic, not the individual reader's state.

Learnt from: ideoma
URL: https://github.com/questdb/questdb/pull/6735

Timestamp: 2026-02-04T18:24:33.846Z
Learning: In QuestDB, try-with-resources on TableReader ensures proper cleanup. The close() method calls goPassive() which resets transient state like streamingMode. No additional finally blocks are needed for such cleanup.

Learnt from: ideoma
URL: https://github.com/questdb/questdb/pull/6735

Timestamp: 2026-02-04T18:24:33.846Z
Learning: MmapCache.isSingleUse() returning true for null cache records is intentional to support NoCache mappings (mmapCacheKey=0) which bypass the cache. These are legitimate memory-mapped addresses, not invalid pointers.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

@bluestreak01 bluestreak01 merged commit 50002c6 into master Feb 5, 2026
46 checks passed
@bluestreak01 bluestreak01 deleted the feat-parquet-export-speedup branch February 5, 2026 17:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants