Skip to content

fix(pgwire): fix protocol corruption when sending arrays from server to client#6455

Merged
bluestreak01 merged 13 commits intomasterfrom
vi_pg_fix
Nov 27, 2025
Merged

fix(pgwire): fix protocol corruption when sending arrays from server to client#6455
bluestreak01 merged 13 commits intomasterfrom
vi_pg_fix

Conversation

@bluestreak01
Copy link
Copy Markdown
Member

@bluestreak01 bluestreak01 commented Nov 26, 2025

Ent PR: https://github.com/questdb/questdb-enterprise/pull/790

the problem was reproducible with this python script:

from questdb.ingress import Sender, TimestampNanos
import numpy as np
import psycopg

conf = "http::addr=localhost:9000;username=admin;password=quest;"
i = 0
with Sender.from_conf(conf) as sender:
    while i < 10000:
        sender.row(
            'ob',
            symbols={'symbol': 'XXX' },
            columns={
                'bids': np.array(
                [
                    [1.0850, 600000],
                    [1.0849, 300000],
                    [1.0848, 150000]
                ],
                dtype=np.float64
            ),
            'asks': np.array(
                [
                    [1.0853, 500000],
                    [1.0854, 250000],
                    [1.0855, 125000]
                ],
                dtype=np.float64
            )},
            at=TimestampNanos.now())
        sender.flush()
        i+=1


conn = psycopg.connect(
    host='localhost',
    port=8812,
    user='admin',
    password='quest',
    dbname='qdb',
    autocommit=True  # Important for QuestDB
)



batchs = []
with conn.cursor(binary=True) as cur:
    # Execute a query that might return many rows
    cur.execute("SELECT * FROM Ob") #works with LIMIT 5

    batch_size = 10
    total_processed = 0

    while True:
        batch = cur.fetchmany(batch_size)
        
        if not batch:
            break
        batchs += batch
        total_processed += len(batch)
        # if total_processed % 10000 == 0:
        print(f"Processed {total_processed} rows so far...")

    print(f"Finished processing {total_processed} total rows")

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Nov 26, 2025

Important

Review skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

This PR introduces binary and text mode testing for psycopg3 connections, refactors PGWire array column binary size calculations with sentinel-based header management, updates test expectations with consistent spacing, and removes legacy WAL skip logic from PGArraysTest while adding queue draining.

Changes

Cohort / File(s) Summary
Python Test Runner Mode Support
compat/src/test/python/runner_psycopg3.py
Added binary parameter to run_test() function. Main test runner now loops over binary and text modes for psycopg connections, running each test in both modes with mode labeling and autocommit enabled.
YAML Test Case Formatting
compat/src/test/resources/test_cases.yaml
Reformatted YAML list and record representations to include consistent spacing around elements (e.g., [1][ 1 ], ["BTC-USD", "buy", ...][ "BTC-USD", "buy", ... ]). No semantic changes to test logic or data values.
PGWire Array Resend Logic
core/src/main/java/io/questdb/cutlass/pgwire/PGPipelineEntry.java
Introduced sentinel value -1 for outResendArrayFlatIndex to indicate unsent array header. Updated initialization, reset paths in close() and clear(), and header-send condition in outColBinArr from == 0 to == -1. Adjusted column tail size calculation to compute effective resend indices per column.
PGWire Utility Array Sizing Refactor
core/src/main/java/io/questdb/cutlass/pgwire/PGUtils.java
Refactored array column binary size calculation. Renamed calculateArrayColBinSize() to calculateArrayColBinSizeIncludingHeader(). Added new public method calculateArrayResumeColBinSize() for resume-only sizing. Adjusted internal calculateColumnBinSize() logic to compute actual resume point, remaining elements, and not-null count. Refactored countNotNull() and estimateColumnTxtSize() to use switch expressions. Added private helper calculateArrayHeaderSize().
PGWire Array Tests WAL Handling
core/src/test/java/io/questdb/test/cutlass/pgwire/PGArraysTest.java
Removed skipOnWalRun() calls throughout test methods and deleted the helper method. Added drainWalQueue() invocations following test setup and insert paths to flush WAL state.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • PGUtils.java: Dense refactoring with new public API methods and revised array sizing logic requiring careful verification of edge cases (arrayResumePoint, notNullCount, header inclusion semantics)
  • PGPipelineEntry.java: Sentinel value semantics (-1 vs. 0) and state transitions for array header resend logic require tracing through header send/reset paths
  • PGArraysTest.java: Verify that drainWalQueue() placement correctly flushes WAL state without masking unrelated test issues; confirm removal of skipOnWalRun() is intentional
  • runner_psycopg3.py: Validate that binary and text mode loops execute independently and connection cleanup is correct

Suggested labels

Bug

Suggested reviewers

  • bluestreak01

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 9.38% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main change: fixing protocol corruption when sending arrays from server to client, which aligns with the core changes in PGPipelineEntry.java and PGUtils.java.
Description check ✅ Passed The description provides a reproducible Python script demonstrating the protocol corruption issue when sending arrays via psycopg with binary=True cursor, directly relating to the changeset's purpose.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

before it was conflated with header sending
skipOnWal() was designed for tests that don't touch physical tables,
where running in both WAL and non-WAL modes would be redundant.

However, since WAL mode is now randomized, a test marked with
skipOnWal() may never execute if the randomizer picks WAL mode.
@jerrinot
Copy link
Copy Markdown
Contributor

@coderabbitai review please

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Nov 26, 2025

@jerrinot: I'll review the changes in this PR that fix the protocol corruption issue when sending arrays from server to client.

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
core/src/main/java/io/questdb/cutlass/pgwire/PGUtils.java (1)

236-247: Remove duplicate comments.

Lines 239-240 and 244-246 contain duplicated "take rough upper estimate based on the string length" comments. This appears to be a copy-paste artifact.

             case ColumnType.STRING -> {
                 final CharSequence strValue = record.getStrA(columnIndex);
-                // take rough upper estimate based on the string length
                 yield strValue == null ? Integer.BYTES : Integer.BYTES + 3L * strValue.length();
-                // take rough upper estimate based on the string length
             }
             case ColumnType.SYMBOL -> {
                 final CharSequence symValue = record.getSymA(columnIndex);
-                // take rough upper estimate based on the string length
                 yield symValue == null ? Integer.BYTES : Integer.BYTES + 3L * symValue.length();
-                // take rough upper estimate based on the string length
             }
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9968b7c and 51db7ba.

📒 Files selected for processing (5)
  • compat/src/test/python/runner_psycopg3.py (2 hunks)
  • compat/src/test/resources/test_cases.yaml (25 hunks)
  • core/src/main/java/io/questdb/cutlass/pgwire/PGPipelineEntry.java (5 hunks)
  • core/src/main/java/io/questdb/cutlass/pgwire/PGUtils.java (4 hunks)
  • core/src/test/java/io/questdb/test/cutlass/pgwire/PGArraysTest.java (8 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-04T09:16:27.366Z
Learnt from: jerrinot
Repo: questdb/questdb PR: 5996
File: core/src/test/java/io/questdb/test/cutlass/http/line/LineHttpSenderTest.java:350-356
Timestamp: 2025-08-04T09:16:27.366Z
Learning: In QuestDB ILP tests, maximum dimension values like `(1 << 28) - 1` are intentionally hard-coded instead of using constants like `ArrayView.DIM_MAX_LEN` to serve as "canaries" that will cause test failures if the API limits change, ensuring developers are aware of breaking changes to the ILP API.

Applied to files:

  • core/src/test/java/io/questdb/test/cutlass/pgwire/PGArraysTest.java
🧬 Code graph analysis (2)
compat/src/test/python/runner_psycopg3.py (3)
compat/src/test/python/runner_asyncpg.py (1)
  • run_test (98-158)
compat/src/test/python/runner_psycopg2.py (1)
  • run_test (98-131)
core/rust/qdb-sqllogictest/sqllogictest-engines/src/external.rs (1)
  • connect (68-82)
core/src/main/java/io/questdb/cutlass/pgwire/PGPipelineEntry.java (1)
core/src/main/java/io/questdb/cutlass/pgwire/PGUtils.java (1)
  • PGUtils (40-307)
🪛 Ruff (0.14.5)
compat/src/test/python/runner_psycopg3.py

156-156: Possible hardcoded password assigned to argument: "password"

(S106)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (35)
  • GitHub Check: New pull request (SelfHosted Running tests with cover on linux-other)
  • GitHub Check: New pull request (SelfHosted Running tests with cover on linux-pgwire)
  • GitHub Check: New pull request (SelfHosted Running tests with cover on linux-cairo-sub)
  • GitHub Check: New pull request (SelfHosted Running tests with cover on linux-cairo-root)
  • GitHub Check: New pull request (SelfHosted Running tests with cover on linux-fuzz2)
  • GitHub Check: New pull request (SelfHosted Running tests with cover on linux-fuzz1)
  • GitHub Check: New pull request (SelfHosted Running tests with cover on linux-griffin-sub)
  • GitHub Check: New pull request (SelfHosted Running tests with cover on linux-griffin-root)
  • GitHub Check: New pull request (SelfHosted Other tests on linux-arm64)
  • GitHub Check: New pull request (SelfHosted Other tests on linux-x86-graal)
  • GitHub Check: New pull request (SelfHosted Griffin tests on linux-x86-graal)
  • GitHub Check: New pull request (SelfHosted Other tests on linux-x64-zfs)
  • GitHub Check: New pull request (Hosted Running tests on windows-other-2)
  • GitHub Check: New pull request (Hosted Running tests on windows-other-1)
  • GitHub Check: New pull request (Hosted Running tests on windows-pgwire)
  • GitHub Check: New pull request (Hosted Running tests on windows-cairo-2)
  • GitHub Check: New pull request (Hosted Running tests on windows-cairo-1)
  • GitHub Check: New pull request (Hosted Running tests on windows-fuzz2)
  • GitHub Check: New pull request (Hosted Running tests on windows-fuzz1)
  • GitHub Check: New pull request (Hosted Running tests on windows-griffin-sub)
  • GitHub Check: New pull request (Hosted Running tests on windows-griffin-base)
  • GitHub Check: New pull request (Hosted Running tests on mac-other)
  • GitHub Check: New pull request (Hosted Running tests on mac-pgwire)
  • GitHub Check: New pull request (Hosted Running tests on mac-cairo-fuzz)
  • GitHub Check: New pull request (Hosted Running tests on mac-cairo)
  • GitHub Check: New pull request (SelfHosted Griffin tests on linux-x64-zfs)
  • GitHub Check: New pull request (Hosted Running tests on mac-griffin)
  • GitHub Check: New pull request (SelfHosted Cairo tests on linux-x64-zfs)
  • GitHub Check: New pull request (Rust Test and Lint on linux-jdk17)
  • GitHub Check: New pull request (SelfHosted Griffin tests on linux-arm64)
  • GitHub Check: New pull request (SelfHosted Cairo tests on linux-x86-graal)
  • GitHub Check: New pull request (SelfHosted Cairo tests on linux-arm64)
  • GitHub Check: New pull request (Check Changes Check changes)
  • GitHub Check: build
  • GitHub Check: build
🔇 Additional comments (13)
compat/src/test/python/runner_psycopg3.py (2)

99-103: LGTM! Clean implementation of binary mode support.

The addition of the binary parameter to run_test and its propagation to cursor creation correctly enables dual-mode testing.


141-161: Dual-mode testing implementation is well-structured.

The nested loop structure correctly runs all tests in both binary and text modes, with proper connection isolation per iteration. Creating a new connection for each test iteration ensures clean test state.

Regarding the hardcoded password at line 156: while static analysis flags this, it's acceptable in test code using QuestDB's default credentials.

compat/src/test/resources/test_cases.yaml (1)

5-625: LGTM! Consistent formatting improves readability.

The changes throughout this file are purely cosmetic, adding consistent spacing to array and list elements. No test semantics or expected values have been altered.

core/src/test/java/io/questdb/test/cutlass/pgwire/PGArraysTest.java (1)

98-98: LGTM! Consistent WAL handling across tests.

The drainWalQueue() calls are correctly placed after data modifications (inserts/updates) and before queries. This ensures WAL data is flushed before reading, enabling these tests to run correctly in both WAL and non-WAL modes rather than being skipped.

Also applies to: 117-117, 366-366, 388-388, 417-418, 472-473, 520-522, 597-598, 619-620, 636-637, 694-694

core/src/main/java/io/questdb/cutlass/pgwire/PGUtils.java (4)

61-73: LGTM! Clean separation of header and resume size calculations.

The new API correctly separates:

  • calculateArrayColBinSizeIncludingHeader for initial array serialization
  • calculateArrayResumeColBinSize for resumed partial sends

The nullCount derivation cardinality - notNullCount is correct.


170-180: Correct sentinel-based size calculation.

The logic properly handles the three states:

  • -1: Initial send - includes header size, counts all elements
  • 0: After header sent - no header, resume from element 0
  • >0: Partial resume - no header, resume from element N

Math.max(0, arrayResumePoint) correctly normalizes -1 to 0 for element counting while preserving the header inclusion check at line 176.


259-265: Header size calculation matches serialization.

The formula correctly accounts for:

  • 4 fixed Integer.BYTES fields (size, dim count, has-nulls flag, component type)
  • 2 × Integer.BYTES per dimension (length + lower bound)

This matches the serialization in PGPipelineEntry.outColBinArr.


189-197: Clean refactor to switch expression.

The logic is preserved while improving readability. The explicit AssertionError for unsupported element types is appropriate given the assertion at lines 167-169 that limits to DOUBLE/LONG.

core/src/main/java/io/questdb/cutlass/pgwire/PGPipelineEntry.java (5)

173-173: Critical fix: Sentinel -1 distinguishes "header not sent" from "resume at element 0".

This is the core fix for the protocol corruption. Previously, 0 was overloaded to mean both "header not sent yet" and "resume from element 0", causing the header to be re-sent on buffer overflow recovery. The new -1 sentinel unambiguously indicates "header not yet written."


1077-1090: Correct per-column resume point handling.

The logic properly applies outResendArrayFlatIndex only to the column being resumed (outResendColumnIndex), while other array columns use -1 to indicate they haven't been started (need full send with header).


1721-1772: Correct header/element separation with proper state transitions.

The flow is now:

  1. -1 → send header, bookmark, transition to 0
  2. >= 0 → send elements, incrementing index after each with bookmark
  3. On successful completion → reset to -1
  4. On NoSpaceLeftInResponseBufferException → resetToBookmark preserves last good state

The size calculation at line 1729 correctly subtracts Integer.BYTES because the size field value excludes its own size (standard PG wire protocol).


309-312: State reset in close() is complete.

The reset of outResendArrayFlatIndex to -1 at line 309 is part of the complete state cleanup. The corresponding per-record resets (lines 2884-2886) and per-array reset (line 1767) ensure consistent state during normal operation, while close() handles pooling cleanup.


1714-1719: LGTM!

Null array handling (dimCount == 0) correctly short-circuits before any header/element processing, leaving outResendArrayFlatIndex unchanged. This is appropriate since null arrays require no resend state.

@jerrinot
Copy link
Copy Markdown
Contributor

an unrelated failure in a MV test - testSingleTablePeriodView()

@glasstiger
Copy link
Copy Markdown
Contributor

[PR Coverage check]

😞 fail : 24 / 50 (48.00%)

file detail

path covered line new line coverage
🔵 io/questdb/cutlass/pgwire/PGUtils.java 16 42 38.10%
🔵 io/questdb/cutlass/pgwire/PGPipelineEntry.java 8 8 100.00%

@bluestreak01 bluestreak01 merged commit 4cce031 into master Nov 27, 2025
42 of 43 checks passed
@bluestreak01 bluestreak01 deleted the vi_pg_fix branch November 27, 2025 17:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants