
fix(http): prevent parquet export corruption under concurrent connections#6723

Merged
bluestreak01 merged 16 commits into master from jh_parquet_export_fix
Feb 4, 2026

Conversation

@jerrinot
Contributor

@jerrinot jerrinot commented Jan 29, 2026

Concurrent parquet exports could produce corrupted or truncated files when multiple clients exported simultaneously through the same HTTP worker. Affected clients would receive invalid parquet data that failed to open in downstream tools.

Implementation details for reviewers:
ExportQueryProcessor is instantiated per-worker, but stored per-connection state (currentContext, serialParquetExporter) in per-processor fields. When a worker parked connection A via PeerIsSlowToReadException and began serving connection B, these fields were overwritten. On resume, the onWrite() callback would use the wrong connection context, writing parquet chunks to the wrong response.

Solution: Move both fields to per-connection ExportQueryProcessorState.

TODO:

  • Explore how to eliminate the capturing lambda. Done
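The fix described above can be sketched as follows. This is a minimal illustration, not QuestDB's actual code: the names ExportState, Exporter, and WriteCallback are hypothetical stand-ins for ExportQueryProcessorState, HTTPSerialParquetExporter, and the write callback. The key point is that the exporter and its callback live on the per-connection state object, and the lambda captures that state, so chunks can only ever land in the owning connection's response.

```java
// Hypothetical sketch: per-connection state owns the exporter and the
// write callback, so a worker switching between connections cannot
// clobber another connection's in-flight export.
interface WriteCallback {
    void onWrite(long dataPtr, long dataLen);
}

class Exporter {
    private final WriteCallback callback;

    Exporter(WriteCallback callback) {
        this.callback = callback;
    }

    void exportChunk(long ptr, long len) {
        callback.onWrite(ptr, len);
    }
}

class ExportState {
    final StringBuilder response = new StringBuilder();
    private Exporter exporter;

    // Lazily create the exporter; the lambda captures THIS state, so
    // chunks always go to the response of the owning connection,
    // regardless of which connection the worker is currently serving.
    Exporter getOrCreateExporter() {
        if (exporter == null) {
            exporter = new Exporter((ptr, len) ->
                    response.append("chunk[").append(ptr).append(',').append(len).append(']'));
        }
        return exporter;
    }
}
```

Under the old layout, the exporter and context were fields on the shared per-worker processor, so serving connection B between two chunks of connection A redirected A's remaining chunks to B's response.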

@coderabbitai

coderabbitai bot commented Jan 29, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Walkthrough

Refactors HTTP parquet export flow to use per-request state-managed exporters and callbacks (moving wiring into ExportQueryProcessorState), migrates per-connection circuit-breaker/SQL-context creation into HttpConnectionContext, updates multiple processors to obtain contexts on demand, and adds a concurrent parquet export test.

Changes

Cohort / File(s) Summary
Parquet export core
core/src/main/java/io/questdb/cutlass/http/processors/ExportQueryProcessor.java, core/src/main/java/io/questdb/cutlass/http/processors/ExportQueryProcessorState.java
Removes direct StreamWriteParquetCallBack implementation and exporter fields from processor; moves HTTPSerialParquetExporter lazy-init, write-callback wiring, and lifecycle into ExportQueryProcessorState; replaces onWrite with state-aware writeParquetData and uses state.getOrCreateSerialParquetExporter(engine).
HTTP connection & selector plumbing
core/src/main/java/io/questdb/cutlass/http/HttpConnectionContext.java, core/src/main/java/io/questdb/cutlass/http/HttpRequestProcessorSelector.java, core/src/main/java/io/questdb/cutlass/http/HttpRequestProcessor.java, core/src/main/java/io/questdb/cutlass/http/HttpServer.java
Adds per-connection lazy circuit-breaker and SQL-execution-context creators/accessors; replaces resumeProcessor with handler-id resume mechanism, exposes selector last-selected handler id and ID-based resolution, and documents processor threading/lifecycle semantics.
Processors: JSON / SQL validation / Export consumers
core/src/main/java/io/questdb/cutlass/http/processors/JsonQueryProcessor.java, core/src/main/java/io/questdb/cutlass/http/processors/JsonQueryProcessorState.java, core/src/main/java/io/questdb/cutlass/http/processors/SqlValidationProcessor.java, core/src/main/java/io/questdb/cutlass/http/processors/JsonQueryProcessorConfiguration.java
Replace per-instance circuit-breaker/SqlExecutionContext fields with per-call retrieval from HttpConnectionContext; introduce sharedWorkerCount usage; add default QueryFutureUpdateListener in configuration; update state factories and method signatures to use connection-provided contexts.
SqlExecutionContext changes
core/src/main/java/io/questdb/griffin/SqlExecutionContextImpl.java
Adds mutable QueryFutureUpdateListener field with getter/setter to allow injection of a listener instance.
Tests and test builders
core/src/test/java/io/questdb/test/cutlass/http/ConcurrentParquetExportTest.java, core/src/test/java/io/questdb/test/cutlass/http/HttpQueryTestBuilder.java, core/src/test/java/io/questdb/test/cutlass/http/HttpServerConfigurationBuilder.java
Adds ConcurrentParquetExportTest (multi-threaded parquet export validation); updates test builders to accept/query future update listener wiring and to construct processors using shared worker counts.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

Bug, REST API

Suggested reviewers

  • bluestreak01
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely describes the main change: fixing parquet export corruption under concurrent connections by moving state from processor fields to per-connection state.
Description check ✅ Passed The pull request description clearly identifies the root cause of concurrent parquet export corruption and explains the solution of moving per-connection state from processor fields to ExportQueryProcessorState.


@jerrinot jerrinot marked this pull request as draft January 29, 2026 10:57
@jerrinot
Contributor Author

Hello @coderabbitai, please review this. Focus on concurrency control, and consider the lifecycle of connections, the HTTP connection context, HTTP request processors, and states.

@coderabbitai

coderabbitai bot commented Jan 29, 2026

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash

# Fetch the PR diff to see what changed
gh pr view 6723 --json files --jq '.files[] | "\(.path): +\(.additions) -\(.deletions)"'

# Get the actual diff
gh pr diff 6723

# Find ExportQueryProcessor and related files
fd -e java "ExportQueryProcessor"

# Find ExportQueryProcessorState
fd -e java "ExportQueryProcessorState"

Length of output: 17338


🏁 Script executed:

#!/bin/bash

# Read the full ExportQueryProcessor to check for other potential concurrency issues
cat core/src/main/java/io/questdb/cutlass/http/processors/ExportQueryProcessor.java

# Read the full ExportQueryProcessorState to understand lifecycle
cat core/src/main/java/io/questdb/cutlass/http/processors/ExportQueryProcessorState.java

# Check LocalValue usage pattern
rg -A 5 -B 5 "LocalValue.*ExportQueryProcessorState" core/src/main/java/io/questdb/cutlass/http/processors/

# Look for other places where ExportQueryProcessor fields might be accessed
rg "ExportQueryProcessor\.(currentContext|serialParquetExporter)" --type java

# Check how PeerIsSlowToReadException is used
rg -A 3 -B 3 "throw.*PeerIsSlowToReadException|catch.*PeerIsSlowToReadException" core/src/main/java/io/questdb/cutlass/http/processors/ExportQueryProcessor.java

# Check HTTPSerialParquetExporter for thread safety
cat core/src/main/java/io/questdb/cutlass/parquet/HTTPSerialParquetExporter.java 2>/dev/null || echo "File not found"

# Find the interface definition for the callback
rg -A 5 "interface.*StreamWriteParquetCallBack|class.*StreamWriteParquetCallBack" --type java

Length of output: 86851


🏁 Script executed:

#!/bin/bash

# Check usage of shared mutable fields in ExportQueryProcessor
echo "=== Checking decimal128/decimal256 usage ==="
rg -n -C3 "decimal128|decimal256" core/src/main/java/io/questdb/cutlass/http/processors/ExportQueryProcessor.java | head -50

echo -e "\n=== Checking errSink usage ==="
rg -n -C3 "errSink" core/src/main/java/io/questdb/cutlass/http/processors/ExportQueryProcessor.java

echo -e "\n=== Checking sqlExecutionContext.with() calls ==="
rg -n -C2 "sqlExecutionContext\.with\(" core/src/main/java/io/questdb/cutlass/http/processors/ExportQueryProcessor.java

# Check if LocalValue provides per-connection isolation
echo -e "\n=== Checking LocalValue implementation ==="
rg -A 10 "class LocalValue" --type java

# Check if HTTPSerialParquetExporter has a close method
echo -e "\n=== Checking HTTPSerialParquetExporter lifecycle ==="
rg "class HTTPSerialParquetExporter" -A 20 --type java | head -30

Length of output: 12123


@jerrinot, I've completed a thorough review of this PR focusing on concurrency control, connection lifecycle, HTTP connection context, request processors, and state management.

✅ Core Fix - Correctly Resolves Parquet Export Corruption

The solution properly addresses the root cause:

Problem correctly identified: ExportQueryProcessor is instantiated per-worker but stored per-connection state (currentContext, serialParquetExporter) in processor-level fields. When a worker parked connection A via PeerIsSlowToReadException and began serving connection B, these fields were overwritten, causing the onWrite() callback to write parquet chunks to the wrong response.

Solution is sound: Moving both fields to per-connection ExportQueryProcessorState ensures each connection maintains its own state regardless of worker context switches.

Key improvements:

  1. Callback isolation: The lambda (dataPtr, dataLen) -> processor.writeParquetData(this, dataPtr, dataLen) properly captures the state instance, ensuring the correct connection context is used
  2. Explicit state passing: writeParquetData() now receives state as a parameter instead of relying on currentContext field
  3. Lazy initialization: serialParquetExporter is created on-demand per-connection via getOrCreateSerialParquetExporter()
  4. Proper lifecycle: State is cleared in both clear() and close() methods

Test validation: The test properly validates the fix by spawning 16 concurrent threads with small buffers (1024 bytes + random), forcing connection parking and worker context switches. Each thread validates parquet file integrity and row count matches.
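The validation strategy described above can be sketched in miniature. This is an illustrative harness, not the PR's ConcurrentParquetExportTest: the thread count, chunk count, and output format are invented for the example. The idea is the same, though: many threads export concurrently and each one asserts that its own output is intact and uncontaminated by the others.

```java
import java.util.concurrent.CountDownLatch;

// Hedged sketch of the concurrent-export test strategy: N threads
// start exporting at the same instant and each records its own output,
// which the caller then verifies per-thread.
class ConcurrentExportSketch {
    static String[] run(int threadCount) {
        String[] results = new String[threadCount];
        CountDownLatch start = new CountDownLatch(1);
        Thread[] threads = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++) {
            final int id = i;
            threads[i] = new Thread(() -> {
                try {
                    start.await(); // all threads begin exporting at once
                    StringBuilder sb = new StringBuilder();
                    for (int chunk = 0; chunk < 4; chunk++) {
                        sb.append('t').append(id).append('c').append(chunk).append(';');
                    }
                    results[id] = sb.toString();
                } catch (InterruptedException ignored) {
                    Thread.currentThread().interrupt();
                }
            });
            threads[i].start();
        }
        start.countDown();
        try {
            for (Thread t : threads) {
                t.join(120_000);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return results;
    }
}
```

In the real test, the per-thread verification is opening the returned parquet file and checking the row count; here it is reduced to a string comparison. The small-buffer trick (forcing PeerIsSlowToReadException and worker switches) has no analogue in this sketch.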

⚠️ Remaining Concurrency Issues (CSV Export Path)

While this PR fixes the parquet export corruption, the same architectural pattern exposes existing concurrency issues in the CSV export path:

1. Shared mutable decimal objects (Lines 119-120, 1210-1215)

private final Decimal128 decimal128 = new Decimal128();
private final Decimal256 decimal256 = new Decimal256();

These are used in putValue() and putDecimal128StringValue() during CSV exports. When a worker switches from connection A to connection B:

  • Connection A calls rec.getDecimal128(columnIndex, decimal128)
  • Worker parks connection A, serves connection B
  • Connection B calls rec.getDecimal128(columnIndex, decimal128) - overwrites A's data
  • Connection A resumes, uses corrupted decimal128

Impact: CSV exports with decimal columns could produce corrupted values under concurrent connections.

2. Shared error sink (Lines 122, 508-610, 1042-1044)

private final StringSink errSink = new StringSink();

Used in configureParquetOptions() for error formatting. If two connections concurrently parse invalid parquet options, error messages could be interleaved or overwritten.

Impact: Error messages sent to clients could be corrupted or belong to different connections.

Recommendation

Move decimal128, decimal256, and errSink to ExportQueryProcessorState or use thread-local storage pattern for these temporary objects.

🔍 Minor Observations

  1. Callback initialization: writeCallback is initialized lazily in copyQueryToParquetFile() at Line 635-637. This is correct but could benefit from a comment explaining why it's not initialized in the constructor (perhaps to avoid capturing processor reference prematurely).

  2. HTTPSerialParquetExporter lifecycle: The exporter doesn't implement Closeable, so setting it to null in state.close() is appropriate. However, verify that the exporter's internal sqlExecutionContext doesn't retain references that could cause memory leaks.

  3. sqlExecutionContext handling: The processor's sqlExecutionContext is correctly reset per-request via with() calls (Lines 159, 784), so this is safe despite being a processor-level field.

📋 Summary

This PR successfully fixes the critical parquet export corruption bug and the implementation is architecturally sound. The test provides strong validation. However, the same per-worker processor pattern reveals pre-existing concurrency vulnerabilities in the CSV export path that should be addressed in a follow-up PR.

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@jerrinot jerrinot marked this pull request as ready for review January 29, 2026 15:15
@jerrinot jerrinot marked this pull request as draft January 29, 2026 15:17
@jerrinot
Contributor Author

jerrinot commented Jan 29, 2026

There is another concurrency control issue in the master:

Example: HttpRequestProcessor is being executed by Worker0 and there is a PeerIsSlowToReadException. We store resumeProcessor inside the HttpConnectionContext and the I/O dispatcher waits for WRITE readiness.

Reminder: HTTP Processor instances are bound to a worker (thread). Thus the HttpConnectionContext stores an HTTP Request Processor instance which belongs to Worker0. So far so good.

Eventually, the socket is ready to accept WRITEs, so the I/O dispatcher emits an event which can be handled by ANY I/O worker. Say Worker1 picks this event up. It calls the stored resumeProcessor. But wait, this instance of the resumeProcessor belongs to Worker0! If Worker0 concurrently handles requests for the same processor coming from another socket, then we have both Worker0 and Worker1 accessing the same HTTP request processor instance concurrently! This is not great, since they both configure the SQL execution context, circuit breaker, etc.

Current solution:
I made the SQL execution context and circuit breakers local to the HTTP connection context; that is, each connection has its own SQL context and CB. This fixes the most severe concurrency issues, but it does not address the root cause: processors must be thread-safe. This defeats the purpose of creating a processor instance per worker. I see 2 options:

  1. When parking a processor we pin it to a specific worker. So it will always resume on the same worker which owns the processor instance. This is simple, but it limits scalability.
  2. Use indirection: Instead of storing a direct reference to resumeProcessor we store processor name/ID and on resume a worker resolves the ID to its local processor with this ID.

To be continued
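Option 2 above can be sketched as follows. This is an illustrative model, not QuestDB's actual API: WorkerLocalSelector and Processor are hypothetical stand-ins for the selector and processor types. The connection stores only an int handler id; on resume, whichever worker picks up the event resolves the id against its own processor table, so it never touches another worker's instance.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an HTTP request processor (NOT thread-safe).
class Processor {
    final String name;

    Processor(String name) {
        this.name = name;
    }
}

// Each worker owns one selector instance; handler ids are assigned in
// the same order on every worker, so an id recorded by Worker0 resolves
// to the corresponding (but distinct) instance on Worker1.
class WorkerLocalSelector {
    private final List<Processor> processorsById = new ArrayList<>();

    int register(Processor p) {
        processorsById.add(p);
        return processorsById.size() - 1;
    }

    // Bounds-checked resolution: returns null for an unknown id instead
    // of dereferencing a possibly invalid entry.
    Processor resolveProcessorById(int handlerId) {
        if (handlerId < 0 || handlerId >= processorsById.size()) {
            return null;
        }
        return processorsById.get(handlerId);
    }
}
```

The trade-off versus option 1 (pinning) is that any worker can resume any connection, preserving scalability, at the cost of requiring every worker's selector to register handlers in a consistent order.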

…reference to resume processor

Reasoning: HTTP Request Processors are Worker-local. If we store a reference
while running on Worker0 and later resume on Worker1, then Worker1 will
invoke a processor meant to be used by Worker0. If Worker0 concurrently
handles another socket event, then both Worker0 and Worker1 will use the same
processor instance. This violates the contract of HTTP request processors: they
are NOT thread-safe.
@jerrinot jerrinot marked this pull request as ready for review January 30, 2026 11:50
@jerrinot jerrinot marked this pull request as draft January 30, 2026 12:05
@jerrinot
Contributor Author

@coderabbitai: please do another review round. I made lifecycle changes.

Note: processors are single-threaded, so mutable errSink or decimal256 fields are OK as long as they are used as scratchpads and are not expected to survive request/resume boundaries.

@coderabbitai

coderabbitai bot commented Jan 30, 2026

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash

# Fetch the latest PR diff to see all changes
echo "=== Latest PR diff ==="
gh pr diff 6723

# Check for lifecycle-related changes in HttpConnectionContext
echo -e "\n=== Checking HttpConnectionContext changes ==="
gh pr diff 6723 -- "**/HttpConnectionContext.java" || echo "No changes to HttpConnectionContext.java in diff"

# Check for SqlExecutionContext changes
echo -e "\n=== Checking SqlExecutionContext changes ==="
gh pr diff 6723 -- "**/SqlExecutionContext*.java"

# Check for circuit breaker related changes
echo -e "\n=== Checking for circuit breaker changes ==="
gh pr diff 6723 | grep -i "circuit" -A 3 -B 3 || echo "No circuit breaker mentions in diff"

# List all changed files in the PR
echo -e "\n=== All changed files ==="
gh pr view 6723 --json files --jq '.files[].path'

Length of output: 84295


@jerrinot, excellent architectural improvements! The lifecycle changes effectively address the cross-worker concurrency issue you described. Let me walk through the solution:

✅ Core Architecture - Handler ID Resolution System

Problem solved: Previously, HttpConnectionContext stored a direct reference to resumeProcessor, which belonged to Worker0. When Worker1 picked up the WRITE event, it would access Worker0's processor instance concurrently.

Solution implemented:

  1. Store resumeHandlerId (integer) instead of processor reference
  2. Each handler is assigned a unique ID when bound to the server
  3. On resume, the current worker resolves the handler ID to its own processor instance via selector.resolveProcessorById()

This ensures Worker1 always accesses its own processor instance, never Worker0's.

✅ Per-Connection Resource Isolation

The move of circuit breaker and SQL execution context from processor-level to per-connection is critical:

HttpConnectionContext now owns: httpCircuitBreaker and httpSqlExecutionContext as instance fields, lazily initialized via getOrCreateCircuitBreaker() and getOrCreateSqlExecutionContext()

All processors changed: Removed processor-level circuitBreaker and sqlExecutionContext fields; now retrieve these from context on each request

This prevents the configuration conflicts you mentioned (e.g., Worker0 and Worker1 both configuring the same circuit breaker instance).

✅ Lifecycle Management

Initialization:

  • Circuit breaker and SQL context are created lazily on first use per connection

Cleanup:

  • In reset(): resumeHandlerId = NO_RESUME_PROCESSOR and httpCircuitBreaker.clear() if non-null
  • In close(): Both resources are freed with Misc.free()

✅ Documentation

The extensive JavaDoc added to HttpRequestProcessor clearly explains:

  • Threading model: processors are not thread-safe, one worker per connection at a time
  • Park/resume lifecycle: different workers may resume, handler ID resolution enables this
  • Instance fields: safe for scratchpad use, but don't survive park/resume boundaries

🎯 Scratchpad Fields Clarification

Thank you for clarifying that processors are guaranteed single-threaded. You're absolutely correct:

  • decimal128, decimal256, errSink are safe as temporary working memory
  • They're used within a single request processing call and don't need to survive park/resume
  • When a connection resumes on a different worker, that worker's processor has its own scratchpad instances

My previous concern about these fields was based on misunderstanding the threading guarantees. The real issue was the cross-worker processor access, which you've now solved.

🔍 Minor Observations

  1. Handler ID for RejectProcessor: Special handling with REJECT_PROCESSOR_ID = -1 constant and explicit checks when parking - this looks correct.

  2. Test coverage: The concurrent test spawns 16 threads with small buffers to force parking and worker switches - excellent validation of the fix.

  3. Backward compatibility: The resolveProcessorById() default implementation throws UnsupportedOperationException, which is appropriate for selectors that don't support resumption.

📋 Summary

The lifecycle changes successfully solve the cross-worker concurrency issue through handler ID indirection and per-connection resource ownership. The architecture now correctly enforces:

  • Processor isolation: Each worker accesses only its own processor instances
  • Connection isolation: Each connection has its own circuit breaker and SQL execution context
  • Safe resumption: Handler IDs enable connections to resume on any worker without cross-thread access

The implementation is sound and the test coverage validates the concurrent scenario. Excellent work on addressing the root cause!

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
core/src/test/java/io/questdb/test/cutlass/http/HttpQueryTestBuilder.java (1)

98-98: ⚠️ Potential issue | 🟡 Minor

Remove the unused sqlExecutionContexts field.

The field is declared at line 98 but is not referenced anywhere in the codebase. It should be removed.

🤖 Fix all issues with AI agents
In `@core/src/main/java/io/questdb/cutlass/http/HttpServer.java`:
- Around lines 425-428: resolveProcessorById currently calls
handlersByIdList.getQuick(handlerId).getProcessor(header) without validating
handlerId or the returned entry. Add bounds and null checks to avoid OOB/NPE:
first verify handlerId >= 0 && handlerId < handlersByIdList.size(), then fetch
the entry (using getQuick or get) and check it for null before calling
getProcessor(header). If validation fails, return null (or an appropriate
no-op/error value consistent with surrounding code) instead of dereferencing a
possibly invalid entry.

In
`@core/src/main/java/io/questdb/cutlass/http/processors/ExportQueryProcessor.java`:
- Around lines 1054-1067: the current timeout parsing in ExportQueryProcessor
overrides the per-format default when the provided timeout is <= 0. Update the
logic to preserve the format-specific default: compute a defaultTimeout based
on exportModel.isParquetFormat() (using configuration.getExportTimeout() for
parquet, or, for CSV, retrieving the NetworkSqlExecutionCircuitBreaker via
context.getOrCreateCircuitBreaker(engine) and using
circuitBreaker.getDefaultMaxTime()). Then, when parsing
request.getUrlParam(URL_PARAM_TIMEOUT) with Numbers.parseLong(...), set
state.timeout = parsed * 1000, and if parsed <= 0 set state.timeout =
defaultTimeout instead. This ensures ExportQueryProcessor keeps the parquet vs
CSV defaults.
🧹 Nitpick comments (2)
core/src/test/java/io/questdb/test/cutlass/http/ConcurrentParquetExportTest.java (1)

114-116: Consider a parallel join with total timeout.

The sequential join(120_000) per thread means worst-case 32 minutes if all threads hang. A common pattern is to track remaining time across joins:

long deadline = System.currentTimeMillis() + 120_000;
for (Thread thread : threads) {
    long remaining = deadline - System.currentTimeMillis();
    if (remaining > 0) {
        thread.join(remaining);
    }
}

This is a minor suggestion since the test should complete quickly under normal conditions.

core/src/main/java/io/questdb/cutlass/http/processors/ExportQueryProcessorState.java (1)

138-145: Verify HTTPSerialParquetExporter lifecycle cleanup.

close() nulls the exporter without freeing it. If it owns native buffers/threads or implements Closeable, this can leak; please confirm and free it via Misc.free(...).

♻️ Suggested cleanup (if Closeable)
-        serialParquetExporter = null;
+        serialParquetExporter = Misc.free(serialParquetExporter);
#!/bin/bash
# Inspect whether HTTPSerialParquetExporter implements Closeable or owns resources.
rg -n "class HTTPSerialParquetExporter" -g '*.java' -C2
rg -n "HTTPSerialParquetExporter" -g '*.java' -C2

@jerrinot jerrinot marked this pull request as ready for review February 2, 2026 08:51
@glasstiger
Contributor

[PR Coverage check]

😍 pass : 143 / 148 (96.62%)

file detail

path covered line new line coverage
🔵 io/questdb/cutlass/http/HttpConnectionContext.java 29 33 87.88%
🔵 io/questdb/cutlass/http/processors/ExportQueryProcessor.java 26 27 96.30%
🔵 io/questdb/cutlass/http/HttpRequestProcessorSelector.java 1 1 100.00%
🔵 io/questdb/cutlass/http/processors/SqlValidationProcessor.java 6 6 100.00%
🔵 io/questdb/cutlass/http/processors/JsonQueryProcessor.java 16 16 100.00%
🔵 io/questdb/griffin/SqlExecutionContextImpl.java 4 4 100.00%
🔵 io/questdb/cutlass/http/processors/JsonQueryProcessorConfiguration.java 1 1 100.00%
🔵 io/questdb/cutlass/http/processors/JsonQueryProcessorState.java 6 6 100.00%
🔵 io/questdb/cutlass/http/HttpServer.java 37 37 100.00%
🔵 io/questdb/cutlass/http/processors/ExportQueryProcessorState.java 17 17 100.00%

@questdb-butler

⚠️ Enterprise CI Failed

The enterprise test suite failed for this PR.

Build: View Details
Tested Commit: 87730ff842d7f96a014d6eef5853c8eef89628bf

Please investigate the failure before merging.

@bluestreak01 bluestreak01 merged commit 769456e into master Feb 4, 2026
43 of 44 checks passed
@bluestreak01 bluestreak01 deleted the jh_parquet_export_fix branch February 4, 2026 17:23
@freddyrios

freddyrios commented Mar 5, 2026

There is another concurrency control issue in the master:

Example: HttpRequestProcessor is being executed by Worker0 and there is a PeerIsSlowToReadException. We store resumeProcessor inside the HttpConnectionContext and the I/O dispatcher waits for WRITE readiness.

Reminder: HTTP Processor instances are bound to a worker (thread). Thus the HttpConnectionContext stores an HTTP Request Processor instance which belongs to Worker0. So far so good.

Eventually, the socket is ready to accept WRITEs, so the I/O dispatcher emits an event which can be handled by ANY I/O worker. Say Worker1 picks this event up. It calls the stored resumeProcessor. But wait, this instance of the resumeProcessor belongs to Worker0! If Worker0 concurrently handles requests for the same processor coming from another socket, then we have both Worker0 and Worker1 accessing the same HTTP request processor instance concurrently! This is not great, since they both configure the SQL execution context, circuit breaker, etc.

Current solution: I made the SQL execution context and circuit breakers local to the HTTP connection context; that is, each connection has its own SQL context and CB. This fixes the most severe concurrency issues, but it does not address the root cause: processors must be thread-safe. This defeats the purpose of creating a processor instance per worker. I see 2 options:

  1. When parking a processor we pin it to a specific worker. So it will always resume on the same worker which owns the processor instance. This is simple, but it limits scalability.
  2. Use indirection: Instead of storing a direct reference to resumeProcessor we store processor name/ID and on resume a worker resolves the ID to its local processor with this ID.

To be continued

@jerrinot Was this other issue affecting other types of queries not related to parquet? E.g. I have a case where queries checking TTL values seem to have mixed the results of different tables (run from different connections):
select ttlValue from tables() where table_name = 'mytable' (there was maxing out of connections + timeouts in the mix)

EDIT: the behavior I observed was 100% unrelated.

@jerrinot
Contributor Author

jerrinot commented Mar 7, 2026

hello @freddyrios, what's your conclusion? is it a QuestDB bug or something in your clients?

@freddyrios

freddyrios commented Mar 7, 2026

hello @freddyrios, what's your conclusion? is it a QuestDB bug or something in your clients?

@jerrinot 100% on our side. It was a bug in our code; we were caught by surprise by ttlValue being reported in different units than the ones used to set it, e.g. if one sets a TTL of 21 days, it is reported in weeks, so it returns 3. The maxing out of connections & timeouts was due to using more parallelization than we should have for the operations that had the bug, combined with those operations being heavier than intended.
