Skip to content

fix(sql): fix intermittent failures with SQL window functions#6030

Merged
bluestreak01 merged 9 commits intomasterfrom
jh_racy_window_function_factories
Aug 7, 2025
Merged

fix(sql): fix intermittent failures with SQL window functions#6030
bluestreak01 merged 9 commits intomasterfrom
jh_racy_window_function_factories

Conversation

@jerrinot
Copy link
Copy Markdown
Contributor

@jerrinot jerrinot commented Aug 6, 2025

Summary

SQL queries using window functions could intermittently fail or produce incorrect results when multiple connections executed them concurrently. This PR fixes the race condition causing these failures.

Root Cause

Window function factories were incorrectly designed to be stateful - they stored execution context as instance fields:

  • AbstractWindowFunctionFactory.checkWindowParameter() would save WindowContext and window frame boundaries (rowsLo, rowsHi) as instance fields
  • Since QuestDB uses a single shared instance of each function factory, concurrent SQL compilation would overwrite each other's parameters
  • This caused race conditions where one query's window context would be replaced by another's mid-compilation

Fix

Made window function factories stateless. Removed all mutable state from AbstractWindowFunctionFactory.

Fixes #5989

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Aug 6, 2025

Important

Review skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

This change refactors window function factories to centralize window parameter validation and state management within the WindowContext abstraction. The checkWindowParameter method and related fields are removed from the abstract base class, and all affected window function factories now retrieve and validate window context directly. Additional validation logic is added to WindowContextImpl, and new tests are introduced.

Changes

Cohort / File(s) Change Summary
Abstract Factory Cleanup
core/src/main/java/io/questdb/griffin/engine/functions/window/AbstractWindowFunctionFactory.java
Removed protected fields (rowsHi, rowsLo, windowContext) and the checkWindowParameter method, eliminating parameter validation from the abstract base class.
Window Function Factories Refactor
core/src/main/java/io/questdb/griffin/engine/functions/window/AvgDoubleWindowFunctionFactory.java,
core/src/main/java/io/questdb/griffin/engine/functions/window/CountConstWindowFunctionFactory.java,
core/src/main/java/io/questdb/griffin/engine/functions/window/CountFunctionFactoryHelper.java,
core/src/main/java/io/questdb/griffin/engine/functions/window/FirstValueDoubleWindowFunctionFactory.java,
core/src/main/java/io/questdb/griffin/engine/functions/window/LastValueDoubleWindowFunctionFactory.java,
core/src/main/java/io/questdb/griffin/engine/functions/window/MaxDoubleWindowFunctionFactory.java,
core/src/main/java/io/questdb/griffin/engine/functions/window/MinDoubleWindowFunctionFactory.java,
core/src/main/java/io/questdb/griffin/engine/functions/window/SumDoubleWindowFunctionFactory.java
Replaced usage of the removed checkWindowParameter with explicit retrieval and validation of WindowContext from SqlExecutionContext. Updated logic to use window frame boundaries and context from WindowContext. Updated method signatures where necessary to pass WindowContext.
WindowContext Interface and Implementation
core/src/main/java/io/questdb/griffin/engine/window/WindowContext.java,
core/src/main/java/io/questdb/griffin/engine/window/WindowContextImpl.java
Added new validate(int, boolean) method to the interface and implemented comprehensive validation logic in the implementation. Introduced nulls handling fields and updated related getters. Adjusted logic for exclusion and frame boundary calculations.
Testing
core/src/test/java/io/questdb/test/griffin/engine/window/WindowFunctionTest.java
Added two test cases for window frame exclusion logic, verifying correct behavior for last_value(j) with exclusion of the current row in the frame.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant SqlExecutionContext
    participant WindowFunctionFactory
    participant WindowContext

    Client->>WindowFunctionFactory: newInstance(...)
    WindowFunctionFactory->>SqlExecutionContext: getWindowContext()
    SqlExecutionContext-->>WindowFunctionFactory: WindowContext
    WindowFunctionFactory->>WindowContext: validate(position, supportNullsDesc)
    WindowContext-->>WindowFunctionFactory: (throws SqlException if invalid)
    WindowFunctionFactory->>WindowContext: getRowsLo()/getRowsHi()/etc.
    WindowFunctionFactory-->>Client: Function instance
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~40 minutes

Assessment against linked issues

Objective Addressed Explanation
Fix window function error and assertion failure in PG Wire demo query caused by incorrect window parameter validation and handling (#5989)
Centralize and strengthen window parameter validation for window function factories (#5989)
Add regression tests for window frame exclusion and edge cases (#5989)

Assessment against linked issues: Out-of-scope changes

No out-of-scope changes detected. All code changes are directly related to window function validation, state management, and regression testing as required by the linked issue.

✨ Finishing Touches
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch jh_racy_window_function_factories

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

‼️ IMPORTANT
Auto-reply has been disabled for this repository in the CodeRabbit settings. The CodeRabbit bot will not respond to your replies unless it is explicitly tagged.

  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai generate unit tests to generate unit tests for this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@jerrinot jerrinot marked this pull request as draft August 6, 2025 13:42
@jerrinot jerrinot changed the title fix(sql): intermittent failures with SQL window function fix(sql): fix intermittent failures with SQL window functions Aug 6, 2025
@jerrinot jerrinot marked this pull request as ready for review August 6, 2025 16:46
@jerrinot jerrinot requested a review from kafka1991 August 6, 2025 16:47
@bluestreak01
Copy link
Copy Markdown
Member

@CodeRabbit please review

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Aug 6, 2025

@bluestreak01 I'll review the changes in this PR that fixes the race condition with SQL window functions.

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🔭 Outside diff range comments (1)
core/src/main/java/io/questdb/griffin/engine/functions/window/LastValueDoubleWindowFunctionFactory.java (1)

1764-1771: Incorrect constant used when detecting an unbounded lower bound in the execution-plan printer

rowLo is compared to Long.MAX_VALUE, but an unbounded preceding lower bound is encoded as Long.MIN_VALUE everywhere else in the engine.
This makes toPlan() emit “unbounded” only for the impossible case of Long.MAX_VALUE, while the valid UNBOUNDED PRECEDING path prints an absolute number instead.

-            if (rowLo == Long.MAX_VALUE) {
+            if (rowLo == Long.MIN_VALUE) {
                 sink.val("unbounded");
             } else {
                 sink.val(Math.abs(rowLo));
             }
🧹 Nitpick comments (3)
core/src/test/java/io/questdb/test/griffin/engine/window/WindowFunctionTest.java (1)

2252-2267: Test logic is correct but consider extracting to separate method.

The window function test correctly validates the "exclude current row" behavior. The expected results properly reflect that when the current row is excluded from the frame, the function returns the last value from the remaining rows in the frame (or null when the frame becomes empty).

However, this inline test addition within an existing method makes the test harder to maintain and understand.

Consider extracting this test case to a separate method for better organization:

+@Test
+public void testLastValueWithCurrentRowExclusion() throws Exception {
+    assertMemoryLeak(() -> {
+        execute("create table tab (ts timestamp, i long, j long, d double, s symbol, c VARCHAR) timestamp(ts)");
+        execute("insert into tab select x::timestamp, x/4, case when x % 3 = 0 THEN NULL ELSE x%5 END, x%5, 'k' || (x%5) ::symbol, 'k' || x from long_sequence(7)");
+        
+        assertQueryNoLeakCheck(
+            "ts\ti\tj\tlast_value\n" +
+                "1970-01-01T00:00:00.000001Z\t0\t1\tnull\n" +
+                "1970-01-01T00:00:00.000002Z\t0\t2\t1.0\n" +
+                "1970-01-01T00:00:00.000003Z\t0\tnull\t2.0\n" +
+                "1970-01-01T00:00:00.000004Z\t1\t4\tnull\n" +
+                "1970-01-01T00:00:00.000005Z\t1\t0\t4.0\n" +
+                "1970-01-01T00:00:00.000006Z\t1\tnull\t0.0\n" +
+                "1970-01-01T00:00:00.000007Z\t1\t2\tnull\n",
+            "select ts, i, j,\n" +
+                "last_value(j) over (partition by i order by ts rows between 2 preceding and current row exclude current row)\n" +
+                "from tab;",
+            "ts",
+            false,
+            true
+        );
+    });
+}
core/src/main/java/io/questdb/griffin/engine/functions/window/LastValueDoubleWindowFunctionFactory.java (2)

2039-2045: Minor wording glitch in generated execution plan

toPlan() currently outputs “preceding between and”, which contains a duplicated “between” token and does not match SQL grammar.

-                sink.val(Math.abs(rowsLo));
-                sink.val(" preceding between and ");
+                sink.val(Math.abs(rowsLo));
+                sink.val(" preceding and ");

98-110: Large duplication between NULL-handling branches

generateIgnoreNullsFunction() and generateRespectNullsFunction() are ~90 % identical, differing only in the concrete leaf classes they instantiate.
Consider extracting the common decision tree into a single private helper that takes functional interfaces (or an enum) for the two variants.
This will:

• cut ~700 lines of boilerplate
• reduce the chance of future divergences between the two code paths
• simplify testing and maintenance.

Also applies to: 284-296

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 55b71a4 and 8d50b89.

📒 Files selected for processing (12)
  • core/src/main/java/io/questdb/griffin/engine/functions/window/AbstractWindowFunctionFactory.java (0 hunks)
  • core/src/main/java/io/questdb/griffin/engine/functions/window/AvgDoubleWindowFunctionFactory.java (2 hunks)
  • core/src/main/java/io/questdb/griffin/engine/functions/window/CountConstWindowFunctionFactory.java (2 hunks)
  • core/src/main/java/io/questdb/griffin/engine/functions/window/CountFunctionFactoryHelper.java (1 hunks)
  • core/src/main/java/io/questdb/griffin/engine/functions/window/FirstValueDoubleWindowFunctionFactory.java (4 hunks)
  • core/src/main/java/io/questdb/griffin/engine/functions/window/LastValueDoubleWindowFunctionFactory.java (4 hunks)
  • core/src/main/java/io/questdb/griffin/engine/functions/window/MaxDoubleWindowFunctionFactory.java (2 hunks)
  • core/src/main/java/io/questdb/griffin/engine/functions/window/MinDoubleWindowFunctionFactory.java (2 hunks)
  • core/src/main/java/io/questdb/griffin/engine/functions/window/SumDoubleWindowFunctionFactory.java (2 hunks)
  • core/src/main/java/io/questdb/griffin/engine/window/WindowContext.java (3 hunks)
  • core/src/main/java/io/questdb/griffin/engine/window/WindowContextImpl.java (6 hunks)
  • core/src/test/java/io/questdb/test/griffin/engine/window/WindowFunctionTest.java (2 hunks)
💤 Files with no reviewable changes (1)
  • core/src/main/java/io/questdb/griffin/engine/functions/window/AbstractWindowFunctionFactory.java
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: in doublearrayaccessfunctionfactory.java, overflow validation for array index casting from long to i...
Learnt from: mtopolnik
PR: questdb/questdb#5997
File: core/src/main/java/io/questdb/griffin/engine/functions/array/DoubleArrayAccessFunctionFactory.java:127-131
Timestamp: 2025-07-31T08:21:17.390Z
Learning: In DoubleArrayAccessFunctionFactory.java, overflow validation for array index casting from long to int is performed in the validateIndexArgs() method rather than at each individual cast site. The validateIndexArgs() method is called early in newInstance() to validate all index arguments, making subsequent casts safe.

Applied to files:

  • core/src/main/java/io/questdb/griffin/engine/functions/window/MinDoubleWindowFunctionFactory.java
  • core/src/main/java/io/questdb/griffin/engine/functions/window/SumDoubleWindowFunctionFactory.java
  • core/src/main/java/io/questdb/griffin/engine/functions/window/AvgDoubleWindowFunctionFactory.java
  • core/src/main/java/io/questdb/griffin/engine/functions/window/MaxDoubleWindowFunctionFactory.java
  • core/src/main/java/io/questdb/griffin/engine/functions/window/CountConstWindowFunctionFactory.java
  • core/src/main/java/io/questdb/griffin/engine/window/WindowContextImpl.java
  • core/src/main/java/io/questdb/griffin/engine/functions/window/FirstValueDoubleWindowFunctionFactory.java
📚 Learning: in questdb ilp tests, maximum dimension values like `(1 << 28) - 1` are intentionally hard-coded ins...
Learnt from: jerrinot
PR: questdb/questdb#5996
File: core/src/test/java/io/questdb/test/cutlass/http/line/LineHttpSenderTest.java:350-356
Timestamp: 2025-08-04T09:16:27.366Z
Learning: In QuestDB ILP tests, maximum dimension values like `(1 << 28) - 1` are intentionally hard-coded instead of using constants like `ArrayView.DIM_MAX_LEN` to serve as "canaries" that will cause test failures if the API limits change, ensuring developers are aware of breaking changes to the ILP API.

Applied to files:

  • core/src/main/java/io/questdb/griffin/engine/functions/window/MinDoubleWindowFunctionFactory.java
🔇 Additional comments (16)
core/src/main/java/io/questdb/griffin/engine/functions/window/SumDoubleWindowFunctionFactory.java (2)

47-47: Import addition aligns with the refactoring objectives.

The import of WindowContext is necessary for the new stateless validation approach that eliminates race conditions.


76-83: Excellent refactoring to eliminate race conditions.

This change successfully addresses the race condition issue described in the PR objectives by:

  1. Obtaining WindowContext from SqlExecutionContext instead of relying on shared factory state
  2. Using the centralized validate() method for parameter validation
  3. Extracting frame boundaries directly from the context rather than accessing instance fields

This makes the window function factory stateless and thread-safe for concurrent SQL compilations.

core/src/main/java/io/questdb/griffin/engine/functions/window/CountFunctionFactoryHelper.java (1)

68-71: Helper class correctly adopts the new stateless validation pattern.

The refactoring successfully eliminates dependency on AbstractWindowFunctionFactory's stateful validation by:

  • Obtaining WindowContext directly from SqlExecutionContext
  • Using the centralized validate() method
  • Extracting frame boundaries from the context

This ensures thread safety for concurrent window function compilations.

core/src/main/java/io/questdb/griffin/engine/functions/window/MinDoubleWindowFunctionFactory.java (2)

38-38: Import addition supports the stateless refactoring.

Adding the WindowContext import enables the new thread-safe validation pattern.


64-71: Consistent implementation of the race condition fix.

This factory correctly adopts the new stateless validation approach:

  • Retrieves WindowContext from SqlExecutionContext instead of using shared factory state
  • Performs validation through the centralized validate() method
  • Extracts window frame boundaries directly from the context

This eliminates the race condition that occurred when multiple concurrent SQL compilations overwrote each other's parameters in shared factory instances.

core/src/main/java/io/questdb/griffin/engine/window/WindowContext.java (3)

30-30: Import addition supports the new validation method.

The SqlException import is required for the new validate() method signature.


41-42: Method reordering improves interface organization.

The repositioning of getNullsDescPos() and isIgnoreNulls() methods doesn't change functionality but may provide better logical grouping within the interface.

Also applies to: 65-66


71-71: Critical method addition enables centralized validation.

The new validate(int position, boolean supportTNullsDesc) method is essential for the race condition fix. It centralizes window parameter validation logic that was previously scattered across individual factory instances, eliminating the shared mutable state that caused race conditions in concurrent SQL compilations.

core/src/main/java/io/questdb/griffin/engine/functions/window/AvgDoubleWindowFunctionFactory.java (3)

48-48: LGTM: Necessary import for WindowContext refactoring.

The import addition is correctly placed and supports the architectural change to centralize window parameter validation within the WindowContext abstraction.


82-83: LGTM: Correct implementation of stateless validation.

This change successfully eliminates the race condition by replacing the stateful checkWindowParameter method with WindowContext-based validation. The WindowContext is obtained from the execution context (likely thread-local) and validation is performed without modifying shared factory state.


89-90: LGTM: Proper extraction of window frame parameters.

The window frame boundaries are now correctly extracted from the validated WindowContext instead of being accessed as instance fields. This completes the stateless refactoring and ensures thread-safety while maintaining the same functional behavior.

core/src/main/java/io/questdb/griffin/engine/functions/window/MaxDoubleWindowFunctionFactory.java (2)

46-46: LGTM - Import added for WindowContext interface.

The import statement correctly adds the WindowContext interface needed for the new stateless approach.


82-89: Excellent fix for the race condition issue!

This refactoring successfully addresses the core race condition problem by:

  1. Eliminating shared state: WindowContext is now retrieved per-execution from SqlExecutionContext instead of being stored as instance fields
  2. Thread-safe parameter extraction: rowsLo and rowsHi are extracted locally within method scope rather than relying on potentially overwritten instance fields
  3. Proper validation: The windowContext.validate() call replaces the removed checkWindowParameter() method while maintaining the same validation logic

This change ensures that concurrent SQL compilations no longer overwrite each other's window parameters, resolving the intermittent failures described in issue #5989.

core/src/main/java/io/questdb/griffin/engine/functions/window/CountConstWindowFunctionFactory.java (1)

43-43: LGTM! Correctly fixes the race condition.

The refactoring properly removes mutable state from the shared factory instance by delegating window parameter validation and storage to the WindowContext. This eliminates the race condition where concurrent SQL compilations would overwrite each other's parameters.

Also applies to: 62-69

core/src/main/java/io/questdb/griffin/engine/window/WindowContextImpl.java (1)

43-44: Well-structured centralized validation logic.

The new validation method and null handling fields properly centralize window parameter validation, eliminating the need for scattered validation logic across different factory classes. The getRowsHi() adjustment for EXCLUDE_CURRENT_ROW is correctly implemented.

Also applies to: 97-100, 122-124, 162-164, 167-169, 214-254

core/src/main/java/io/questdb/griffin/engine/functions/window/FirstValueDoubleWindowFunctionFactory.java (1)

46-46: Consistent refactoring that eliminates the race condition.

The changes properly delegate window parameter validation and storage to WindowContext, following the same pattern as other window function factories. This ensures thread-safety by removing mutable state from the shared factory instance.

Also applies to: 77-92, 99-106, 301-310

@glasstiger
Copy link
Copy Markdown
Contributor

[PR Coverage check]

😍 pass : 71 / 71 (100.00%)

file detail

path covered line new line coverage
🔵 io/questdb/griffin/engine/functions/window/CountConstWindowFunctionFactory.java 4 4 100.00%
🔵 io/questdb/griffin/engine/functions/window/MinDoubleWindowFunctionFactory.java 4 4 100.00%
🔵 io/questdb/griffin/engine/functions/window/MaxDoubleWindowFunctionFactory.java 4 4 100.00%
🔵 io/questdb/griffin/engine/functions/window/AvgDoubleWindowFunctionFactory.java 4 4 100.00%
🔵 io/questdb/griffin/engine/functions/window/CountFunctionFactoryHelper.java 4 4 100.00%
🔵 io/questdb/griffin/engine/functions/window/SumDoubleWindowFunctionFactory.java 4 4 100.00%
🔵 io/questdb/griffin/engine/functions/window/FirstValueDoubleWindowFunctionFactory.java 10 10 100.00%
🔵 io/questdb/griffin/engine/functions/window/AbstractWindowFunctionFactory.java 1 1 100.00%
🔵 io/questdb/griffin/engine/window/WindowContextImpl.java 26 26 100.00%
🔵 io/questdb/griffin/engine/functions/window/LastValueDoubleWindowFunctionFactory.java 10 10 100.00%

@jerrinot
Copy link
Copy Markdown
Contributor Author

jerrinot commented Aug 7, 2025

CI glitch: ##[error]No agent found in pool arm64 which satisfies the following demand: Agent.Name. All demands: Agent.Name -equals arm64-SelfHostedRunOther170250, maven, Agent.Version -gtVersion 2.182.1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky PG Wire with Window function error on demo query

4 participants