Skip to content

ci(build): move cover jobs to Hetzner#6308

Merged
bluestreak01 merged 101 commits intomasterfrom
jh_selfhosted_cover
Oct 31, 2025
Merged

ci(build): move cover jobs to Hetzner#6308
bluestreak01 merged 101 commits intomasterfrom
jh_selfhosted_cover

Conversation

@jerrinot
Copy link
Copy Markdown
Contributor

@jerrinot jerrinot commented Oct 24, 2025

It moves the coverage testing to Hetzner. This free-up Azure Pipelines Managed runners for MacOs and Windows builds and reduce queuing.

@jerrinot jerrinot added CI Continuous Integration & Builds DO NOT MERGE These changes should not be merged to main branch labels Oct 24, 2025
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Oct 24, 2025

Important

Review skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

This PR migrates CI infrastructure from Azure Pipelines to self-hosted Hetzner Docker runners, restructures test coverage jobs with refined test filtering and parallel execution, updates JVM runtime configuration for parallel GC and compiler settings, and increases SQL recompilation retry limits from hardcoded 10 to configurable value.

Changes

Cohort / File(s) Summary
CI Infrastructure Migration
ci/templates/self-hosted-cover-jobs.yml
Restructured matrix with new jobs: linux-griffin-root/linux-griffin-sub, linux-fuzz1/linux-fuzz2, linux-cairo-root/linux-cairo-sub. Updated all poolName values to "hetzner-docker", refined testset names and test filtering patterns (includeTests/excludeTests) per job.
CI Conditional Adjustments
ci/templates/steps.yml
Added conditional poolName checks: Linux steps skip when poolName is hetzner-docker; new JDK/Maven install step for hetzner-docker only. Updated Maven caching logic and Rust/coverage conditions to exclude hetzner-docker. Minor formatting fixes in Maven arguments.
CI Pipeline Stage Migration
ci/test-pipeline.yml
Renamed stage HostedRunTestsCoverageBranches to SelfHostedRunTestsCoverageBranches, swapped template from hosted-cover-jobs.yml to self-hosted-cover-jobs.yml, updated dependent Coverage Report stage.
JVM and Build Configuration
core/pom.xml
Added JVM option -XX:+UseParallelGC to argLine; injected -J-XX:-TieredCompilation and -J-XX:TieredStopAtLevel=1 into Maven compiler plugin configuration.
Configuration Constants
core/src/main/java/io/questdb/cairo/DefaultCairoConfiguration.java
Bumped getMaxSqlRecompileAttempts() from 10 to 50; added public method getParquetExportCopyReportFrequencyLines() returning 500_000; reordered declarations of getParquetExportTableNamePrefix() and getParquetExportVersion().
SQL Compilation Retry Logic
core/src/main/java/io/questdb/cairo/wal/OperationExecutor.java
Introduced private field maxRecompilationAttempts initialized from engine configuration; replaced hardcoded retry limit of 10 with configurable limit in ALTER SQL compilation retry loop.
IO-uring Validation
core/src/main/java/io/questdb/std/IOURingFacadeImpl.java
Enhanced static initialization: added two-step validation combining kernel version check (isAvailableOn) with runtime io_uring creation attempt (IOUringAccessor.create), marking unavailable on creation failure.
Test Infrastructure
core/src/test/java/io/questdb/test/ServerMainQueryTimeoutTest.java
Added localLeaks counter tracking connection leaks; loop terminates early if leaks reach 5, reducing iterations when leaks detected.
Test Timeout Adjustment
core/src/test/java/io/questdb/test/cutlass/http/MetricsIODispatcherTest.java
Increased HTTP response timeout in testPrometheusScenario from 5,000 ms to 15,000 ms.
Test Error Handling
core/src/test/java/io/questdb/test/griffin/SqlCompilerImplTest.java
Wrapped query result free loop with try-catch for SqlException; asserts Flyweight message contains "too many cached query" on exception.
Test Formatting
compat/src/test/java/io/questdb/compat/InfluxDBClientFailureTest.java
Refactored expected JSON error message construction using string literal concatenation with inline explanatory comment; no behavioral change.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Attention areas:
    • CI matrix restructuring in self-hosted-cover-jobs.yml requires verification that test filter patterns (includeTests/excludeTests) correctly segregate tests across new job matrix, particularly griffin and cairo test coverage boundaries
    • Conditional poolName logic in steps.yml should be validated to ensure correct execution paths for both Azure and Hetzner runners
    • Verify getMaxSqlRecompileAttempts() increase from 10 to 50 doesn't significantly impact performance or test timeouts
    • IO-uring initialization change in IOURingFacadeImpl adds runtime overhead during startup; confirm acceptable performance impact
    • Test timeout increases (HTTP and potential compiler timeouts) should be validated as necessary adjustments rather than masking performance regressions

Suggested labels

Enhancement

Suggested reviewers

  • nwoolmer

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 5.56% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Title Check ✅ Passed The title "ci(build): move cover jobs to Hetzner" is directly related to the primary objective of the changeset. The pull request's main changes involve migrating CI coverage jobs from Azure Pipelines to Hetzner-hosted infrastructure, as evidenced by the restructuring of ci/templates/self-hosted-cover-jobs.yml and updates to ci/test-pipeline.yml that swap the template usage and update poolName values from "Azure Pipelines" to "hetzner-docker". While the changeset also includes supporting modifications to Java source code and test files (such as JVM options in core/pom.xml and adjustments to test retry logic), the title appropriately captures the primary objective and main change. The title is specific, clear, and avoids vague terminology.
Description Check ✅ Passed The pull request description is related to the changeset and clearly articulates the main objective. It states "It moves the coverage testing to Hetzner" and explains the motivation: "This free-up Azure Pipelines Managed runners for MacOs and Windows builds and reduce queuing." This aligns with the actual CI template changes that migrate coverage jobs to self-hosted infrastructure on Hetzner with updated poolName values. While the description is brief and does not enumerate all supporting changes (such as Java configuration and test modifications), it is not vague or generic—it provides specific and meaningful information about the core purpose of the changes. The description clearly relates to a significant portion of the changeset.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@bluestreak01
Copy link
Copy Markdown
Member

@CodeRabbit review

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Oct 31, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
ci/templates/steps.yml (1)

260-274: 🚨 CRITICAL: Testset reference mismatch breaks Rust binary upload.

Lines 263 and 271 reference testset: 'coverage-griffin', but the new job matrix (self-hosted-cover-jobs.yml) defines only 'coverage-griffin-root' and 'coverage-griffin-sub'. This condition will never match, causing Rust binaries to not upload for coverage.

The downstream coverage report pipeline (test-pipeline.yml lines 98–100) expects these binaries; their absence will degrade Rust coverage reporting.

Apply this diff to fix the condition:

     displayName: "Prepare Rust binaries for upload"
     condition: |
       and(
-        eq(variables['testset'], 'coverage-griffin'),
+        or(
+          eq(variables['testset'], 'coverage-griffin-root'),
+          eq(variables['testset'], 'coverage-griffin-sub')
+        ),
         eq(variables['RUST_SOURCE_CODE_CHANGED'], 'true'),
         eq(variables['System.PullRequest.IsFork'], 'false')
       )
   - task: PublishBuildArtifacts@1
     displayName: "Upload Rust binaries needed to merge coverage files"
     condition: |
       and(
-        eq(variables['testset'], 'coverage-griffin'),
+        or(
+          eq(variables['testset'], 'coverage-griffin-root'),
+          eq(variables['testset'], 'coverage-griffin-sub')
+        ),
         eq(variables['RUST_SOURCE_CODE_CHANGED'], 'true'),
         eq(variables['System.PullRequest.IsFork'], 'false')
       )
ci/templates/self-hosted-cover-jobs.yml (1)

8-83: Fix test pattern overlaps in fuzz2 job configuration.

Verification reveals 29 tests executing across multiple jobs, wasting CI resources:

  • Root cause: fuzz2's includeTests: "**/O3*Test**,**/**Fuzz**" overlaps with cairo and other jobs

    • O3 tests (e.g., AbstractO3Test.java) matched by both fuzz2 and cairo jobs
    • Fuzz tests outside cairo (e.g., HttpSenderMemoryPressureFuzzTest.java) matched by both fuzz2 and other jobs
  • Impact: 29 duplicate test executions across the matrix

  • Solution: Tighten fuzz2's includeTests and/or excludeTests to avoid overlap, or restructure to use path-based partitioning (e.g., include only O3 and Fuzz tests from specific modules not already covered by cairo or pgwire).

🧹 Nitpick comments (4)
compat/src/test/java/io/questdb/compat/InfluxDBClientFailureTest.java (1)

172-176: Fix typo in comment.

The comment provides helpful context for the non-deterministic errno behavior. However, there's a typo on line 175: "thus is not determistic" should be "thus is not deterministic".

Apply this diff:

-        // gets 0. 
+        // gets 0.
         "{\"code\":\"internal error\",\"message\":\"failed to parse line protocol:errors encountered on line(s):write error: failed_table, errno: ", "error: could not open read-write"

Also fix the typo:

-        // whatever errno was set at the last actual failure. thus is not determistic, on Windows it occasionally
+        // whatever errno was set at the last actual failure. thus is not deterministic, on Windows it occasionally
core/src/test/java/io/questdb/test/griffin/SqlCompilerImplTest.java (1)

6275-6288: Good defensive catch; consider always freeing via finally.

Logic is fine and verifies the intended transient failure mode. Minor robustness: capture the factory and free it in finally to guarantee release even if assertions later throw.

-                    try {
-                        Misc.freeIfCloseable(select("select * from x"));
-                    } catch (SqlException e) {
-                        // This loop is a stress test intended to provoke a race condition
-                        // in the query compiler. We are specifically looking for the
-                        // "too many cached query" exception, as this error confirms
-                        // the compiler *did* attempt to retry (as required) before
-                        // exhausting those retries due to the race.
-                        //
-                        // We only tolerate this specific error. the assertion ensures
-                        // any other SqlException (e.g., a failure without retrying)
-                        // still fails the test.
-                        TestUtils.assertContains(e.getFlyweightMessage(), "too many cached query");
-                    }
+                    RecordCursorFactory f = null;
+                    try {
+                        f = select("select * from x");
+                    } catch (SqlException e) {
+                        // See comment above: tolerate only the specific cached-query exhaustion.
+                        TestUtils.assertContains(e.getFlyweightMessage(), "too many cached query");
+                    } finally {
+                        Misc.freeIfCloseable(f);
+                    }
core/src/test/java/io/questdb/test/ServerMainQueryTimeoutTest.java (1)

113-113: Consider renaming localLeaks and documenting the threshold.

The variable name localLeaks is misleading because it counts timeouts, not confirmed resource leaks. A timeout may correlate with a potential leak, but doesn't confirm one. Additionally, the magic number 5 lacks justification—why is 5 the appropriate threshold for stopping iterations?

Apply this diff to improve clarity:

-                            for (int i = 0, localLeaks = 0; i < nIterations && localLeaks < 5; i++) {
+                            // Limit consecutive timeouts to prevent accumulating potentially leaked resources
+                            for (int i = 0, timeoutCount = 0; i < nIterations && timeoutCount < 5; i++) {

And update line 145 accordingly:

-                                    localLeaks++;
+                                    timeoutCount++;

Consider whether the threshold of 5 is appropriate for your CI environment, or if it should be configurable.

ci/templates/self-hosted-cover-jobs.yml (1)

84-87: I'll verify the pool configuration by examining the actual YAML file and researching Azure Pipelines documentation on how vmImage and name interact.

Based on the Azure Pipelines documentation, I can now verify the pool configuration concern:

vmImage is valid only in the Microsoft-hosted pool. When both name (self-hosted) and vmImage are specified in a pool configuration, to use a Microsoft-hosted pool, omit the name and specify one of the available hosted images. This means the name: $(poolName) will take precedence and the pipeline will correctly execute on the self-hosted hetzner-docker pool, making the vmImage: $(imageName) redundant.

The configuration is functional but non-standard. The pool will use the self-hosted "hetzner-docker" pool as intended, but setting vmImage on a self-hosted pool configuration is unnecessary and could cause confusion.


Remove redundant vmImage from self-hosted pool configuration.

The pool specifies both vmImage: $(imageName) and name: $(poolName). Since vmImage is only used for Microsoft-hosted pools, it should be removed when using a self-hosted pool (name). The pipeline will correctly run on hetzner-docker, but the configuration is cleaner without the unused vmImage property.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d979ed8 and 395a6e3.

📒 Files selected for processing (11)
  • ci/templates/self-hosted-cover-jobs.yml (1 hunks)
  • ci/templates/steps.yml (3 hunks)
  • ci/test-pipeline.yml (2 hunks)
  • compat/src/test/java/io/questdb/compat/InfluxDBClientFailureTest.java (1 hunks)
  • core/pom.xml (2 hunks)
  • core/src/main/java/io/questdb/cairo/DefaultCairoConfiguration.java (3 hunks)
  • core/src/main/java/io/questdb/cairo/wal/OperationExecutor.java (3 hunks)
  • core/src/main/java/io/questdb/std/IOURingFacadeImpl.java (1 hunks)
  • core/src/test/java/io/questdb/test/ServerMainQueryTimeoutTest.java (2 hunks)
  • core/src/test/java/io/questdb/test/cutlass/http/MetricsIODispatcherTest.java (1 hunks)
  • core/src/test/java/io/questdb/test/griffin/SqlCompilerImplTest.java (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (31)
  • GitHub Check: New pull request (SelfHosted Griffin And Fuzz tests on linux-arm64)
  • GitHub Check: New pull request (SelfHosted Griffin And Fuzz tests on linux-x64-zfs)
  • GitHub Check: New pull request (SelfHosted Other tests on linux-x64-zfs)
  • GitHub Check: New pull request (SelfHosted Other tests on linux-arm64)
  • GitHub Check: New pull request (SelfHosted Griffin And Fuzz tests Start X64Zfs Agent)
  • GitHub Check: New pull request (SelfHosted Griffin And Fuzz tests Start ARM Agent)
  • GitHub Check: New pull request (Rust Test and Lint on linux-jdk17)
  • GitHub Check: New pull request (Hosted Running tests on windows-other)
  • GitHub Check: New pull request (Hosted Running tests on windows-pgwire)
  • GitHub Check: New pull request (Hosted Running tests on windows-cairo)
  • GitHub Check: New pull request (SelfHosted Running tests with cover on linux-other)
  • GitHub Check: New pull request (Hosted Running tests on windows-fuzz2)
  • GitHub Check: New pull request (SelfHosted Running tests with cover on linux-pgwire)
  • GitHub Check: New pull request (Hosted Running tests on windows-fuzz1)
  • GitHub Check: New pull request (SelfHosted Running tests with cover on linux-cairo-sub)
  • GitHub Check: New pull request (Hosted Running tests on windows-griffin)
  • GitHub Check: New pull request (SelfHosted Running tests with cover on linux-cairo-root)
  • GitHub Check: New pull request (Hosted Running tests on mac-other)
  • GitHub Check: New pull request (SelfHosted Running tests with cover on linux-fuzz2)
  • GitHub Check: New pull request (Hosted Running tests on mac-pgwire)
  • GitHub Check: New pull request (SelfHosted Running tests with cover on linux-fuzz1)
  • GitHub Check: New pull request (Hosted Running tests on mac-cairo-fuzz)
  • GitHub Check: New pull request (SelfHosted Running tests with cover on linux-griffin-sub)
  • GitHub Check: New pull request (Hosted Running tests on mac-cairo)
  • GitHub Check: New pull request (SelfHosted Running tests with cover on linux-griffin-root)
  • GitHub Check: New pull request (SelfHosted Other tests Start X64Zfs Agent)
  • GitHub Check: New pull request (Hosted Running tests on mac-griffin)
  • GitHub Check: New pull request (SelfHosted Other tests Start ARM Agent)
  • GitHub Check: New pull request (Check Changes Check changes)
  • GitHub Check: build
  • GitHub Check: build
🔇 Additional comments (14)
core/src/test/java/io/questdb/test/cutlass/http/MetricsIODispatcherTest.java (1)

273-273: Verify timeout increase rationale and consistency.

The HTTP response await timeout has been tripled to 15 seconds. While this may accommodate the performance characteristics of the new Hetzner infrastructure, consider:

  1. Is this increase based on observed timeout failures on Hetzner, or is it a preventive measure?
  2. The recv timeout on line 284 remains at 5 seconds. Should it also be increased for consistency, or is the longer initial response time specific to connection establishment?
  3. Does this timeout increase mask a performance regression that should be investigated separately?

To verify whether the new infrastructure exhibits significantly different latency characteristics, you could add temporary logging or metrics around these HTTP operations to understand typical response times on Hetzner versus the previous Azure environment.

core/src/main/java/io/questdb/cairo/wal/OperationExecutor.java (3)

47-47: Good addition of configurable retry limit.

The new field correctly captures the maximum recompilation attempts from the engine configuration, making the previously hardcoded limit (10) configurable.


68-68: Field initialization is correctly placed.

The maxRecompilationAttempts field is initialized after the engine assignment, ensuring the configuration is available.


102-102: Retry limit now uses configurable value.

The hardcoded limit of 10 has been replaced with the configurable maxRecompilationAttempts field, which is now set to 50 in the default configuration (see DefaultCairoConfiguration.java line 549).

core/src/main/java/io/questdb/cairo/DefaultCairoConfiguration.java (3)

548-550: Verify the 5x increase in retry limit is appropriate.

The maximum SQL recompilation attempts increased from 10 to 50. While this provides more resilience for transient errors during table renames (as mentioned in the retry logic comment at OperationExecutor.java line 98-100), it could also delay error reporting by up to 5x in scenarios where the issue persists.

Confirm this increase aligns with expected retry behavior and doesn't introduce unacceptable delays in error scenarios.


678-681: New Parquet export configuration added.

The new method getParquetExportCopyReportFrequencyLines() returns 500,000, which appears to control the frequency of copy progress reporting for Parquet exports. The value seems reasonable for batch reporting.


694-701: Method reordering without functional changes.

The getParquetExportTableNamePrefix() and getParquetExportVersion() methods have been reordered but their return values and signatures remain unchanged. This has no functional impact.

core/pom.xml (2)

36-36: Add UseParallelGC to argLine for CI performance.

Parallel garbage collection is a reasonable choice for CI/test environments to optimize throughput. However, pair this with documentation explaining the configuration's impact on build time and test execution on self-hosted runners.


87-88: Remove redundant flag -XX:TieredStopAtLevel=1 and clarify the intent of tiered compilation disabling.

The flags -XX:-TieredCompilation and -XX:TieredStopAtLevel=1 are redundant in current HotSpot—when tiered compilation is disabled, TieredStopAtLevel is ignored. The second flag has no effect and should be removed.

This configuration lacks documentation explaining why tiered compilation is disabled. Before proceeding, verify:

  1. Is the intent to reduce compilation overhead, improve determinism, or maintain compatibility with older JVM versions?
  2. Has this been measured to impact build times on your CI environment?

If the goal is to constrain JIT compilation, use only TieredStopAtLevel=1 (with tiered compilation enabled) and remove the disable flag. If you must disable tiered compilation entirely, keep only -XX:-TieredCompilation and add an explanatory comment about the tradeoff.

core/src/main/java/io/questdb/std/IOURingFacadeImpl.java (1)

107-110: Excellent documentation of the defensive probe rationale.

The comment clearly explains why runtime verification is necessary beyond kernel version checks, especially relevant for containerized or restricted environments (like the Hetzner CI runners mentioned in the PR).

ci/templates/self-hosted-cover-jobs.yml (1)

3-4: Adding workspace clean for self-hosted runners.

Workspace clean: all is a good practice for self-hosted runners to free up disk space, though it may add slight overhead to job startup. This is acceptable for coverage jobs that run less frequently.

ci/templates/steps.yml (2)

144-150: Whitespace alignment fix approved.

Line 148 removes trailing space from -Dout argument. This is good hygiene and doesn't affect functionality.


13-21: No issues found—the install step is correct and necessary for hetzner-docker.

The step runs only on poolName == 'hetzner-docker' because these self-hosted runners don't come with pre-installed Java/Maven. The path /usr/lib/jvm/java-17-openjdk-amd64 is the standard Ubuntu location for the openjdk-17-jdk package and will be created by apt-get install. The Azure Pipelines Maven@3 task resolves Java via the jdk variable (set to 1.17), and the step correctly sets both JAVA_HOME and JAVA_HOME_17_X64 for downstream use. No conflicts or issues detected.

Likely an incorrect or invalid review comment.

ci/test-pipeline.yml (1)

57-62: No code issues found—internal references to old stage name are absent.

The search confirms the old stage name HostedRunTestsCoverageBranches is completely removed from the codebase. The rename is applied consistently within ci/test-pipeline.yml, with no broken references left behind.

The review's concern about external systems (monitoring dashboards, notifications, automation scripts) referencing the old name is valid but falls outside code verification scope. These dependencies would need to be coordinated separately with the teams managing those systems.

@jerrinot jerrinot removed the DO NOT MERGE These changes should not be merged to main branch label Oct 31, 2025
@glasstiger
Copy link
Copy Markdown
Contributor

[PR Coverage check]

😍 pass : 12 / 13 (92.31%)

file detail

path covered line new line coverage
🔵 io/questdb/std/IOURingFacadeImpl.java 6 7 85.71%
🔵 io/questdb/cairo/wal/OperationExecutor.java 2 2 100.00%
🔵 io/questdb/cairo/DefaultCairoConfiguration.java 4 4 100.00%

@bluestreak01 bluestreak01 merged commit 5e010f7 into master Oct 31, 2025
37 checks passed
@bluestreak01 bluestreak01 deleted the jh_selfhosted_cover branch October 31, 2025 18:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CI Continuous Integration & Builds

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants