ci(build): move cover jobs to Hetzner#6308
Conversation
split cairo tests
it's a stopgap measure, eventually the maven should be baked into the image
this time for real
This reverts commit 6beac6a.
|
Important Review skippedAuto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the WalkthroughThis PR migrates CI infrastructure from Azure Pipelines to self-hosted Hetzner Docker runners, restructures test coverage jobs with refined test filtering and parallel execution, updates JVM runtime configuration for parallel GC and compiler settings, and increases SQL recompilation retry limits from hardcoded 10 to configurable value. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes
Suggested labels
Suggested reviewers
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@CodeRabbit review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
ci/templates/steps.yml (1)
260-274: 🚨 CRITICAL: Testset reference mismatch breaks Rust binary upload.Lines 263 and 271 reference
testset: 'coverage-griffin', but the new job matrix (self-hosted-cover-jobs.yml) defines only'coverage-griffin-root'and'coverage-griffin-sub'. This condition will never match, causing Rust binaries to not upload for coverage.The downstream coverage report pipeline (test-pipeline.yml lines 98–100) expects these binaries; their absence will degrade Rust coverage reporting.
Apply this diff to fix the condition:
displayName: "Prepare Rust binaries for upload" condition: | and( - eq(variables['testset'], 'coverage-griffin'), + or( + eq(variables['testset'], 'coverage-griffin-root'), + eq(variables['testset'], 'coverage-griffin-sub') + ), eq(variables['RUST_SOURCE_CODE_CHANGED'], 'true'), eq(variables['System.PullRequest.IsFork'], 'false') ) - task: PublishBuildArtifacts@1 displayName: "Upload Rust binaries needed to merge coverage files" condition: | and( - eq(variables['testset'], 'coverage-griffin'), + or( + eq(variables['testset'], 'coverage-griffin-root'), + eq(variables['testset'], 'coverage-griffin-sub') + ), eq(variables['RUST_SOURCE_CODE_CHANGED'], 'true'), eq(variables['System.PullRequest.IsFork'], 'false') )ci/templates/self-hosted-cover-jobs.yml (1)
8-83: Fix test pattern overlaps in fuzz2 job configuration.Verification reveals 29 tests executing across multiple jobs, wasting CI resources:
Root cause: fuzz2's
includeTests: "**/O3*Test**,**/**Fuzz**"overlaps with cairo and other jobs
- O3 tests (e.g.,
AbstractO3Test.java) matched by both fuzz2 and cairo jobs- Fuzz tests outside cairo (e.g.,
HttpSenderMemoryPressureFuzzTest.java) matched by both fuzz2 and other jobsImpact: 29 duplicate test executions across the matrix
Solution: Tighten fuzz2's
includeTestsand/orexcludeTeststo avoid overlap, or restructure to use path-based partitioning (e.g., include only O3 and Fuzz tests from specific modules not already covered by cairo or pgwire).
🧹 Nitpick comments (4)
compat/src/test/java/io/questdb/compat/InfluxDBClientFailureTest.java (1)
172-176: Fix typo in comment.The comment provides helpful context for the non-deterministic errno behavior. However, there's a typo on line 175: "thus is not determistic" should be "thus is not deterministic".
Apply this diff:
- // gets 0. + // gets 0. "{\"code\":\"internal error\",\"message\":\"failed to parse line protocol:errors encountered on line(s):write error: failed_table, errno: ", "error: could not open read-write"Also fix the typo:
- // whatever errno was set at the last actual failure. thus is not determistic, on Windows it occasionally + // whatever errno was set at the last actual failure. thus is not deterministic, on Windows it occasionallycore/src/test/java/io/questdb/test/griffin/SqlCompilerImplTest.java (1)
6275-6288: Good defensive catch; consider always freeing via finally.Logic is fine and verifies the intended transient failure mode. Minor robustness: capture the factory and free it in finally to guarantee release even if assertions later throw.
- try { - Misc.freeIfCloseable(select("select * from x")); - } catch (SqlException e) { - // This loop is a stress test intended to provoke a race condition - // in the query compiler. We are specifically looking for the - // "too many cached query" exception, as this error confirms - // the compiler *did* attempt to retry (as required) before - // exhausting those retries due to the race. - // - // We only tolerate this specific error. the assertion ensures - // any other SqlException (e.g., a failure without retrying) - // still fails the test. - TestUtils.assertContains(e.getFlyweightMessage(), "too many cached query"); - } + RecordCursorFactory f = null; + try { + f = select("select * from x"); + } catch (SqlException e) { + // See comment above: tolerate only the specific cached-query exhaustion. + TestUtils.assertContains(e.getFlyweightMessage(), "too many cached query"); + } finally { + Misc.freeIfCloseable(f); + }core/src/test/java/io/questdb/test/ServerMainQueryTimeoutTest.java (1)
113-113: Consider renaminglocalLeaksand documenting the threshold.The variable name
localLeaksis misleading because it counts timeouts, not confirmed resource leaks. A timeout may correlate with a potential leak, but doesn't confirm one. Additionally, the magic number5lacks justification—why is 5 the appropriate threshold for stopping iterations?Apply this diff to improve clarity:
- for (int i = 0, localLeaks = 0; i < nIterations && localLeaks < 5; i++) { + // Limit consecutive timeouts to prevent accumulating potentially leaked resources + for (int i = 0, timeoutCount = 0; i < nIterations && timeoutCount < 5; i++) {And update line 145 accordingly:
- localLeaks++; + timeoutCount++;Consider whether the threshold of 5 is appropriate for your CI environment, or if it should be configurable.
ci/templates/self-hosted-cover-jobs.yml (1)
84-87: I'll verify the pool configuration by examining the actual YAML file and researching Azure Pipelines documentation on howvmImageandnameinteract.Based on the Azure Pipelines documentation, I can now verify the pool configuration concern:
vmImage is valid only in the Microsoft-hosted pool. When both
name(self-hosted) andvmImageare specified in a pool configuration, to use a Microsoft-hosted pool, omit the name and specify one of the available hosted images. This means thename: $(poolName)will take precedence and the pipeline will correctly execute on the self-hosted hetzner-docker pool, making thevmImage: $(imageName)redundant.The configuration is functional but non-standard. The pool will use the self-hosted "hetzner-docker" pool as intended, but setting
vmImageon a self-hosted pool configuration is unnecessary and could cause confusion.
Remove redundant vmImage from self-hosted pool configuration.
The pool specifies both
vmImage: $(imageName)andname: $(poolName). SincevmImageis only used for Microsoft-hosted pools, it should be removed when using a self-hosted pool (name). The pipeline will correctly run on hetzner-docker, but the configuration is cleaner without the unusedvmImageproperty.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
ci/templates/self-hosted-cover-jobs.yml(1 hunks)ci/templates/steps.yml(3 hunks)ci/test-pipeline.yml(2 hunks)compat/src/test/java/io/questdb/compat/InfluxDBClientFailureTest.java(1 hunks)core/pom.xml(2 hunks)core/src/main/java/io/questdb/cairo/DefaultCairoConfiguration.java(3 hunks)core/src/main/java/io/questdb/cairo/wal/OperationExecutor.java(3 hunks)core/src/main/java/io/questdb/std/IOURingFacadeImpl.java(1 hunks)core/src/test/java/io/questdb/test/ServerMainQueryTimeoutTest.java(2 hunks)core/src/test/java/io/questdb/test/cutlass/http/MetricsIODispatcherTest.java(1 hunks)core/src/test/java/io/questdb/test/griffin/SqlCompilerImplTest.java(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (31)
- GitHub Check: New pull request (SelfHosted Griffin And Fuzz tests on linux-arm64)
- GitHub Check: New pull request (SelfHosted Griffin And Fuzz tests on linux-x64-zfs)
- GitHub Check: New pull request (SelfHosted Other tests on linux-x64-zfs)
- GitHub Check: New pull request (SelfHosted Other tests on linux-arm64)
- GitHub Check: New pull request (SelfHosted Griffin And Fuzz tests Start X64Zfs Agent)
- GitHub Check: New pull request (SelfHosted Griffin And Fuzz tests Start ARM Agent)
- GitHub Check: New pull request (Rust Test and Lint on linux-jdk17)
- GitHub Check: New pull request (Hosted Running tests on windows-other)
- GitHub Check: New pull request (Hosted Running tests on windows-pgwire)
- GitHub Check: New pull request (Hosted Running tests on windows-cairo)
- GitHub Check: New pull request (SelfHosted Running tests with cover on linux-other)
- GitHub Check: New pull request (Hosted Running tests on windows-fuzz2)
- GitHub Check: New pull request (SelfHosted Running tests with cover on linux-pgwire)
- GitHub Check: New pull request (Hosted Running tests on windows-fuzz1)
- GitHub Check: New pull request (SelfHosted Running tests with cover on linux-cairo-sub)
- GitHub Check: New pull request (Hosted Running tests on windows-griffin)
- GitHub Check: New pull request (SelfHosted Running tests with cover on linux-cairo-root)
- GitHub Check: New pull request (Hosted Running tests on mac-other)
- GitHub Check: New pull request (SelfHosted Running tests with cover on linux-fuzz2)
- GitHub Check: New pull request (Hosted Running tests on mac-pgwire)
- GitHub Check: New pull request (SelfHosted Running tests with cover on linux-fuzz1)
- GitHub Check: New pull request (Hosted Running tests on mac-cairo-fuzz)
- GitHub Check: New pull request (SelfHosted Running tests with cover on linux-griffin-sub)
- GitHub Check: New pull request (Hosted Running tests on mac-cairo)
- GitHub Check: New pull request (SelfHosted Running tests with cover on linux-griffin-root)
- GitHub Check: New pull request (SelfHosted Other tests Start X64Zfs Agent)
- GitHub Check: New pull request (Hosted Running tests on mac-griffin)
- GitHub Check: New pull request (SelfHosted Other tests Start ARM Agent)
- GitHub Check: New pull request (Check Changes Check changes)
- GitHub Check: build
- GitHub Check: build
🔇 Additional comments (14)
core/src/test/java/io/questdb/test/cutlass/http/MetricsIODispatcherTest.java (1)
273-273: Verify timeout increase rationale and consistency.The HTTP response await timeout has been tripled to 15 seconds. While this may accommodate the performance characteristics of the new Hetzner infrastructure, consider:
- Is this increase based on observed timeout failures on Hetzner, or is it a preventive measure?
- The
recvtimeout on line 284 remains at 5 seconds. Should it also be increased for consistency, or is the longer initial response time specific to connection establishment?- Does this timeout increase mask a performance regression that should be investigated separately?
To verify whether the new infrastructure exhibits significantly different latency characteristics, you could add temporary logging or metrics around these HTTP operations to understand typical response times on Hetzner versus the previous Azure environment.
core/src/main/java/io/questdb/cairo/wal/OperationExecutor.java (3)
47-47: Good addition of configurable retry limit.The new field correctly captures the maximum recompilation attempts from the engine configuration, making the previously hardcoded limit (10) configurable.
68-68: Field initialization is correctly placed.The
maxRecompilationAttemptsfield is initialized after the engine assignment, ensuring the configuration is available.
102-102: Retry limit now uses configurable value.The hardcoded limit of 10 has been replaced with the configurable
maxRecompilationAttemptsfield, which is now set to 50 in the default configuration (see DefaultCairoConfiguration.java line 549).core/src/main/java/io/questdb/cairo/DefaultCairoConfiguration.java (3)
548-550: Verify the 5x increase in retry limit is appropriate.The maximum SQL recompilation attempts increased from 10 to 50. While this provides more resilience for transient errors during table renames (as mentioned in the retry logic comment at OperationExecutor.java line 98-100), it could also delay error reporting by up to 5x in scenarios where the issue persists.
Confirm this increase aligns with expected retry behavior and doesn't introduce unacceptable delays in error scenarios.
678-681: New Parquet export configuration added.The new method
getParquetExportCopyReportFrequencyLines()returns 500,000, which appears to control the frequency of copy progress reporting for Parquet exports. The value seems reasonable for batch reporting.
694-701: Method reordering without functional changes.The
getParquetExportTableNamePrefix()andgetParquetExportVersion()methods have been reordered but their return values and signatures remain unchanged. This has no functional impact.core/pom.xml (2)
36-36: Add UseParallelGC to argLine for CI performance.Parallel garbage collection is a reasonable choice for CI/test environments to optimize throughput. However, pair this with documentation explaining the configuration's impact on build time and test execution on self-hosted runners.
87-88: Remove redundant flag-XX:TieredStopAtLevel=1and clarify the intent of tiered compilation disabling.The flags
-XX:-TieredCompilationand-XX:TieredStopAtLevel=1are redundant in current HotSpot—when tiered compilation is disabled,TieredStopAtLevelis ignored. The second flag has no effect and should be removed.This configuration lacks documentation explaining why tiered compilation is disabled. Before proceeding, verify:
- Is the intent to reduce compilation overhead, improve determinism, or maintain compatibility with older JVM versions?
- Has this been measured to impact build times on your CI environment?
If the goal is to constrain JIT compilation, use only
TieredStopAtLevel=1(with tiered compilation enabled) and remove the disable flag. If you must disable tiered compilation entirely, keep only-XX:-TieredCompilationand add an explanatory comment about the tradeoff.core/src/main/java/io/questdb/std/IOURingFacadeImpl.java (1)
107-110: Excellent documentation of the defensive probe rationale.The comment clearly explains why runtime verification is necessary beyond kernel version checks, especially relevant for containerized or restricted environments (like the Hetzner CI runners mentioned in the PR).
ci/templates/self-hosted-cover-jobs.yml (1)
3-4: Adding workspace clean for self-hosted runners.Workspace
clean: allis a good practice for self-hosted runners to free up disk space, though it may add slight overhead to job startup. This is acceptable for coverage jobs that run less frequently.ci/templates/steps.yml (2)
144-150: Whitespace alignment fix approved.Line 148 removes trailing space from
-Doutargument. This is good hygiene and doesn't affect functionality.
13-21: No issues found—the install step is correct and necessary for hetzner-docker.The step runs only on
poolName == 'hetzner-docker'because these self-hosted runners don't come with pre-installed Java/Maven. The path/usr/lib/jvm/java-17-openjdk-amd64is the standard Ubuntu location for theopenjdk-17-jdkpackage and will be created byapt-get install. The Azure Pipelines Maven@3 task resolves Java via thejdkvariable (set to1.17), and the step correctly sets bothJAVA_HOMEandJAVA_HOME_17_X64for downstream use. No conflicts or issues detected.Likely an incorrect or invalid review comment.
ci/test-pipeline.yml (1)
57-62: No code issues found—internal references to old stage name are absent.The search confirms the old stage name
HostedRunTestsCoverageBranchesis completely removed from the codebase. The rename is applied consistently withinci/test-pipeline.yml, with no broken references left behind.The review's concern about external systems (monitoring dashboards, notifications, automation scripts) referencing the old name is valid but falls outside code verification scope. These dependencies would need to be coordinated separately with the teams managing those systems.
[PR Coverage check]😍 pass : 12 / 13 (92.31%) file detail
|
It moves the coverage testing to Hetzner. This free-up Azure Pipelines Managed runners for MacOs and Windows builds and reduce queuing.