Skip to content

fix(docker): suppress Go HTTP/2 auto-config breaking plain-tcp:// hijack (#668)#681

Merged
CybotTM merged 2 commits into
mainfrom
fix/668-hijack-tls-auto-mutation
May 15, 2026
Merged

fix(docker): suppress Go HTTP/2 auto-config breaking plain-tcp:// hijack (#668)#681
CybotTM merged 2 commits into
mainfrom
fix/668-hijack-tls-auto-mutation

Conversation

@CybotTM

@CybotTM CybotTM commented May 15, 2026

Copy link
Copy Markdown
Member

Summary

Closes #668.

With DOCKER_HOST=tcp://socket-proxy:2375 (plain HTTP, no TLS) the regular HTTP path works but ContainerExecAttach (the hijack path used by run_exec, ContainerAttach, ContainerLogs --follow) fails with:

tls: first record does not look like a TLS handshake

Root cause

Go's net/http lazily auto-configures HTTP/2 on the first request and allocates Transport.TLSClientConfig in place (to seed NextProtos=[h2 http/1.1] for ALPN). The Docker SDK's hijack dialer reads baseTransport.TLSClientConfig as its "TLS is required" signal. After the first plain-HTTP request lands, the SDK misreads the auto-installed config as operator-configured TLS and calls tls.Dial against the plain Docker daemon — handshake fails because the daemon answers with raw HTTP.

Stage DialContext TLSClientConfig
After NewClientWithOpts nil nil
After first _ping nil non-nil (NextProtos=[h2 http/1.1]) ❌

The SDK author already flagged the hijack dialer in moby/moby client.go#L439-L443 as suspect: "I honestly don't know why it doesn't do that."

Fix

One knob: set transport.TLSNextProto to a non-nil empty map on the non-TLS apply paths (applyUnixTransport, applyTCPTransport, applyPlainTransport) plus the createHTTPClient unknown-scheme fallback. That is the documented Go stdlib contract for "do not auto-enable HTTP/2", which prevents the in-place TLSClientConfig mutation. TLS Docker hosts (https://, tcp+tls://) keep TLSNextProto nil so ALPN h2 negotiation still works there.

Extracted as disableHTTP2AutoConfig with a // Why: comment so the subtle stdlib contract isn't deleted as dead code by a future reader.

Tests

New client_tcp_hijack_test.go (verified bidirectionally — fails with the exact user error pre-fix, passes post-fix):

  • TestPlainTCPHijack_ContainerExecAttachWorks — drives the full ContainerExecAttach hijack path against a plain-HTTP TCP listener with 101-Switching-Protocols semantics. The exact-shape regression test the existing suite lacked.
  • TestDisableHTTP2AutoConfig_KeepsTLSClientConfigNil — unit test pinning the stdlib invariant directly so a future net/http refactor cannot silently re-introduce the bug while the end-to-end SDK test still passes by chance.
  • TestApplyTLSTransport_DoesNotDisableHTTP2AutoConfig — negative control: TLS apply paths must keep TLSNextProto nil so ALPN h2 negotiation remains intact. Guards against accidentally extending the suppression to the TLS path.
  • TestApplyTransport_NonTLSDisableHTTP2AutoConfig — table test pinning that every non-TLS apply function calls the helper. Catches the silent-omit footgun if new schemes are added.
  • TestPlainHTTPHijack_ContainerExecAttach_FollowUpt.Skip placeholder so the gap is visible in go test -v and grep-able to fix(docker): http:// hijack falls through to invalid net.Dial("http", addr) #682.

Race-clean over 10 iterations under go test -race.

Docs

Out of scope

http:// hijack has a separate, pre-existing failure mode: SDK falls to net.Dial("http", addr), which Go rejects with unknown network http. Fixing it needs a TCP DialContext on the http transport. Filed as #682; called out in test comments and t.Skip-tracked above.

Test plan

  • New regression test passes with fix, fails without it
  • go test ./core/... clean
  • go test -race -count=10 clean (no flakes, no goroutine race)
  • golangci-lint run ./core/adapters/docker/... clean
  • CI green

…hijack

Closes #668.

With DOCKER_HOST=tcp://socket-proxy:2375 (plain HTTP, no TLS) the regular
HTTP path works but ContainerExecAttach (the hijack path used by run_exec,
ContainerAttach, ContainerLogs --follow) fails with

  tls: first record does not look like a TLS handshake

Root cause: Go's net/http lazily auto-configures HTTP/2 on the first
request and ALLOCATES Transport.TLSClientConfig in place (to seed
NextProtos=[h2 http/1.1] for ALPN). The Docker SDK's hijack dialer reads
baseTransport.TLSClientConfig as its "TLS is required" signal. After the
first plain-HTTP request lands the SDK misreads the auto-installed config
as operator-configured TLS and calls tls.Dial against the plain daemon.

Reproducer (drives ContainerExecAttach against a plain-HTTP TCP listener
that issues 101 Switching Protocols) fails with the exact user error pre-
fix and passes post-fix. The existing client_tcp_upgrade suite only
asserts DaemonHost() strings — it never exercises the hijack dialer.

Fix: set transport.TLSNextProto to a non-nil empty map on the non-TLS
apply paths (applyUnixTransport, applyTCPTransport, applyPlainTransport).
That is the documented stdlib contract for "do not auto-enable HTTP/2",
which prevents the in-place TLSClientConfig mutation. TLS Docker hosts
(https://, tcp+tls://) keep TLSNextProto nil so ALPN h2 still works.

Note: http:// hijack has a separate, pre-existing failure mode (SDK falls
to net.Dial("http", addr) which Go rejects with "unknown network http").
That requires a TCP DialContext on the http:// transport and is filed as
a follow-up enhancement.

Signed-off-by: Sebastian Mendel <[email protected]>
Copilot AI review requested due to automatic review settings May 15, 2026 21:28
@gemini-code-assist

Copy link
Copy Markdown

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@github-actions github-actions Bot added the tests label May 15, 2026
@github-actions

github-actions Bot commented May 15, 2026

Copy link
Copy Markdown

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

github-actions[bot]
github-actions Bot previously approved these changes May 15, 2026

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated approval for maintainer PR

All automated quality gates passed. See SECURITY_CONTROLS.md for compensating controls.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a TLS handshake error on the hijack path (ContainerExecAttach, used by run_exec) when DOCKER_HOST is a plain tcp:// URL. Root cause: Go's net/http lazily auto-configures HTTP/2 on first request, allocating Transport.TLSClientConfig in place; the Docker SDK's hijack dialer then misreads that as "TLS required" and attempts tls.Dial against a plain daemon.

Changes:

  • Adds disableHTTP2AutoConfig helper that sets transport.TLSNextProto to a non-nil empty map (the documented stdlib opt-out for HTTP/2 auto-config).
  • Wires the helper into the three non-TLS apply paths: applyUnixTransport, applyTCPTransport, applyPlainTransport (TLS paths keep TLSNextProto nil to allow ALPN h2).
  • Adds a regression test that primes the transport via Ping and then exercises ContainerExecAttach against a fake plain-HTTP Docker proxy with a 101 Switching Protocols hijack.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
core/adapters/docker/client.go Introduces disableHTTP2AutoConfig and applies it to non-TLS transports (unix/tcp/plain).
core/adapters/docker/client_tcp_hijack_test.go New regression test driving the hijack path against a fake plain-HTTP TCP Docker proxy.

@github-actions

Copy link
Copy Markdown

⚠️ Mutation Testing Results

Mutation Score: 0.00% (threshold: 60%)

⚠️ Score is below threshold. Consider improving test coverage or test quality.

What is mutation testing?

Mutation testing measures test quality by introducing small changes (mutations) to the code and checking if tests detect them. A higher score means better test effectiveness.

  • Killed mutants: Tests caught the mutation (good!)
  • Survived mutants: Tests missed the mutation (needs improvement)

@codecov

codecov Bot commented May 15, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 88.02%. Comparing base (566583b) to head (3f5c8bc).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #681   +/-   ##
=======================================
  Coverage   88.01%   88.02%           
=======================================
  Files          89       89           
  Lines       11280    11286    +6     
=======================================
+ Hits         9928     9934    +6     
  Misses       1103     1103           
  Partials      249      249           
Flag Coverage Δ
integration 88.00% <100.00%> (-0.02%) ⬇️
unittests 85.17% <100.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Tightens PR #681 based on the multi-axis review fan-out:

- Add unit test on disableHTTP2AutoConfig itself (transport-level invariant)
  so a future stdlib refactor cannot silently re-introduce the bug while
  the end-to-end SDK test still passes by chance.
- Add TLS-path negative control (TestApplyTLSTransport_DoesNotDisableHTTP2AutoConfig)
  that guards against accidentally extending the suppression to the TLS
  apply path, which would silently disable ALPN h2 negotiation for https://
  and tcp+tls:// hosts.
- Add table test pinning that every non-TLS apply function calls
  disableHTTP2AutoConfig — catches the silent-omit footgun when new
  schemes are added in future.
- Add t.Skip placeholder for http:// hijack so the gap is visible in
  `go test -v` output and reachable via grep from the test directory.
- Apply disableHTTP2AutoConfig to the createHTTPClient unknown-scheme
  fallback (defensive — same hijack risk applies to direct test callers
  with exotic schemes).
- Fix t.Logf race in the fake-proxy serve goroutine by routing through
  http.Server.Shutdown and a done channel so the goroutine exits before
  test cleanup completes.
- Trim disableHTTP2AutoConfig doc comment (drop cost-justification
  paragraph and the upstream-quote nicety; keep the load-bearing WHY
  and the upstream-code reference).
- Add CHANGELOG.md Unreleased > Fixed entry (matches the docker-adapter
  release-notes convention from prior PRs #634 / #627 / #653 / #613).
- Add docs/TROUBLESHOOTING.md entry for v0.25.0 plain-tcp:// run_exec
  failure mode (slots next to the related #605 socket-proxy section);
  cross-reference from the older v0.11.0 HTTP/2 section so operators
  scanning by topic land on the right bug.

Deferred to follow-up issues (out of scope for this PR):
- Refactor to schemeHandler.tlsCapable flag to kill the silent-omit
  footgun structurally.
- Dead ClientConfig.HTTPClient field (pre-existing, unrelated cleanup).

Signed-off-by: Sebastian Mendel <[email protected]>
@github-actions github-actions Bot added the documentation Improvements or additions to documentation label May 15, 2026
@sonarqubecloud

Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated approval for maintainer PR

All automated quality gates passed. See SECURITY_CONTROLS.md for compensating controls.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.

@CybotTM CybotTM added this pull request to the merge queue May 15, 2026
Merged via the queue into main with commit ae02413 May 15, 2026
31 checks passed
@CybotTM CybotTM deleted the fix/668-hijack-tls-auto-mutation branch May 15, 2026 22:10
CybotTM added a commit that referenced this pull request May 16, 2026
Address PR review: the new json-post bundled preset (#676) is an Added
feature and per semver belongs in a minor release (0.26.0), not a
patch (0.25.1). Move it back under [Unreleased] above the new
[0.25.1] heading. The four bugfixes (#681, #685, #686, #670) remain
in 0.25.1.

Signed-off-by: Sebastian Mendel <[email protected]>
CybotTM added a commit that referenced this pull request May 16, 2026
…Capable

Promote "non-TLS schemes must suppress HTTP/2 auto-config" from a
manually-duplicated call inside four apply* functions to a data-driven
property on the schemeHandler struct. createHTTPClient now invokes
disableHTTP2AutoConfig when !handler.tlsCapable, so:

- applyTLSTransport (https://, tcp+tls://): tlsCapable=true, no
  suppression — ALPN h2 negotiation still works.
- All other apply functions: tlsCapable=false, suppression handled
  centrally — a future scheme added to schemeHandlers either declares
  tlsCapable=true (and accepts the implications) or gets the #668 fix
  for free.

Closes the "silent regression risk" gap from #683: the per-apply
disableHTTP2AutoConfig call could be forgotten when adding a new
scheme. The structural invariant is now compile-time enforced via the
flag on the dispatch table.

No behavior change for existing schemes. Two test reshapes:

- TestApplyTransport_NonTLSDisableHTTP2AutoConfig renamed to
  TestSchemeHandler_NonTLSCapableSuppressesHTTP2 and rewritten to
  iterate schemeHandlers and assert via createHTTPClient instead of
  calling each apply function directly. The new shape automatically
  covers any future tlsCapable=false scheme.
- New TestSchemeHandler_TLSCapableMeansHTTP2Allowed locks the inverse
  half of the invariant per the #683 issue spec.

Refs #683, #681, #668.

Signed-off-by: Sebastian Mendel <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation Improvements or additions to documentation tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

run_exec failing when setting DOCKER_HOST

2 participants