Skip to content

fix(ssl): verify-ca/verify-full chain completion + sslmode=prefer timeout fix#734

Merged
NikolayS merged 3 commits intomainfrom
fix/connect-timeout-and-verify-ca
Mar 25, 2026
Merged

fix(ssl): verify-ca/verify-full chain completion + sslmode=prefer timeout fix#734
NikolayS merged 3 commits intomainfrom
fix/connect-timeout-and-verify-ca

Conversation

@NikolayS
Copy link
Copy Markdown
Owner

Fixes

Closes #712 and #723.


#712 — verify-ca / verify-full fail with UnknownIssuer

Root cause

rustls's WebPkiServerVerifier requires the full certificate chain in the TLS handshake. PostgreSQL by default sends only the leaf cert (ssl_cert_file contains only the server cert). OpenSSL/psql can complete the chain from the local trust store; rustls cannot.

Fix

NoCnVerifier (verify-ca) and new FullVerifier (verify-full) now carry the certs parsed from sslrootcert as extra_intermediates. When initial verification fails with UnknownIssuer, a second attempt prepends those certs to the intermediates slice. This covers the common case — sslrootcert is a bundle with the CA root and any intermediates — without requiring Postgres server reconfiguration.

FullVerifier replaces the previous ClientConfig::with_root_certificates approach (which had no chain completion) with a custom ServerCertVerifier that does full chain + hostname validation via WebPkiServerVerifier.

Workaround that still works: Configure Postgres with ssl_cert_file = server-chain.crt (leaf + CA concatenated). Now the fix also works when you can't change the server.


#723 — sslmode=prefer timeout ~2× configured value

Root cause

connect_timeout is applied per attempt by tokio-postgres. For sslmode=prefer, rpg tries TLS (full timeout), then falls back to plain (full timeout again). Host unreachable = 2× elapsed.

Fix

In the Prefer match arm, detect timeout errors from the TLS probe (ConnectionFailed with "timeout" in the reason) and propagate immediately. Timeout = host unreachable = no point in falling back. Non-timeout TLS failures (server doesn't support SSL, handshake error) still trigger the plain fallback.


Changes

  • src/connection.rs:
    • load_certs_as_intermediates() — load raw DER certs from PEM file for use as intermediates
    • is_timeout_error() — classify timeout errors (testable helper)
    • NoCnVerifier — carries extra_intermediates, two-attempt chain verification
    • FullVerifier — new struct, same chain completion but WITH hostname check (verify-full)
    • make_tls_config_verify_full — uses FullVerifier instead of with_root_certificates
    • SslMode::Prefer match arm — propagates timeout immediately

Test note

CI's connection-tests job SKIPs D5/D6 (verify-ca/verify-full) pending #712. Once this PR merges, those SKIPs can be removed and the tests should pass with the self-signed cert bundle at /tmp/pg-tls-certs/srv-chain.crt.

@NikolayS
Copy link
Copy Markdown
Owner Author

REV Code Review — PR #734

Reviewing src/connection.rs diff (two fixes: chain-complete SSL for verify-ca/verify-full, and sslmode=prefer double-timeout).


PRAISE — Two-attempt chain completion algorithm

NoCnVerifier::verify_server_cert and FullVerifier::verify_server_cert both use the same clean two-attempt pattern: try what the server sent, then on UnknownIssuer retry with certs from sslrootcert prepended as intermediates. This matches what OpenSSL/psql do internally and is the correct algorithm. The guard !self.extra_intermediates.is_empty() means the fast path is unchanged when no bundle is configured.

Security analysis: augmenting the intermediate list does not lower the trust bar — the chain must still terminate at a trust anchor in roots. No security regression.


PRAISE — FullVerifier hostname check preserved

The new FullVerifier passes the real server_name (not dummy.invalid) to WebPkiServerVerifier::verify_server_cert, so hostname validation is fully intact for sslmode=verify-full. Clean separation from NoCnVerifier's dummy-name trick.


PRAISE — Timeout fix is correctly scoped

is_timeout_error is only checked on the sslmode=prefer TLS→plain fallback path. Other sslmodes are unaffected. The function correctly limits its match to ConnectionError::ConnectionFailed, so TLS errors (SslNotSupported, TlsError, etc.) are never misidentified as timeouts.


MINOR — is_timeout_error() relies on string matching (src/connection.rs, is_timeout_error)

fn is_timeout_error(e: &ConnectionError) -> bool {
    match e {
        ConnectionError::ConnectionFailed { reason, .. } => {
            let r = reason.to_lowercase();
            r.contains("timeout") || r.contains("timed out")
        }
        _ => false,
    }
}

This works for the current tokio-postgres 0.7.16, which explicitly constructs:

io::Error::new(io::ErrorKind::TimedOut, "connection timed out")

(confirmed in connect_socket.rs). It also catches OS-level ETIMEDOUT ("Connection timed out").

The problem: the match depends on a human-readable string from an external crate. A future tokio-postgres patch that changes the message (e.g., to "connect timed out" or "deadline elapsed") would silently break the fix — the fallback would proceed and double-timeout again, with no error at the call site.

Suggestion: Add a comment citing the exact tokio-postgres source location and version:

// Matches the literal string tokio-postgres constructs for connect_timeout
// expiry (connect_socket.rs:62, v0.7.x: "connection timed out") and
// OS-level ETIMEDOUT ("Connection timed out").  If tokio-postgres changes
// this string the fix degrades gracefully (fallback proceeds, slower) but
// silently — revisit if #723 reappears after a tokio-postgres upgrade.

Longer term, thread io::ErrorKind through ConnectionError::ConnectionFailed so the check can be kind == TimedOut instead of a string scan.


MINOR — PGCONNECT_TIMEOUT=0s when no timeout is configured (src/connection.rs:2426)

reason: format!("connection timed out (PGCONNECT_TIMEOUT={timeout}s)",
    timeout = params.connect_timeout.unwrap_or(0)),

When connect_timeout is None (no application-level timeout configured), an OS-level TCP timeout (ETIMEDOUT) can still fire. is_timeout_error would catch it (the OS message "Connection timed out" matches), and the error displayed to the user would be:

connection timed out (PGCONNECT_TIMEOUT=0s)

0s implies the user configured a zero-second timeout, which is incorrect and confusing. When connect_timeout is None, the message should not mention PGCONNECT_TIMEOUT at all, or should omit the value.

Suggestion:

reason: match params.connect_timeout {
    Some(t) => format!("connection timed out (PGCONNECT_TIMEOUT={t}s)"),
    None    => "connection timed out".to_string(),
},

MINOR — WebPkiServerVerifier rebuilt on every call in FullVerifier::verify_chain (src/connection.rs, FullVerifier)

fn verify_chain(&self, ...) -> Result<ServerCertVerified, RustlsError> {
    let verifier = rustls::client::WebPkiServerVerifier::builder_with_provider(
        Arc::new(self.roots.clone()),   // clones entire RootCertStore each call
        Arc::clone(&self.provider),
    )
    .build()
    .map_err(|e| RustlsError::General(...))?;

    verifier.verify_server_cert(...)
}

build() validates all trust anchors; self.roots.clone() copies potentially hundreds of roots (e.g. the webpki_roots bundle). This runs once per TLS handshake, and twice per handshake on the chain-completion retry path.

NoCnVerifier::verify_chain has the same pattern (pre-existing, not new in this PR), so fixing only FullVerifier would be inconsistent, but FullVerifier is entirely new code.

Suggestion: Build the inner verifier once in FullVerifier::new() and store Arc<WebPkiServerVerifier>:

struct FullVerifier {
    inner: Arc<rustls::client::WebPkiServerVerifier>,
    extra_intermediates: Vec<CertificateDer<'static>>,
}

If build() can fail, propagate the error through a Result-returning new() (callers already handle Result<ClientConfig, ConnectionError>).


NIT — Silent failure in load_certs_as_intermediates (src/connection.rs:1644–1651)

let Ok(pem) = std::fs::read(p) else { return vec![] };
rustls_pemfile::certs(&mut pem.as_slice())
    .filter_map(Result::ok)   // silently drops parse errors
    .collect()

In practice the file was already read successfully by load_root_cert_store moments earlier, so both silent branches are unlikely. The filter_map(Result::ok) is the more concerning one: a partially-malformed bundle (e.g. mixed PEM/DER) would yield fewer intermediates than expected without any diagnostic.

This doesn't affect chain verification correctness (the first attempt still runs; only the retry path is impacted), but it could make debugging issue recurrences harder.

Suggestion: Not worth restructuring for this PR, but log a debug/trace message on parse error, or at minimum document the silent-failure intent in a comment.


NIT — ring provider hardcoded in both verifiers

Both NoCnVerifier::new and FullVerifier::new do:

provider: Arc::new(rustls::crypto::ring::default_provider()),

If the crate later adds an aws-lc-rs feature or makes the provider configurable at the ClientConfig level, both verifiers will need updating. This matches the pre-existing NoCnVerifier pattern so no regression here, but worth a // TODO if a provider-config story is planned.


Overall verdict: APPROVE (with MINOR issues noted)

Both fixes are logically correct. The chain-completion two-attempt approach matches psql/OpenSSL semantics and introduces no security regression. The timeout propagation is correctly scoped. The two MINOR issues (PGCONNECT_TIMEOUT=0s display and WebPkiServerVerifier rebuild cost) are real but non-blocking; the string-matching fragility is low risk given the explicit tokio-postgres source pin. All findings above are NON-BLOCKING.

Copyright 2026.

@NikolayS
Copy link
Copy Markdown
Owner Author

REV Code Review — PR #734

Reviewing src/connection.rs diff (two fixes: chain-complete SSL for verify-ca/verify-full, and sslmode=prefer double-timeout).


PRAISE — Two-attempt chain completion algorithm

NoCnVerifier::verify_server_cert and FullVerifier::verify_server_cert both use the same clean two-attempt pattern: try what the server sent, then on UnknownIssuer retry with certs from sslrootcert prepended as intermediates. This matches what OpenSSL/psql do internally and is the correct algorithm. The guard !self.extra_intermediates.is_empty() means the fast path is unchanged when no bundle is configured.

Security analysis: augmenting the intermediate list does not lower the trust bar — the chain must still terminate at a trust anchor in roots. No security regression.


PRAISE — FullVerifier hostname check preserved

The new FullVerifier passes the real server_name (not dummy.invalid) to WebPkiServerVerifier::verify_server_cert, so hostname validation is fully intact for sslmode=verify-full. Clean separation from NoCnVerifier's dummy-name trick.


PRAISE — Timeout fix is correctly scoped

is_timeout_error is only checked on the sslmode=prefer TLS→plain fallback path. Other sslmodes are unaffected. The function correctly limits its match to ConnectionError::ConnectionFailed, so TLS errors (SslNotSupported, TlsError, etc.) are never misidentified as timeouts.


MINOR — is_timeout_error() relies on string matching (src/connection.rs, is_timeout_error)

fn is_timeout_error(e: &ConnectionError) -> bool {
    match e {
        ConnectionError::ConnectionFailed { reason, .. } => {
            let r = reason.to_lowercase();
            r.contains("timeout") || r.contains("timed out")
        }
        _ => false,
    }
}

This works for the current tokio-postgres 0.7.16, which explicitly constructs:

io::Error::new(io::ErrorKind::TimedOut, "connection timed out")

(confirmed in connect_socket.rs). It also catches OS-level ETIMEDOUT ("Connection timed out").

The problem: the match depends on a human-readable string from an external crate. A future tokio-postgres patch that changes the message (e.g., to "connect timed out" or "deadline elapsed") would silently break the fix — the fallback would proceed and double-timeout again, with no error at the call site.

Suggestion: Add a comment citing the exact tokio-postgres source location and version:

// Matches the literal string tokio-postgres constructs for connect_timeout
// expiry (connect_socket.rs:62, v0.7.x: "connection timed out") and
// OS-level ETIMEDOUT ("Connection timed out").  If tokio-postgres changes
// this string the fix degrades gracefully (fallback proceeds, slower) but
// silently — revisit if #723 reappears after a tokio-postgres upgrade.

Longer term, thread io::ErrorKind through ConnectionError::ConnectionFailed so the check can be kind == TimedOut instead of a string scan.


MINOR — PGCONNECT_TIMEOUT=0s when no timeout is configured (src/connection.rs:2426)

reason: format!("connection timed out (PGCONNECT_TIMEOUT={timeout}s)",
    timeout = params.connect_timeout.unwrap_or(0)),

When connect_timeout is None (no application-level timeout configured), an OS-level TCP timeout (ETIMEDOUT) can still fire. is_timeout_error would catch it (the OS message "Connection timed out" matches), and the error displayed to the user would be:

connection timed out (PGCONNECT_TIMEOUT=0s)

0s implies the user configured a zero-second timeout, which is incorrect and confusing. When connect_timeout is None, the message should not mention PGCONNECT_TIMEOUT at all, or should omit the value.

Suggestion:

reason: match params.connect_timeout {
    Some(t) => format!("connection timed out (PGCONNECT_TIMEOUT={t}s)"),
    None    => "connection timed out".to_string(),
},

MINOR — WebPkiServerVerifier rebuilt on every call in FullVerifier::verify_chain (src/connection.rs, FullVerifier)

fn verify_chain(&self, ...) -> Result<ServerCertVerified, RustlsError> {
    let verifier = rustls::client::WebPkiServerVerifier::builder_with_provider(
        Arc::new(self.roots.clone()),   // clones entire RootCertStore each call
        Arc::clone(&self.provider),
    )
    .build()
    .map_err(|e| RustlsError::General(...))?;

    verifier.verify_server_cert(...)
}

build() validates all trust anchors; self.roots.clone() copies potentially hundreds of roots (e.g. the webpki_roots bundle). This runs once per TLS handshake, and twice per handshake on the chain-completion retry path.

NoCnVerifier::verify_chain has the same pattern (pre-existing, not new in this PR), so fixing only FullVerifier would be inconsistent, but FullVerifier is entirely new code.

Suggestion: Build the inner verifier once in FullVerifier::new() and store Arc<WebPkiServerVerifier>:

struct FullVerifier {
    inner: Arc<rustls::client::WebPkiServerVerifier>,
    extra_intermediates: Vec<CertificateDer<'static>>,
}

If build() can fail, propagate the error through a Result-returning new() (callers already handle Result<ClientConfig, ConnectionError>).


NIT — Silent failure in load_certs_as_intermediates (src/connection.rs:1644–1651)

let Ok(pem) = std::fs::read(p) else { return vec![] };
rustls_pemfile::certs(&mut pem.as_slice())
    .filter_map(Result::ok)   // silently drops parse errors
    .collect()

In practice the file was already read successfully by load_root_cert_store moments earlier, so both silent branches are unlikely. The filter_map(Result::ok) is the more concerning one: a partially-malformed bundle (e.g. mixed PEM/DER) would yield fewer intermediates than expected without any diagnostic.

This doesn't affect chain verification correctness (the first attempt still runs; only the retry path is impacted), but it could make debugging issue recurrences harder.

Suggestion: Not worth restructuring for this PR, but log a debug/trace message on parse error, or at minimum document the silent-failure intent in a comment.


NIT — ring provider hardcoded in both verifiers

Both NoCnVerifier::new and FullVerifier::new do:

provider: Arc::new(rustls::crypto::ring::default_provider()),

If the crate later adds an aws-lc-rs feature or makes the provider configurable at the ClientConfig level, both verifiers will need updating. This matches the pre-existing NoCnVerifier pattern so no regression here, but worth a // TODO if a provider-config story is planned.


Overall verdict: APPROVE (with MINOR issues noted)

Both fixes are logically correct. The chain-completion two-attempt approach matches psql/OpenSSL semantics and introduces no security regression. The timeout propagation is correctly scoped. The two MINOR issues (PGCONNECT_TIMEOUT=0s display and WebPkiServerVerifier rebuild cost) are real but non-blocking; the string-matching fragility is low risk given the explicit tokio-postgres source pin. All findings above are NON-BLOCKING.

Copyright 2026.

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Mar 25, 2026

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 5.07246% with 131 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.38%. Comparing base (fd54b19) to head (b0d655c).

Files with missing lines Patch % Lines
src/connection.rs 5.07% 131 Missing ⚠️
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #734      +/-   ##
==========================================
- Coverage   68.65%   68.38%   -0.27%     
==========================================
  Files          46       46              
  Lines       31132    31265     +133     
==========================================
+ Hits        21372    21379       +7     
- Misses       9760     9886     +126     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

#712)

rustls requires the full certificate chain in the TLS handshake.
PostgreSQL by default sends only the leaf certificate; OpenSSL (psql)
can complete the chain from the local trust store but rustls cannot,
causing UnknownIssuer errors even when the correct CA cert is supplied
via sslrootcert.

Fix: NoCnVerifier (verify-ca) and new FullVerifier (verify-full) now
carry the certs parsed from sslrootcert as extra_intermediates.  When
the initial verification fails with UnknownIssuer, a second attempt is
made with those certs prepended to the intermediates slice.  This covers
the common case where sslrootcert is a bundle containing the CA and any
intermediate certs, without requiring changes to the Postgres server
configuration.

FullVerifier replaces the previous make_tls_config_verify_full approach
(which used ClientConfig::with_root_certificates and had no chain
completion) with a custom ServerCertVerifier that performs full chain +
hostname validation using WebPkiServerVerifier.

fix(connection): sslmode=prefer timeout 2x configured value (#723)

When connect_timeout is set and sslmode=prefer, tokio-postgres applies
the timeout per attempt.  The TLS probe and the plaintext fallback each
consumed the full budget, making the total elapsed time ~2x the
configured value.

Fix: in the Prefer match arm, detect timeout errors from the TLS probe
(ConnectionFailed with 'timeout' in the reason string) and propagate
immediately without falling back.  A timeout means the host is
unreachable; falling back to plain would only time out again.
Non-timeout TLS failures (server does not support SSL, handshake
refused) still trigger the plain fallback as before.

Adds is_timeout_error() helper for testability.
Adds load_certs_as_intermediates() helper to load raw DER certs from a
PEM file for use as intermediates.
@NikolayS NikolayS force-pushed the fix/connect-timeout-and-verify-ca branch from 0da6f6c to b0d655c Compare March 25, 2026 02:09
@NikolayS NikolayS merged commit 406616f into main Mar 25, 2026
16 checks passed
@NikolayS NikolayS deleted the fix/connect-timeout-and-verify-ca branch March 25, 2026 02:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fix(ssl): sslmode=verify-ca and verify-full fail with UnknownIssuer when server does not send full chain

2 participants