Fix Hello Retry Request failure in QUIC by jogme · Pull Request #28115 · openssl/openssl

jogme · 2025-07-29T13:18:08Z

When a HRR happens, the clients new hello message contains transport parameters extension which we currently handle with a fail. In this fix I went with the option to drop the first params and reread the new ones. I am not 100% sure if that's the correct way - and not keeping the params from the first client hello - so please correct me if that's wrong.

Unfortunately this does not solve the quic interop CI issue with ngtcp2.

Fixes: openssl/project#1296

Checklist

documentation is added or updated
tests are added or updated

t8m · 2025-07-29T14:38:02Z

ssl/quic/quic_channel.c

+    /*
+     * When HRR happens the client sends the transport params in the new client
+     * hello again. Reset the transport params here and load them again.
+     */
+    if (sc->hello_retry_request != SSL_HRR_NONE && ch->got_remote_transport_params) {
+        ch->max_local_streams_bidi = 0;
+        ch->max_local_streams_uni = 0;
+        ch->got_local_transport_params = 0;
+        OPENSSL_free(ch->local_transport_params);
+        ch->local_transport_params = NULL;


Should we in this case actually check that the transport params as sent in the new client hello are the same as the original ones? Hmm... I did not find anything in regards to that in RFCs so perhaps this is OK as is.

I'm not sure, as the RFC does not mentions this case (at least I did not found it)

I think, generally what you are doing for this specific case of HRR is ok, as like @t8m theres nothing concrete in the RFC about this. Generally, you're not allowed to reset your transport params during a connection, but because the HRR is sent over an initial frame and triggers a new initial frame from the client, we're in this state where the connection is created from a QUIC standpoint, but expects to go through the channel establishment code again, as though the second initial packet from the client is acting (sort of) like the first. Its ugly from an RFC standpoint, but at least for now, ok.

That said, I think you may need to do a bit more here than just cleanse the transport parameters. Getting transport parameters implies the setting of several other fields in the channel structure (init_dcid, init_scid, etc), that may be different the second time around (i.e. the DCID in the second client frame is adopted from the SCID provided in the HRR), which may be causing some of your CI failures.

The CI is failing on the "test setup" when it sets the group list. Which I don't understand why at the moment..

When I try to just return 1 and don't reset the option, the failure remains the same tho

That said, I think you may need to do a bit more here than just cleanse the transport parameters. Getting transport parameters implies the setting of several other fields in the channel structure (init_dcid, init_scid, etc), that may be different the second time around (i.e. the DCID in the second client frame is adopted from the SCID provided in the HRR), which may be causing some of your CI failures.

The CI failures where FIPS related. Do you still think it would be better to check other fields also? My interpretation is, that it should not change anyway; we could also ignore the newly sent parameters and use the initial ones. Maybe that would be more straightforward. Can client hello retry happen because of change one of the transport parameters (or other parameters we are talking about here)?

Unsure. I think you'll get it to work with this approach (at least it works for me in a local test). Its probably worth doing this change as you have it and addressing subsequent problems as they occur.

t8m · 2025-07-29T14:39:28Z

CI is failing

jogme · 2025-07-29T14:43:38Z

I probably use algorithms in the test which are not available at some runners. Will take a look

Signed-off-by: Norbert Pocs <[email protected]>

When HRR happens a second client hello is sent and it consist of a transport params extension. This must be processed and not cause failure. Fixes: openssl/project#1296 Signed-off-by: Norbert Pocs <[email protected]>

Recently, our overnight QUIC interop runs began failing in CI when an openssl server was tested against an ngtcp2 client: https://github.com/openssl/openssl/actions/runs/16739736813 The underlying cause bears some explination for historical purposes The problem began happening with a recent update to ngtcp2 in which ngtcp2 updated its wolfssl tls backend to support ML-KEM, which caused ngtcp to emit a client hello message that offered several groups (including X25519MLKEM768) but only provided a keyshare for x25519. This in turn triggered the openssl server to respond with a hello retry request (HRR), requesting an ML-KEM keyshare instead, which ngtcp2 obliged. However all subsequent frames from the client were discarded by the server, due to failing packet body decryption. The problem was tracked down to a mismatch in the initial vectors used by the client and server, leading to an AEAD tag mismatch. Packet protection keys generate their IV's in QUIC by xoring the packet number of the received frame with the base IV as derived via HKDF in the tls layer. The underlying problem was that openssl hit a very odd corner case with how we compute the packet number of the received frame. To save space, QUIC encodes packet numbers using a variable length integer, and only sends the changed bits in the packet number. This requires that the receiver (openssl) store the largest received pn of the connection, which we nominally do. However, in default_port_packet_handler (where QUIC frames are processed prior to having an established channel allocated) we use a temporary qrx to validate the packet protection of those frames. This temporary qrx may be incorporated into the channel in some cases, but is not in the case of a valid frame that generates an HRR at the TLS layer. In this case, the channel allocates its own qrx independently. When this occurs, the largest_pn value of the temporary qrx is lost, and subsequent frames are unable to be received, as the newly allocated qrx belives that the larges_pn for a given pn_space is 0, rather than the value received in the initial frame (which was a complete 32 bit value, rather than just the changed lower 8 bits). As a result the IV construction produced the wrong value, and the decrypt failed on those subsequent frames. Up to this point, that wasn't even a problem, as most quic implementations start their packet numbering at 0, so the next packet could still have its packet number computed properly. The combination of ngtcp using large random values for initial packet numbers, along with the HRR triggering a separate qrx creation on a channel led to the discovery of this discrepancy. The fix seems pretty straightforward. When we detect in port_default_packet_handler, that we have a separate qrx in the new channel, we migrate processed packets from the temporary qrx to the canonical channel qrx. In addition to doing that, we also need to migrate the largest_pn array from the temporary qrx to the channel_qrx so that subsequent frame reception is guaranteed to compute the received frame packet number properly, and as such, compute the proper IV for packet protection decryption. Fixes openssl/project#1296

nhorman

I imagine that we will need to open separate PR's to adjust this patchset for the backports, but it looks good to me. Nicely done.

mattcaswell · 2025-08-06T13:21:37Z

ssl/quic/quic_channel.c

+        && ch->got_remote_transport_params) {
+        ch->max_local_streams_bidi = 0;
+        ch->max_local_streams_uni = 0;
+        ch->got_local_transport_params = 0;


This seems to be a subset of the transport parameters. Why do we not reset all of them?

The subset was chosen based on the checks below in that function (if set the function fails). The other parameters are just rewritten without a check.

ssl/quic/quic_port.c

nhorman · 2025-08-06T14:02:50Z

marking as, and ok with urgent, as this addresses a standing CI failure

Sashan

looks good to me.

Sashan · 2025-08-06T16:06:50Z

the changes related to port_default_packet_handler() are relevant for for 3.5 and later. The earlier versions of OpenSSL don't create qrx to let port to validate packet header before channel is created.

In case those changes will be back ported to 3.4 and earlier versions, then only change to ssl/quic/quic_channel.c is relevant.

Sashan · 2025-08-07T09:26:56Z

the changes related to port_default_packet_handler() are relevant for for 3.5 and later. The earlier versions of OpenSSL don't create qrx to let port to validate packet header before channel is created.

In case those changes will be back ported to 3.4 and earlier versions, then only change to ssl/quic/quic_channel.c is relevant.

My yesterday assessment was wrong. the change is relevant to 3.5 only where QUIC server API appeared. The fix here does not matter for 3.4 and earlier because those version provide only client side API for QUIC.

It's OpenSSL 3.5 where OSSL_QUIC_server_method(3) appears.

jogme · 2025-08-07T09:29:40Z

because of the MR conflict on 3.5 - creating a separate PR for it: #28189

mattcaswell · 2025-08-07T13:07:38Z

We should have a separate test for that, but injecting specific intial packet numbers is not something our quic test suite does easily

Can we not use the fault injector for this purpose?

mattcaswell · 2025-08-07T13:08:26Z

Although it looks like the fault injector might need to be updated a bit. It currently uses tserver.

nhorman · 2025-08-07T13:14:12Z

We could maybe use the fault injector, but I think it would be better to use the channel mutator, as we're not trying to inject a packet into the stream, we need to modify an existing initial packet to have a different (large) packet number encoded into it. Just need to do some looking to ensure that the qtx doesn't track packet numbers independently of the wire encoded frame first.

mattcaswell · 2025-08-07T13:19:48Z

We could maybe use the fault injector, but I think it would be better to use the channel mutator, as we're not trying to inject a packet into the stream, we need to modify an existing initial packet to have a different (large) packet number encoded into it. Just need to do some looking to ensure that the qtx doesn't track packet numbers independently of the wire encoded frame first.

We are talking about the same thing I think. The channel mutator (ossl_quic_channel_set_mutator) exists in order to support the fault injector. It is exposed via ossl_quic_tserver_set_plain_packet_mutator which is there because of the fault injector's packet_plain_mutate callback.

mattcaswell · 2025-08-07T13:28:10Z

The idea here is that you set up the objects for a connection via qtest_create_quic_objects. This gives you a handle on the fault injector which enables you to set a callback when packets arrive via qtest_fault_set_packet_plain_listener. Then you complete the connection via qtest_create_quic_connection. Your callback has the opportunity to mutate the packets in whichever way you see fit.

See for example test_ncid_frame in quic_newcid_test.c. That uses the callback to inject a new frame - but you can equally mutate the data in the packet.

But it does need updating to work with real server objects - not tserver.

openssl-machine · 2025-08-07T17:00:33Z

24 hours has passed since 'approval: done' was set, but as this PR has been updated in that time the label 'approval: ready to merge' is not being automatically set. Please review the updates and set the label manually.

nhorman · 2025-08-07T17:26:48Z

merged to master, thank you!

Signed-off-by: Norbert Pocs <[email protected]> Reviewed-by: Saša Nedvědický <[email protected]> Reviewed-by: Neil Horman <[email protected]> Reviewed-by: Matt Caswell <[email protected]> (Merged from #28115)

When HRR happens a second client hello is sent and it consist of a transport params extension. This must be processed and not cause failure. Fixes: openssl/project#1296 Signed-off-by: Norbert Pocs <[email protected]> Reviewed-by: Saša Nedvědický <[email protected]> Reviewed-by: Neil Horman <[email protected]> Reviewed-by: Matt Caswell <[email protected]> (Merged from #28115)

Recently, our overnight QUIC interop runs began failing in CI when an openssl server was tested against an ngtcp2 client: https://github.com/openssl/openssl/actions/runs/16739736813 The underlying cause bears some explination for historical purposes The problem began happening with a recent update to ngtcp2 in which ngtcp2 updated its wolfssl tls backend to support ML-KEM, which caused ngtcp to emit a client hello message that offered several groups (including X25519MLKEM768) but only provided a keyshare for x25519. This in turn triggered the openssl server to respond with a hello retry request (HRR), requesting an ML-KEM keyshare instead, which ngtcp2 obliged. However all subsequent frames from the client were discarded by the server, due to failing packet body decryption. The problem was tracked down to a mismatch in the initial vectors used by the client and server, leading to an AEAD tag mismatch. Packet protection keys generate their IV's in QUIC by xoring the packet number of the received frame with the base IV as derived via HKDF in the tls layer. The underlying problem was that openssl hit a very odd corner case with how we compute the packet number of the received frame. To save space, QUIC encodes packet numbers using a variable length integer, and only sends the changed bits in the packet number. This requires that the receiver (openssl) store the largest received pn of the connection, which we nominally do. However, in default_port_packet_handler (where QUIC frames are processed prior to having an established channel allocated) we use a temporary qrx to validate the packet protection of those frames. This temporary qrx may be incorporated into the channel in some cases, but is not in the case of a valid frame that generates an HRR at the TLS layer. In this case, the channel allocates its own qrx independently. When this occurs, the largest_pn value of the temporary qrx is lost, and subsequent frames are unable to be received, as the newly allocated qrx belives that the larges_pn for a given pn_space is 0, rather than the value received in the initial frame (which was a complete 32 bit value, rather than just the changed lower 8 bits). As a result the IV construction produced the wrong value, and the decrypt failed on those subsequent frames. Up to this point, that wasn't even a problem, as most quic implementations start their packet numbering at 0, so the next packet could still have its packet number computed properly. The combination of ngtcp using large random values for initial packet numbers, along with the HRR triggering a separate qrx creation on a channel led to the discovery of this discrepancy. The fix seems pretty straightforward. When we detect in port_default_packet_handler, that we have a separate qrx in the new channel, we migrate processed packets from the temporary qrx to the canonical channel qrx. In addition to doing that, we also need to migrate the largest_pn array from the temporary qrx to the channel_qrx so that subsequent frame reception is guaranteed to compute the received frame packet number properly, and as such, compute the proper IV for packet protection decryption. Fixes openssl/project#1296 Reviewed-by: Saša Nedvědický <[email protected]> Reviewed-by: Matt Caswell <[email protected]> (Merged from #28115)

jogme force-pushed the hello_retry_request_failure branch from 04ce37e to f5f584f Compare July 29, 2025 14:17

t8m previously approved these changes Jul 29, 2025

View reviewed changes

jogme dismissed t8m’s stale review via c653753 July 29, 2025 14:53

jogme force-pushed the hello_retry_request_failure branch 2 times, most recently from c653753 to 3c25001 Compare July 30, 2025 06:27

jogme marked this pull request as draft July 30, 2025 08:42

jogme force-pushed the hello_retry_request_failure branch 2 times, most recently from 020afbb to 7b1643c Compare July 31, 2025 13:56

jogme added 2 commits August 5, 2025 13:18

quicapitest: Check if we can handle HRR

4e20276

Signed-off-by: Norbert Pocs <[email protected]>

quic_channel: Handle HRR and the second transport params extension

0b115ef

When HRR happens a second client hello is sent and it consist of a transport params extension. This must be processed and not cause failure. Fixes: openssl/project#1296 Signed-off-by: Norbert Pocs <[email protected]>

jogme force-pushed the hello_retry_request_failure branch from 7b1643c to 0b115ef Compare August 5, 2025 11:19

jogme marked this pull request as ready for review August 5, 2025 11:19

nhorman approved these changes Aug 6, 2025

View reviewed changes

nhorman requested review from Sashan and mattcaswell August 6, 2025 13:03

mattcaswell reviewed Aug 6, 2025

View reviewed changes

ssl/quic/quic_port.c Show resolved Hide resolved

nhorman added the severity: urgent Fixes an urgent issue (exempt from 24h grace period) label Aug 6, 2025

Sashan approved these changes Aug 6, 2025

View reviewed changes

nhorman added approval: done This pull request has the required number of approvals and removed approval: review pending This pull request needs review by a committer labels Aug 6, 2025

nhorman removed branch: 3.2 Applies to openssl-3.2 (EOL) branch: 3.3 Applies to openssl-3.3 branch: 3.4 Applies to openssl-3.4 labels Aug 6, 2025

This was referenced Aug 7, 2025

Backport Fix Hello Retry Request failure in QUIC to 3.5 #28189

Closed

Backport Fix Hello Retry Request failure in QUIC to 3.4 #28190

Closed

jogme closed this Aug 7, 2025

jogme reopened this Aug 7, 2025

jogme removed the branch: 3.5 Applies to openssl-3.5 label Aug 7, 2025

jogme closed this Aug 7, 2025

jogme reopened this Aug 7, 2025

mattcaswell approved these changes Aug 7, 2025

View reviewed changes

nhorman added approval: ready to merge The 24 hour grace period has passed, ready to merge and removed approval: done This pull request has the required number of approvals labels Aug 7, 2025

nhorman closed this Aug 7, 2025

Uh oh!

Conversation

jogme commented Jul 29, 2025

Checklist

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

t8m commented Jul 29, 2025

Uh oh!

jogme commented Jul 29, 2025

Uh oh!

nhorman left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

nhorman commented Aug 6, 2025

Uh oh!

Sashan left a comment

Choose a reason for hiding this comment

Uh oh!

Sashan commented Aug 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Sashan commented Aug 7, 2025

Uh oh!

jogme commented Aug 7, 2025

Uh oh!

mattcaswell commented Aug 7, 2025

Uh oh!

mattcaswell commented Aug 7, 2025

Uh oh!

nhorman commented Aug 7, 2025

Uh oh!

mattcaswell commented Aug 7, 2025

Uh oh!

mattcaswell commented Aug 7, 2025

Uh oh!

openssl-machine commented Aug 7, 2025

Uh oh!

nhorman commented Aug 7, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

Sashan commented Aug 6, 2025 •

edited

Loading