Skip to content

Fix Hello Retry Request failure in QUIC#28115

Closed
jogme wants to merge 3 commits intoopenssl:masterfrom
jogme:hello_retry_request_failure
Closed

Fix Hello Retry Request failure in QUIC#28115
jogme wants to merge 3 commits intoopenssl:masterfrom
jogme:hello_retry_request_failure

Conversation

@jogme
Copy link
Contributor

@jogme jogme commented Jul 29, 2025

When a HRR happens, the clients new hello message contains transport parameters extension which we currently handle with a fail. In this fix I went with the option to drop the first params and reread the new ones. I am not 100% sure if that's the correct way - and not keeping the params from the first client hello - so please correct me if that's wrong.

Unfortunately this does not solve the quic interop CI issue with ngtcp2.

Fixes: openssl/project#1296

Checklist
  • documentation is added or updated
  • tests are added or updated

@jogme jogme force-pushed the hello_retry_request_failure branch from 04ce37e to f5f584f Compare July 29, 2025 14:17
t8m
t8m previously approved these changes Jul 29, 2025
Comment on lines 1336 to 1346
/*
* When HRR happens the client sends the transport params in the new client
* hello again. Reset the transport params here and load them again.
*/
if (sc->hello_retry_request != SSL_HRR_NONE && ch->got_remote_transport_params) {
ch->max_local_streams_bidi = 0;
ch->max_local_streams_uni = 0;
ch->got_local_transport_params = 0;
OPENSSL_free(ch->local_transport_params);
ch->local_transport_params = NULL;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we in this case actually check that the transport params as sent in the new client hello are the same as the original ones? Hmm... I did not find anything in regards to that in RFCs so perhaps this is OK as is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure, as the RFC does not mentions this case (at least I did not found it)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, generally what you are doing for this specific case of HRR is ok, as like @t8m theres nothing concrete in the RFC about this. Generally, you're not allowed to reset your transport params during a connection, but because the HRR is sent over an initial frame and triggers a new initial frame from the client, we're in this state where the connection is created from a QUIC standpoint, but expects to go through the channel establishment code again, as though the second initial packet from the client is acting (sort of) like the first. Its ugly from an RFC standpoint, but at least for now, ok.

That said, I think you may need to do a bit more here than just cleanse the transport parameters. Getting transport parameters implies the setting of several other fields in the channel structure (init_dcid, init_scid, etc), that may be different the second time around (i.e. the DCID in the second client frame is adopted from the SCID provided in the HRR), which may be causing some of your CI failures.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CI is failing on the "test setup" when it sets the group list. Which I don't understand why at the moment..

When I try to just return 1 and don't reset the option, the failure remains the same tho

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That said, I think you may need to do a bit more here than just cleanse the transport parameters. Getting transport parameters implies the setting of several other fields in the channel structure (init_dcid, init_scid, etc), that may be different the second time around (i.e. the DCID in the second client frame is adopted from the SCID provided in the HRR), which may be causing some of your CI failures.

The CI failures where FIPS related. Do you still think it would be better to check other fields also? My interpretation is, that it should not change anyway; we could also ignore the newly sent parameters and use the initial ones. Maybe that would be more straightforward. Can client hello retry happen because of change one of the transport parameters (or other parameters we are talking about here)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unsure. I think you'll get it to work with this approach (at least it works for me in a local test). Its probably worth doing this change as you have it and addressing subsequent problems as they occur.

@t8m t8m added branch: master Applies to master branch triaged: bug The issue/pr is/fixes a bug tests: present The PR has suitable tests present branch: 3.2 Applies to openssl-3.2 (EOL) branch: 3.3 Applies to openssl-3.3 branch: 3.4 Applies to openssl-3.4 branch: 3.5 Applies to openssl-3.5 approval: review pending This pull request needs review by a committer labels Jul 29, 2025
@t8m
Copy link
Member

t8m commented Jul 29, 2025

CI is failing

@jogme
Copy link
Contributor Author

jogme commented Jul 29, 2025

I probably use algorithms in the test which are not available at some runners. Will take a look

@jogme jogme force-pushed the hello_retry_request_failure branch 2 times, most recently from c653753 to 3c25001 Compare July 30, 2025 06:27
@jogme jogme marked this pull request as draft July 30, 2025 08:42
@jogme jogme force-pushed the hello_retry_request_failure branch 2 times, most recently from 020afbb to 7b1643c Compare July 31, 2025 13:56
jogme added 2 commits August 5, 2025 13:18
When HRR happens a second client hello is sent and it consist of a
transport params extension. This must be processed and not cause
failure.

Fixes: openssl/project#1296

Signed-off-by: Norbert Pocs <[email protected]>
@jogme jogme force-pushed the hello_retry_request_failure branch from 7b1643c to 0b115ef Compare August 5, 2025 11:19
@jogme jogme marked this pull request as ready for review August 5, 2025 11:19
Recently, our overnight QUIC interop runs began failing in CI when an
openssl server was tested against an ngtcp2 client:
https://github.com/openssl/openssl/actions/runs/16739736813

The underlying cause bears some explination for historical purposes

The problem began happening with a recent update to ngtcp2 in which
ngtcp2 updated its wolfssl tls backend to support ML-KEM, which caused
ngtcp to emit a client hello message that offered several groups
(including X25519MLKEM768) but only provided a keyshare for x25519.
This in turn triggered the openssl server to respond with a hello retry
request (HRR), requesting an ML-KEM keyshare instead, which ngtcp2
obliged. However all subsequent frames from the client were discarded by
the server, due to failing packet body decryption.

The problem was tracked down to a mismatch in the initial vectors used
by the client and server, leading to an AEAD tag mismatch.

Packet protection keys generate their IV's in QUIC by xoring the packet
number of the received frame with the base IV as derived via HKDF in the
tls layer.

The underlying problem was that openssl hit a very odd corner case with
how we compute the packet number of the received frame.  To save space,
QUIC encodes packet numbers using a variable length integer, and only
sends the changed bits in the packet number.  This requires that the
receiver (openssl) store the largest received pn of the connection,
which we nominally do.

However, in default_port_packet_handler (where QUIC frames are processed
prior to having an established channel allocated) we use a temporary qrx
to validate the packet protection of those frames.  This temporary qrx
may be incorporated into the channel in some cases, but is not in the
case of a valid frame that generates an HRR at the TLS layer.  In this
case, the channel allocates its own qrx independently.  When this
occurs, the largest_pn value of the temporary qrx is lost, and
subsequent frames are unable to be received, as the newly allocated qrx
belives that the larges_pn for a given pn_space is 0, rather than the
value received in the initial frame (which was a complete 32 bit value,
rather than just the changed lower 8 bits).  As a result the IV
construction produced the wrong value, and the decrypt failed on those
subsequent frames.

Up to this point, that wasn't even a problem, as most quic
implementations start their packet numbering at 0, so the next packet
could still have its packet number computed properly.  The combination
of ngtcp using large random values for initial packet numbers, along
with the HRR triggering a separate qrx creation on a channel led to the
discovery of this discrepancy.

The fix seems pretty straightforward.  When we detect in
port_default_packet_handler, that we have a separate qrx in the new
channel, we migrate processed packets from the temporary qrx to the
canonical channel qrx.  In addition to doing that, we also need to
migrate the largest_pn array from the temporary qrx to the channel_qrx
so that subsequent frame reception is guaranteed to compute the received
frame packet number properly, and as such, compute the proper IV for
packet protection decryption.

Fixes openssl/project#1296
Copy link
Contributor

@nhorman nhorman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I imagine that we will need to open separate PR's to adjust this patchset for the backports, but it looks good to me. Nicely done.

@nhorman nhorman requested review from Sashan and mattcaswell August 6, 2025 13:03
&& ch->got_remote_transport_params) {
ch->max_local_streams_bidi = 0;
ch->max_local_streams_uni = 0;
ch->got_local_transport_params = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be a subset of the transport parameters. Why do we not reset all of them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The subset was chosen based on the checks below in that function (if set the function fails). The other parameters are just rewritten without a check.

@nhorman nhorman added the severity: urgent Fixes an urgent issue (exempt from 24h grace period) label Aug 6, 2025
@nhorman
Copy link
Contributor

nhorman commented Aug 6, 2025

marking as, and ok with urgent, as this addresses a standing CI failure

Copy link
Contributor

@Sashan Sashan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me.

@nhorman nhorman added approval: done This pull request has the required number of approvals and removed approval: review pending This pull request needs review by a committer labels Aug 6, 2025
@Sashan
Copy link
Contributor

Sashan commented Aug 6, 2025

the changes related to port_default_packet_handler() are relevant for for 3.5 and later. The earlier versions of OpenSSL don't create qrx to let port to validate packet header before channel is created.

In case those changes will be back ported to 3.4 and earlier versions, then only change to ssl/quic/quic_channel.c is relevant.

@nhorman nhorman removed branch: 3.2 Applies to openssl-3.2 (EOL) branch: 3.3 Applies to openssl-3.3 branch: 3.4 Applies to openssl-3.4 labels Aug 6, 2025
@Sashan
Copy link
Contributor

Sashan commented Aug 7, 2025

the changes related to port_default_packet_handler() are relevant for for 3.5 and later. The earlier versions of OpenSSL don't create qrx to let port to validate packet header before channel is created.

In case those changes will be back ported to 3.4 and earlier versions, then only change to ssl/quic/quic_channel.c is relevant.

My yesterday assessment was wrong. the change is relevant to 3.5 only where QUIC server API appeared. The fix here does not matter for 3.4 and earlier because those version provide only client side API for QUIC.

It's OpenSSL 3.5 where OSSL_QUIC_server_method(3) appears.

@jogme jogme closed this Aug 7, 2025
@jogme jogme reopened this Aug 7, 2025
@jogme jogme removed the branch: 3.5 Applies to openssl-3.5 label Aug 7, 2025
@jogme
Copy link
Contributor Author

jogme commented Aug 7, 2025

because of the MR conflict on 3.5 - creating a separate PR for it: #28189

@jogme jogme closed this Aug 7, 2025
@jogme jogme reopened this Aug 7, 2025
@mattcaswell
Copy link
Member

We should have a separate test for that, but injecting specific intial packet numbers is not something our quic test suite does easily

Can we not use the fault injector for this purpose?

@mattcaswell
Copy link
Member

Although it looks like the fault injector might need to be updated a bit. It currently uses tserver.

@nhorman
Copy link
Contributor

nhorman commented Aug 7, 2025

We could maybe use the fault injector, but I think it would be better to use the channel mutator, as we're not trying to inject a packet into the stream, we need to modify an existing initial packet to have a different (large) packet number encoded into it. Just need to do some looking to ensure that the qtx doesn't track packet numbers independently of the wire encoded frame first.

@mattcaswell
Copy link
Member

We could maybe use the fault injector, but I think it would be better to use the channel mutator, as we're not trying to inject a packet into the stream, we need to modify an existing initial packet to have a different (large) packet number encoded into it. Just need to do some looking to ensure that the qtx doesn't track packet numbers independently of the wire encoded frame first.

We are talking about the same thing I think. The channel mutator (ossl_quic_channel_set_mutator) exists in order to support the fault injector. It is exposed via ossl_quic_tserver_set_plain_packet_mutator which is there because of the fault injector's packet_plain_mutate callback.

@mattcaswell
Copy link
Member

The idea here is that you set up the objects for a connection via qtest_create_quic_objects. This gives you a handle on the fault injector which enables you to set a callback when packets arrive via qtest_fault_set_packet_plain_listener. Then you complete the connection via qtest_create_quic_connection. Your callback has the opportunity to mutate the packets in whichever way you see fit.

See for example test_ncid_frame in quic_newcid_test.c. That uses the callback to inject a new frame - but you can equally mutate the data in the packet.

But it does need updating to work with real server objects - not tserver.

@openssl-machine
Copy link
Collaborator

24 hours has passed since 'approval: done' was set, but as this PR has been updated in that time the label 'approval: ready to merge' is not being automatically set. Please review the updates and set the label manually.

@nhorman nhorman added approval: ready to merge The 24 hour grace period has passed, ready to merge and removed approval: done This pull request has the required number of approvals labels Aug 7, 2025
@nhorman
Copy link
Contributor

nhorman commented Aug 7, 2025

merged to master, thank you!

openssl-machine pushed a commit that referenced this pull request Aug 7, 2025
Signed-off-by: Norbert Pocs <[email protected]>

Reviewed-by: Saša Nedvědický <[email protected]>
Reviewed-by: Neil Horman <[email protected]>
Reviewed-by: Matt Caswell <[email protected]>
(Merged from #28115)
openssl-machine pushed a commit that referenced this pull request Aug 7, 2025
When HRR happens a second client hello is sent and it consist of a
transport params extension. This must be processed and not cause
failure.

Fixes: openssl/project#1296

Signed-off-by: Norbert Pocs <[email protected]>

Reviewed-by: Saša Nedvědický <[email protected]>
Reviewed-by: Neil Horman <[email protected]>
Reviewed-by: Matt Caswell <[email protected]>
(Merged from #28115)
openssl-machine pushed a commit that referenced this pull request Aug 7, 2025
Recently, our overnight QUIC interop runs began failing in CI when an
openssl server was tested against an ngtcp2 client:
https://github.com/openssl/openssl/actions/runs/16739736813

The underlying cause bears some explination for historical purposes

The problem began happening with a recent update to ngtcp2 in which
ngtcp2 updated its wolfssl tls backend to support ML-KEM, which caused
ngtcp to emit a client hello message that offered several groups
(including X25519MLKEM768) but only provided a keyshare for x25519.
This in turn triggered the openssl server to respond with a hello retry
request (HRR), requesting an ML-KEM keyshare instead, which ngtcp2
obliged. However all subsequent frames from the client were discarded by
the server, due to failing packet body decryption.

The problem was tracked down to a mismatch in the initial vectors used
by the client and server, leading to an AEAD tag mismatch.

Packet protection keys generate their IV's in QUIC by xoring the packet
number of the received frame with the base IV as derived via HKDF in the
tls layer.

The underlying problem was that openssl hit a very odd corner case with
how we compute the packet number of the received frame.  To save space,
QUIC encodes packet numbers using a variable length integer, and only
sends the changed bits in the packet number.  This requires that the
receiver (openssl) store the largest received pn of the connection,
which we nominally do.

However, in default_port_packet_handler (where QUIC frames are processed
prior to having an established channel allocated) we use a temporary qrx
to validate the packet protection of those frames.  This temporary qrx
may be incorporated into the channel in some cases, but is not in the
case of a valid frame that generates an HRR at the TLS layer.  In this
case, the channel allocates its own qrx independently.  When this
occurs, the largest_pn value of the temporary qrx is lost, and
subsequent frames are unable to be received, as the newly allocated qrx
belives that the larges_pn for a given pn_space is 0, rather than the
value received in the initial frame (which was a complete 32 bit value,
rather than just the changed lower 8 bits).  As a result the IV
construction produced the wrong value, and the decrypt failed on those
subsequent frames.

Up to this point, that wasn't even a problem, as most quic
implementations start their packet numbering at 0, so the next packet
could still have its packet number computed properly.  The combination
of ngtcp using large random values for initial packet numbers, along
with the HRR triggering a separate qrx creation on a channel led to the
discovery of this discrepancy.

The fix seems pretty straightforward.  When we detect in
port_default_packet_handler, that we have a separate qrx in the new
channel, we migrate processed packets from the temporary qrx to the
canonical channel qrx.  In addition to doing that, we also need to
migrate the largest_pn array from the temporary qrx to the channel_qrx
so that subsequent frame reception is guaranteed to compute the received
frame packet number properly, and as such, compute the proper IV for
packet protection decryption.

Fixes openssl/project#1296

Reviewed-by: Saša Nedvědický <[email protected]>
Reviewed-by: Matt Caswell <[email protected]>
(Merged from #28115)
@nhorman nhorman closed this Aug 7, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approval: ready to merge The 24 hour grace period has passed, ready to merge branch: master Applies to master branch severity: urgent Fixes an urgent issue (exempt from 24h grace period) tests: present The PR has suitable tests present triaged: bug The issue/pr is/fixes a bug

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Fix QUIC issue with HELLO_RETRY_REQUEST dropping the connection

6 participants