Fix Hello Retry Request failure in QUIC#28115
Conversation
04ce37e to
f5f584f
Compare
| /* | ||
| * When HRR happens the client sends the transport params in the new client | ||
| * hello again. Reset the transport params here and load them again. | ||
| */ | ||
| if (sc->hello_retry_request != SSL_HRR_NONE && ch->got_remote_transport_params) { | ||
| ch->max_local_streams_bidi = 0; | ||
| ch->max_local_streams_uni = 0; | ||
| ch->got_local_transport_params = 0; | ||
| OPENSSL_free(ch->local_transport_params); | ||
| ch->local_transport_params = NULL; |
There was a problem hiding this comment.
Should we in this case actually check that the transport params as sent in the new client hello are the same as the original ones? Hmm... I did not find anything in regards to that in RFCs so perhaps this is OK as is.
There was a problem hiding this comment.
I'm not sure, as the RFC does not mentions this case (at least I did not found it)
There was a problem hiding this comment.
I think, generally what you are doing for this specific case of HRR is ok, as like @t8m theres nothing concrete in the RFC about this. Generally, you're not allowed to reset your transport params during a connection, but because the HRR is sent over an initial frame and triggers a new initial frame from the client, we're in this state where the connection is created from a QUIC standpoint, but expects to go through the channel establishment code again, as though the second initial packet from the client is acting (sort of) like the first. Its ugly from an RFC standpoint, but at least for now, ok.
That said, I think you may need to do a bit more here than just cleanse the transport parameters. Getting transport parameters implies the setting of several other fields in the channel structure (init_dcid, init_scid, etc), that may be different the second time around (i.e. the DCID in the second client frame is adopted from the SCID provided in the HRR), which may be causing some of your CI failures.
There was a problem hiding this comment.
The CI is failing on the "test setup" when it sets the group list. Which I don't understand why at the moment..
When I try to just return 1 and don't reset the option, the failure remains the same tho
There was a problem hiding this comment.
That said, I think you may need to do a bit more here than just cleanse the transport parameters. Getting transport parameters implies the setting of several other fields in the channel structure (init_dcid, init_scid, etc), that may be different the second time around (i.e. the DCID in the second client frame is adopted from the SCID provided in the HRR), which may be causing some of your CI failures.
The CI failures where FIPS related. Do you still think it would be better to check other fields also? My interpretation is, that it should not change anyway; we could also ignore the newly sent parameters and use the initial ones. Maybe that would be more straightforward. Can client hello retry happen because of change one of the transport parameters (or other parameters we are talking about here)?
There was a problem hiding this comment.
Unsure. I think you'll get it to work with this approach (at least it works for me in a local test). Its probably worth doing this change as you have it and addressing subsequent problems as they occur.
|
CI is failing |
|
I probably use algorithms in the test which are not available at some runners. Will take a look |
c653753 to
3c25001
Compare
020afbb to
7b1643c
Compare
Signed-off-by: Norbert Pocs <[email protected]>
When HRR happens a second client hello is sent and it consist of a transport params extension. This must be processed and not cause failure. Fixes: openssl/project#1296 Signed-off-by: Norbert Pocs <[email protected]>
7b1643c to
0b115ef
Compare
Recently, our overnight QUIC interop runs began failing in CI when an openssl server was tested against an ngtcp2 client: https://github.com/openssl/openssl/actions/runs/16739736813 The underlying cause bears some explination for historical purposes The problem began happening with a recent update to ngtcp2 in which ngtcp2 updated its wolfssl tls backend to support ML-KEM, which caused ngtcp to emit a client hello message that offered several groups (including X25519MLKEM768) but only provided a keyshare for x25519. This in turn triggered the openssl server to respond with a hello retry request (HRR), requesting an ML-KEM keyshare instead, which ngtcp2 obliged. However all subsequent frames from the client were discarded by the server, due to failing packet body decryption. The problem was tracked down to a mismatch in the initial vectors used by the client and server, leading to an AEAD tag mismatch. Packet protection keys generate their IV's in QUIC by xoring the packet number of the received frame with the base IV as derived via HKDF in the tls layer. The underlying problem was that openssl hit a very odd corner case with how we compute the packet number of the received frame. To save space, QUIC encodes packet numbers using a variable length integer, and only sends the changed bits in the packet number. This requires that the receiver (openssl) store the largest received pn of the connection, which we nominally do. However, in default_port_packet_handler (where QUIC frames are processed prior to having an established channel allocated) we use a temporary qrx to validate the packet protection of those frames. This temporary qrx may be incorporated into the channel in some cases, but is not in the case of a valid frame that generates an HRR at the TLS layer. In this case, the channel allocates its own qrx independently. When this occurs, the largest_pn value of the temporary qrx is lost, and subsequent frames are unable to be received, as the newly allocated qrx belives that the larges_pn for a given pn_space is 0, rather than the value received in the initial frame (which was a complete 32 bit value, rather than just the changed lower 8 bits). As a result the IV construction produced the wrong value, and the decrypt failed on those subsequent frames. Up to this point, that wasn't even a problem, as most quic implementations start their packet numbering at 0, so the next packet could still have its packet number computed properly. The combination of ngtcp using large random values for initial packet numbers, along with the HRR triggering a separate qrx creation on a channel led to the discovery of this discrepancy. The fix seems pretty straightforward. When we detect in port_default_packet_handler, that we have a separate qrx in the new channel, we migrate processed packets from the temporary qrx to the canonical channel qrx. In addition to doing that, we also need to migrate the largest_pn array from the temporary qrx to the channel_qrx so that subsequent frame reception is guaranteed to compute the received frame packet number properly, and as such, compute the proper IV for packet protection decryption. Fixes openssl/project#1296
nhorman
left a comment
There was a problem hiding this comment.
I imagine that we will need to open separate PR's to adjust this patchset for the backports, but it looks good to me. Nicely done.
| && ch->got_remote_transport_params) { | ||
| ch->max_local_streams_bidi = 0; | ||
| ch->max_local_streams_uni = 0; | ||
| ch->got_local_transport_params = 0; |
There was a problem hiding this comment.
This seems to be a subset of the transport parameters. Why do we not reset all of them?
There was a problem hiding this comment.
The subset was chosen based on the checks below in that function (if set the function fails). The other parameters are just rewritten without a check.
|
marking as, and ok with urgent, as this addresses a standing CI failure |
|
the changes related to In case those changes will be back ported to 3.4 and earlier versions, then only change to |
My yesterday assessment was wrong. the change is relevant to 3.5 only where QUIC server API appeared. The fix here does not matter for 3.4 and earlier because those version provide only client side API for QUIC. It's OpenSSL 3.5 where |
|
because of the MR conflict on 3.5 - creating a separate PR for it: #28189 |
Can we not use the fault injector for this purpose? |
|
Although it looks like the fault injector might need to be updated a bit. It currently uses tserver. |
|
We could maybe use the fault injector, but I think it would be better to use the channel mutator, as we're not trying to inject a packet into the stream, we need to modify an existing initial packet to have a different (large) packet number encoded into it. Just need to do some looking to ensure that the qtx doesn't track packet numbers independently of the wire encoded frame first. |
We are talking about the same thing I think. The channel mutator (ossl_quic_channel_set_mutator) exists in order to support the fault injector. It is exposed via |
|
The idea here is that you set up the objects for a connection via See for example But it does need updating to work with real server objects - not tserver. |
|
24 hours has passed since 'approval: done' was set, but as this PR has been updated in that time the label 'approval: ready to merge' is not being automatically set. Please review the updates and set the label manually. |
|
merged to master, thank you! |
Signed-off-by: Norbert Pocs <[email protected]> Reviewed-by: Saša Nedvědický <[email protected]> Reviewed-by: Neil Horman <[email protected]> Reviewed-by: Matt Caswell <[email protected]> (Merged from #28115)
When HRR happens a second client hello is sent and it consist of a transport params extension. This must be processed and not cause failure. Fixes: openssl/project#1296 Signed-off-by: Norbert Pocs <[email protected]> Reviewed-by: Saša Nedvědický <[email protected]> Reviewed-by: Neil Horman <[email protected]> Reviewed-by: Matt Caswell <[email protected]> (Merged from #28115)
Recently, our overnight QUIC interop runs began failing in CI when an openssl server was tested against an ngtcp2 client: https://github.com/openssl/openssl/actions/runs/16739736813 The underlying cause bears some explination for historical purposes The problem began happening with a recent update to ngtcp2 in which ngtcp2 updated its wolfssl tls backend to support ML-KEM, which caused ngtcp to emit a client hello message that offered several groups (including X25519MLKEM768) but only provided a keyshare for x25519. This in turn triggered the openssl server to respond with a hello retry request (HRR), requesting an ML-KEM keyshare instead, which ngtcp2 obliged. However all subsequent frames from the client were discarded by the server, due to failing packet body decryption. The problem was tracked down to a mismatch in the initial vectors used by the client and server, leading to an AEAD tag mismatch. Packet protection keys generate their IV's in QUIC by xoring the packet number of the received frame with the base IV as derived via HKDF in the tls layer. The underlying problem was that openssl hit a very odd corner case with how we compute the packet number of the received frame. To save space, QUIC encodes packet numbers using a variable length integer, and only sends the changed bits in the packet number. This requires that the receiver (openssl) store the largest received pn of the connection, which we nominally do. However, in default_port_packet_handler (where QUIC frames are processed prior to having an established channel allocated) we use a temporary qrx to validate the packet protection of those frames. This temporary qrx may be incorporated into the channel in some cases, but is not in the case of a valid frame that generates an HRR at the TLS layer. In this case, the channel allocates its own qrx independently. When this occurs, the largest_pn value of the temporary qrx is lost, and subsequent frames are unable to be received, as the newly allocated qrx belives that the larges_pn for a given pn_space is 0, rather than the value received in the initial frame (which was a complete 32 bit value, rather than just the changed lower 8 bits). As a result the IV construction produced the wrong value, and the decrypt failed on those subsequent frames. Up to this point, that wasn't even a problem, as most quic implementations start their packet numbering at 0, so the next packet could still have its packet number computed properly. The combination of ngtcp using large random values for initial packet numbers, along with the HRR triggering a separate qrx creation on a channel led to the discovery of this discrepancy. The fix seems pretty straightforward. When we detect in port_default_packet_handler, that we have a separate qrx in the new channel, we migrate processed packets from the temporary qrx to the canonical channel qrx. In addition to doing that, we also need to migrate the largest_pn array from the temporary qrx to the channel_qrx so that subsequent frame reception is guaranteed to compute the received frame packet number properly, and as such, compute the proper IV for packet protection decryption. Fixes openssl/project#1296 Reviewed-by: Saša Nedvědický <[email protected]> Reviewed-by: Matt Caswell <[email protected]> (Merged from #28115)
When a HRR happens, the clients new hello message contains transport parameters extension which we currently handle with a fail. In this fix I went with the option to drop the first params and reread the new ones. I am not 100% sure if that's the correct way - and not keeping the params from the first client hello - so please correct me if that's wrong.
Unfortunately this does not solve the quic interop CI issue with ngtcp2.
Fixes: openssl/project#1296
Checklist