Conversation

@rzikm
Member

@rzikm rzikm commented Nov 3, 2020

This PR adds an initial version of an almost C#-only QUIC implementation as an alternative to the MsQuic-based implementation. The work originated as my master's thesis.

Supported QUIC Protocol Features

This implementation is very much a work in progress. It implements the QUIC protocol at the draft-27 version. The goal was to obtain a minimal working
implementation able to reliably transmit data between client and server, which can then be used to evaluate the viability of a purely C# implementation for production releases. Most transport-unrelated features are
left unimplemented.

Currently implemented:

  • Basic connection establishment
  • Encryption
  • TLS integration backed by modified OpenSSL
  • Sending data, flow control, loss recovery, congestion control
  • Stream/Connection termination
  • Coalescing packets into a single UDP datagram

Unsupported QUIC Protocol Features

Highlights of the unimplemented protocol features:

  • Connection migration
  • Stateless reset
  • 0-RTT data
  • Server Preferred Address
  • Version Negotiation
  • Address Validation
  • Encryption key updates

OpenSSL integration

To function correctly, the implementation requires a custom branch of OpenSSL from Akamai:
https://github.com/akamai/openssl/tree/OpenSSL_1_1_1g-quic. The implementation tries to detect whether
a QUIC-supporting OpenSSL is present on the path. If not, a mock implementation is used instead.

The mock implementation can be used to try the implementation locally, but interoperation with other
QUIC implementations is, of course, possible only with the OpenSSL-based TLS. If you want to make sure the
implementation runs with OpenSSL, define the DOTNETQUIC_OPENSSL environment variable. This will
cause the implementation to throw if the proper OpenSSL version cannot be loaded.
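For example, in a bash shell (only the fact that the variable is defined matters; the library path below is illustrative):

```shell
# Require the QUIC-capable OpenSSL; throw instead of falling back to mock TLS.
export DOTNETQUIC_OPENSSL=1
# The custom OpenSSL build must be discoverable, e.g. (path illustrative):
export LD_LIBRARY_PATH="$HOME/akamai-openssl/lib:$LD_LIBRARY_PATH"
```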

Tracing and qvis

The implementation can produce traces that can be consumed by the https://qvis.edm.uhasselt.be
visualizer. To collect traces, set the DOTNETQUIC_TRACE environment variable to a nonempty value. The traces are
saved in the working directory in a file named [timestamp]-[server|client].qvis.
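For example (bash; any nonempty value enables tracing):

```shell
# Enable trace collection for the qvis visualizer.
export DOTNETQUIC_TRACE=1
# Traces are written to the working directory as [timestamp]-[server|client].qvis.
```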

Disabling encryption

Regardless of the TLS backend used, you can circumvent encryption by setting the DOTNETQUIC_NOENCRYPT environment variable to a nonempty value.


Ported from dotnet/runtime#43948

@rzikm rzikm changed the title from "Master managed quic" to "Prototype Managed QUIC support implementation" Nov 3, 2020
@geoffkizer
Contributor

Thanks for posting @rzikm.... I will review and comment on any issues.

It looks like a couple of commits from runtime/master got added into your PR, likely because you rebased there and then pushed here, and those commits don't exist in runtimelab yet. I will merge the latest from runtime/master in here and hopefully this will fix that.

@rzikm
Member Author

rzikm commented Nov 4, 2020

Yes, I struggled a bit with creating a PR into this repo, since runtimelab and runtime are unrelated in GitHub's eyes.

Anyway, I am wondering whether it would be better to squash all those 250 or so commits. After the multiple rebases and conflict resolutions, it is unlikely that an arbitrary commit in the middle is compilable, let alone functional. Thoughts?

@geoffkizer
Contributor

geoffkizer commented Nov 4, 2020

Anyway, I am wondering whether it would be better to squash all those 250 or so commits. After the multiple rebases and conflict resolutions, it is unlikely that an arbitrary commit in the middle is compilable, let alone functional. Thoughts?

It's probably worth squashing at least to some degree. There's always a tradeoff here of maintaining the history for documentation purposes (e.g., see exactly what we changed to enable some feature) vs the cost of doing so and complexity of having so many commits.

If you want to squash it all to one commit, I'd be fine with that. If you want to squash it down to some set of steps (i.e., commits) that retains some useful history/context that we might care about preserving for future reference, I'm fine with that too. Up to you.

EDIT: That said, please don't squash the changes that got picked up from runtime/master :) The merge should take care of those, hopefully...

@geoffkizer
Contributor

BTW, quick note: You shouldn't need #nullable enable in most of these files (at least source files; test code is different) because it's enabled at the project level.
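For reference, nullability is typically switched on once per project rather than per file; a representative MSBuild fragment (the exact location of the property group varies per project):

```xml
<PropertyGroup>
  <!-- Enables nullable reference types for every source file in the project. -->
  <Nullable>enable</Nullable>
</PropertyGroup>
```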

@geoffkizer
Contributor

CI is failing with this error: .packages\microsoft.dotnet.build.tasks.targetframework.sdk\6.0.0-beta.20529.1\build\Microsoft.DotNet.Build.Tasks.TargetFramework.Sdk.targets(66,5): error MSB3202: (NETCORE_ENGINEERING_TELEMETRY=Build) The project file "....\System.Runtime.InteropServices.JavaScript/src/System.Runtime.InteropServices.JavaScript.csproj" was not found.

Do you know if that's related to this PR or not? It's possible that it's a pre-existing issue since this branch just got set up. I'll investigate once I get the merging stuff figured out, but thought I'd mention it now.

@geoffkizer
Contributor

BTW, the merge will also bring over this change: dotnet/runtime#44223

This should simplify handling of the "browser" build, but may cause merge conflicts with your PR, not sure...

@geoffkizer
Contributor

It looks like there are a bunch of changes to clean up "using" statements in the MsQuic and Mock providers. We should remove these changes as it's likely to just lead to merge conflicts.

@geoffkizer
Contributor

Similarly, there are some changes to System.Net.Http, and some changes to the shared QUIC code (like adding a QuicConnection constructor). These may be good and/or necessary changes, but we probably shouldn't be making them here.

We should either make these changes in runtime/master or drop them.

In general, we should minimize changes in this tree outside of the managed implementation proper. If they are good changes, they should go into runtime/master. And making those sort of changes here can lead to merge nightmares later.

@rzikm rzikm force-pushed the master-managed-quic branch from 9e01d37 to 4fdfb5c Compare November 7, 2020 17:48
@geoffkizer
Contributor

Looks like there are a few issues with Quic functional tests. Do you know what's happening here?

Also, looks like there are a bunch of issues with Http functional tests -- I'm guessing there is some basic issue here, haven't investigated though...

@rzikm
Member Author

rzikm commented Nov 7, 2020

Also, looks like there are a bunch of issues with Http functional tests -- I'm guessing there is some basic issue here, haven't investigated though...

That is baffling; I have reverted all changes in the System.Net.Http subtree, so it should behave exactly as upstream.

Looks like there are a few issues with Quic functional tests. Do you know what's happening here?

I will try to investigate sometime soon.

@geoffkizer
Contributor

The reason the Http functional tests are failing seems to be that you've changed QuicImplementationProviders.Default to be the managed provider, and it returns true for IsSupported. There are some HTTP version selection tests that run conditionally based on this, and try to actually verify the expected version selection behavior by issuing HTTP requests against the loopback server.

It's not entirely clear to me whether these tests are actually correct or not -- some of them look a little wonky. But regardless, they weren't running in CI before because QuicImplementationProviders.Default.IsSupported was returning false, and now it's returning true, so they are running.

@geoffkizer
Contributor

It's probably fine to change the default QUIC provider to the managed provider in this branch -- after all that's the point of this branch.

But what seems undesirable is that, when OpenSsl is not available, the managed provider will fall back to its mock TLS implementation, which means that code (like these tests) will think they can actually use QUIC when what they are actually getting is mock TLS QUIC.

I think we should separate out the mock TLS handling into a separate QUIC provider. That is, we should have:
(1) A managed QUIC provider that does real QUIC. This can be the default. This depends on OpenSsl, so it will return false from IsSupported if it can't find OpenSsl.
(2) A mock TLS managed QUIC provider -- something like ManagedMockTls. This doesn't depend on OpenSsl, so should always return true for IsSupported.

And we should run tests against both of these. The real managed provider won't run in CI, at least for now, because there's no OpenSsl. But at least the Mock TLS provider will run, which is valuable.
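A rough sketch of what that split could look like (hedged: QuicImplementationProvider is the existing abstract base, but the derived type names and the OpenSslQuic.IsAvailable probe are illustrative placeholders, and the remaining abstract members are omitted):

```csharp
// Sketch only -- not the actual runtime code.
public class ManagedQuicImplementationProvider : QuicImplementationProvider
{
    // Real QUIC: supported only when the QUIC-capable OpenSSL can be loaded.
    public override bool IsSupported => OpenSslQuic.IsAvailable; // hypothetical probe

    // ... connection/listener factory members omitted ...
}

public class ManagedMockTlsQuicImplementationProvider : QuicImplementationProvider
{
    // Mock TLS has no native dependency, so this is always supported.
    public override bool IsSupported => true;

    // ... connection/listener factory members omitted ...
}
```

With this split, code that checks Default.IsSupported would no longer silently get mock TLS when OpenSSL is missing.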

@geoffkizer
Contributor

BTW, I verified that changing the default provider back to MsQuic makes the Http tests pass, because Default.IsSupported returns false now.

@geoffkizer
Contributor

I've manually ported the QUIC API changes back to dotnet/runtime. See here: dotnet/runtime#44380

Once this gets in, I will merge it to the feature branch here and we should be able to rebase and get rid of the API changes in this PR.

@geoffkizer
Contributor

FWIW, I can't easily reproduce any of the QUIC test failures locally. They seem to happen pretty consistently in CI. This usually means that the failures only happen under high CPU load, as occurs in CI.

@geoffkizer
Contributor

geoffkizer commented Nov 8, 2020

Though I do notice the functional tests take much longer than they used to -- ~60s on my machine vs ~3s with just the mock provider.

Any idea why the tests take so long?

@geoffkizer
Contributor

We should enable the new stream conformance tests for the managed provider. They are here: https://github.com/dotnet/runtimelab/blob/feature/ManagedQuic/src/libraries/System.Net.Quic/tests/FunctionalTests/QuicStreamConnectedStreamConformanceTests.cs

I tried doing this locally, but unfortunately I got a bunch of errors and some test hangs. I'm guessing the flush behavior on managed QuicStream is causing problems.

We may want to just add auto-flushing on every stream write, as a way to get these tests passing.
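A minimal sketch of that auto-flush idea (assuming the managed QuicStream separates buffering from flushing; WriteToSendBuffer is a hypothetical placeholder for the internal buffering step):

```csharp
// Sketch only: flush after every write so written data becomes immediately
// observable to the peer, matching what the conformance tests expect.
public override async ValueTask WriteAsync(ReadOnlyMemory<byte> buffer,
    CancellationToken cancellationToken = default)
{
    await WriteToSendBuffer(buffer, cancellationToken).ConfigureAwait(false); // hypothetical
    await FlushAsync(cancellationToken).ConfigureAwait(false);
}
```

This trades some throughput for correctness; batching could be reintroduced later once the tests pass.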

@geoffkizer
Contributor

Also re conformance tests: Several tests are currently disabled because of the non-standard behavior of QuicStream on dispose. That's fine for now, but the plan is to fix the behavior of QuicStream on dispose to align with standard stream behavior. We should make this change for managed QuicStream and then re-enable these tests.

I recently merged a PR to do this for the mock provider, see here: dotnet/runtime#44282

@geoffkizer
Contributor

We should also enable some of the HTTP3 tests in System.Net.Http to run against the managed provider.

The tests are here: https://github.com/dotnet/runtime/blob/master/src/libraries/System.Net.Http/tests/FunctionalTests/SocketsHttpHandlerTest.cs#L3008

However, we should address issues with the stream conformance tests before attempting this.

@stephentoub
Member

I'm guessing the flush behavior on managed QuicStream is causing problems.

There's a property you can override on the test base class to say that writes require flushing.
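Such an override might look like this (the property name FlushRequiredToWriteData is assumed from the stream conformance test base classes in dotnet/runtime; verify against the actual base class):

```csharp
public sealed class ManagedProviderQuicStreamConformanceTests
    : QuicStreamConnectedStreamConformanceTests
{
    // Tell the harness that written data may not be visible to the
    // reader until the writer flushes.
    protected override bool FlushRequiredToWriteData => true;
}
```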

@geoffkizer
Contributor

There's a property you can override on the test base class to say that writes require flushing.

Good point, thanks. Unfortunately, this doesn't seem to help. There's some other issue going on here.

@rzikm
Member Author

rzikm commented Dec 15, 2020

/azp run runtime-libraries-coreclr outerloop

@azure-pipelines

Commenter does not have sufficient privileges for PR 293 in repo dotnet/runtimelab

@rzikm
Member Author

rzikm commented Dec 18, 2020

@geoffkizer, in the last test run all QUIC tests passed; the only failed test is in Sockets and HTTP2:

https://helix.dot.net/api/2019-06-17/jobs/9e1d5026-ee81-4786-99cb-bfcfd335697e/workitems/System.Net.Http.Functional.Tests/console

This feels like a random, unrelated failure. Could you rerun the pipeline?

@rzikm rzikm force-pushed the master-managed-quic branch from 00307d6 to eb8faa9 Compare December 19, 2020 09:54
@rzikm rzikm force-pushed the master-managed-quic branch from e7a69ba to 72c8205 Compare December 22, 2020 22:53
@geoffkizer
Contributor

Sorry, lost track of this. It looks like there are only minor failures in CI. Do you know what the issue here is?

@rzikm
Member Author

rzikm commented Feb 16, 2021

I did not have much time to spend on it because I was too busy writing my master's thesis text. But there seems to be some race condition leading to a deadlock/livelock of either client or server. Sadly, it does not seem to be isolated to a single test that could be disabled.

What I find quite strange is that the probability of failure seems to depend on the order in which the tests are run. I will try to investigate again this week when I have time.

@DemiMarie

I did not have much time to spend on it because I was too busy writing my master's thesis text. But there seems to be some race condition leading to a deadlock/livelock of either client or server. Sadly, it does not seem to be isolated to a single test that could be disabled.

What I find quite strange is that the probability of failure seems to depend on the order in which the tests are run. I will try to investigate again this week when I have time.

The Linux netem framework (part of tc) is fantastic for this. I used it to find quinn-rs/quinn#692, which was a bug in another QUIC implementation with similar symptoms.
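For reference, a minimal netem setup of the kind suggested (run as root; the interface and impairment values are illustrative):

```shell
# Add 50 ms delay and 1% packet loss to loopback traffic.
tc qdisc add dev lo root netem delay 50ms loss 1%
# ... run the flaky tests here ...
# Remove the impairment when done.
tc qdisc del dev lo root
```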

@rzikm
Member Author

rzikm commented Feb 16, 2021

The Linux netem framework (part of tc) is fantastic for this. I used it to find quinn-rs/quinn#692, which was a bug in another QUIC implementation with similar symptoms.

I am working primarily on Windows now, but simulating arbitrary network issues might help track this down, or at least reproduce it on my machine (as far as I know, the tests fail only in CI). Thanks for the idea.

@DemiMarie

The Linux netem framework (part of tc) is fantastic for this. I used it to find quinn-rs/quinn#692, which was a bug in another QUIC implementation with similar symptoms.

I am working primarily on Windows now, but simulating arbitrary network issues might help track this down, or at least reproduce it on my machine (as far as I know, the tests fail only in CI). Thanks for the idea.

You’re welcome!

@rzikm rzikm closed this Sep 7, 2022