Implementation of per-client UDP readloops, take 2 #295
Conversation
Codecov Report: Base 68.24% // Head 68.36% // increases project coverage by +0.12%
Additional details and impacted files@@ Coverage Diff @@
## master #295 +/- ##
==========================================
+ Coverage 68.24% 68.36% +0.12%
==========================================
Files 39 39
Lines 2500 2500
==========================================
+ Hits 1706 1709 +3
+ Misses 661 659 -2
+ Partials 133 132 -1
Sorry @rg0now, I didn't have time to review the PR yet.
Force-pushed e59a85a to 8f08e30
Frankly, I would opt for this one, just because it is much less intrusive and more secure. The other PR requires a massive reorg all over the place, plus it is still susceptible to eating up all CPUs. At some point we may consider implementing per-user quotas (see the note), which would fix the security issue at least, but I'm still uncomfortable with the massive change. I've added a note to examples/README.md to point users to the new example and also applied some minor stylistic changes throughout.
Currently, the WASM pipeline fails because the new multithreaded example excludes all files from the package via build tags.
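For context, a build constraint of the following form (a hedged illustration; the exact tags used in the PR may differ) confines a file to Linux builds. If every file in a package carries such a constraint, the package has no buildable files on other targets such as js/wasm, and the build fails with "build constraints exclude all Go files":

```go
//go:build linux
// +build linux

// Hypothetical example file: with this constraint on every file in the
// package, a js/wasm build sees an empty package and errors out.
package main
```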
stv0g
left a comment
I agree with your assessment regarding the two readLoops PRs :)
Also great work on this one to keep it simple and concise 👍
I am just wondering if we maybe want to include this not just as an example but in the package itself. Do you have an opinion on that?
I fear that this might kick off a whole bunch of server variants, each with its own turn.NewXXXServer() function and turn.XXXServerConfig type.
That's something I would like to avoid, to keep it maintainable.
Force-pushed 095a15d to 2f6db80
I wish I understood this but I can't. I guess the CI errs due to the new soft dependency. I've started to think about whether the whole hassle is worth the price: I mean, adding a new dependency (which is not even portable) just so that we can demonstrate a use case that may not be too interesting outside a small audience somehow seems wrong. Maybe an example is overkill? Should I just post a gist and hope Google picks it up? Wdyt?
I'm not sure we want to touch the API and/or
I agree. Let's merge it like this.
Please try to add a
I think it's worth the effort. 👍
Force-pushed 2f6db80 to 651120a
Done. Thanks a lot for the review work; it's extremely useful, I learn a lot.
Seems like your last change is missing some entries in … Can you run …?
Add a demo to show how to use multiple sockets with the SO_REUSEPORT socket option to scale TURN/UDP servers to multiple CPUs.
Force-pushed 651120a to 68141f5
I'm terribly sorry, hope this version will be fine now.
Thanks a lot :) |
This is the second attempt to address #287; see also #288.
Problem: The specific problem we are trying to address is to scale TURN/UDP server listeners beyond a single CPU. Currently we allocate a single global `net.PacketConn` per TURN/UDP listener that is shared across all clients. Since there is only a single `readLoop` to drain the global `PacketConn`, TURN/UDP listeners are practically limited to a single CPU. See more on this here.
Prior attempts: The first draft PR #288 addresses this issue by forking off a separate readloop thread per client, which allows the server to drain each client connection on a separate CPU thread. In order to implement this in a secure way, #288 introduces some fairly intrusive changes to the main TURN server implementation: there is an API change where the user must provide a `Connect` callback that the server calls after a new TURN allocation is created to make the per-client connection, and there is some major refactoring in `internal/server` as well to make this possible. Meanwhile, a security problem is still lurking in the background: without per-user allocation quotas it is still possible for a single high-bandwidth attacker to starve the rest of the clients by creating multiple TURN allocations and consuming excess CPU.

Goal: This PR takes a completely different approach: the aim is to scale TURN/UDP listeners beyond the single-CPU limit while (1) minimizing the changes required in the existing pion/turn implementation (the PR requires absolutely no change to the server code) and (2) allowing the operator to tightly control the number of CPUs allocated to each TURN/UDP listener.

Implementation: The idea is that the caller creates a fixed number of UDP listener sockets that share the same local addr:port using the `SO_REUSEPORT` socket option, and then initializes a separate TURN listener for each socket. Note that when multiple sockets share the same local address with `SO_REUSEPORT`, the kernel load-balances packets across the sockets by the IP five-tuple hash (see here), which makes sure that every packet of a given TURN allocation is processed by the same readLoop, and so the allocation manager doesn't need to be shared across CPU threads. By controlling the number of sockets, the caller can control the number of CPUs allocated to each listener. Since this requires no change at all in the TURN server code (every socket is handled by pion/turn as a separate TURN/UDP listener), the PR contains only a simple extension of the example server (see `examples/turn-server/mult-threaded-udp/main.go`) to implement the caller-side changes.

The pros are simplicity and control over CPU consumption. The cons are (1) that `SO_REUSEPORT` is not portable (we don't see this as a problem: recall that nothing in the server knows anything about whether `SO_REUSEPORT` is in use, all required changes are on the caller side) and (2) that performance is slightly lower than with #288. This is because the hash-based load-balancing used by the kernel to distribute TURN allocations across the CPU threads is prone to an unequal distribution of load across worker threads; this could be taken care of by implementing an adaptive load-balancing algorithm with eBPF (say, using the power-of-two-choices algorithm).

Performance: Attached is a script to benchmark the naive TURN server with (`mult-threaded-udp`) and without (`simple`) the patch. It requires the turncat TURN client (you can obtain it from here) in the source dir, plus iperf v2 in the PATH. On a 24-core Intel server with 8 iperf threads, the built-in simple server (`./test2.sh simple 16000000`) produces a packet rate of roughly 41 kpps (thousand packets per second) and uses about 141% CPU, while the server using 4 `SO_REUSEPORT` sockets and thus 4 readLoop threads (`./test.sh mult-threaded-udp 16000000`) produces 121 kpps with 390% CPU usage. Observe that both the performance and the CPU usage are lower than with #288; this is because of the uneven load distribution (the difference between the least-loaded and the most-loaded thread can be more than 50% in our benchmarks).

Feedback appreciated.
test2.sh.gz