Implementation of per-client UDP readloops#288
The aim is to let the server maintain a dedicated per-client UDP readloop thread, instead of the single global readloop thread we maintain today, in order to improve performance.
- add a new Connect() callback to PacketConnConfig to teach the server how to obtain a connected socket to a client
- move readLoop to internal/server/readLoop.go
- split the server.Request struct into server.State (stuff needed by the readLoop) and server.Request (the rest), which embeds server.State
- let the server invoke the Connect callback once it has successfully created a TURN allocation for a client, to obtain a per-client socket and spawn the readloop for the new socket
- add a WrapConn implementation for net.PacketConn that wraps the net.Conn returned by DialUDP
- add examples/turn-server/udp-connect/main.go to demonstrate the use of the new feature
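To illustrate the WrapConn item in the list above, here is a minimal sketch of how a connected net.Conn can be adapted to the net.PacketConn interface. It is not the code from this patch: the type name connectedPacketConn, the error errWrongPeer and the demo helper are hypothetical, and net.Pipe is used only as an in-memory stand-in for a connected UDP socket.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// errWrongPeer (hypothetical) is returned when WriteTo is asked to send to
// an address other than the peer the underlying socket is connected to.
var errWrongPeer = errors.New("destination does not match the connected peer")

// connectedPacketConn (hypothetical name) adapts a connected net.Conn so it
// can be used where a net.PacketConn is expected. Close, LocalAddr and the
// deadline setters come from the embedded net.Conn.
type connectedPacketConn struct {
	net.Conn
}

// ReadFrom reads one message and reports the connected peer as its source.
func (c *connectedPacketConn) ReadFrom(p []byte) (int, net.Addr, error) {
	n, err := c.Conn.Read(p)
	return n, c.Conn.RemoteAddr(), err
}

// WriteTo only accepts the connected peer as destination.
func (c *connectedPacketConn) WriteTo(p []byte, addr net.Addr) (int, error) {
	if addr == nil || addr.String() != c.Conn.RemoteAddr().String() {
		return 0, errWrongPeer
	}
	return c.Conn.Write(p)
}

// Compile-time check that the adapter satisfies net.PacketConn.
var _ net.PacketConn = (*connectedPacketConn)(nil)

// demo pushes one message through the adapter and checks that a write to a
// foreign address is rejected. It returns the payload that was read.
func demo() (string, error) {
	a, b := net.Pipe() // in-memory stand-in for a connected socket pair
	defer a.Close()
	defer b.Close()

	pc := &connectedPacketConn{Conn: a}
	go b.Write([]byte("ping")) // the "client" side sends one message

	buf := make([]byte, 16)
	n, _, err := pc.ReadFrom(buf)
	if err != nil {
		return "", err
	}

	foreign := &net.UDPAddr{IP: net.IPv4(192, 0, 2, 1), Port: 9}
	if _, err := pc.WriteTo([]byte("x"), foreign); !errors.Is(err, errWrongPeer) {
		return "", fmt.Errorf("expected errWrongPeer, got %v", err)
	}
	return string(buf[:n]), nil
}

func main() {
	payload, err := demo()
	fmt.Println(payload, err)
}
```

Rejecting foreign destinations in WriteTo is a design choice of this sketch; a connected socket cannot send to anyone else anyway, so failing early makes the mistake visible to the caller.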
Codecov Report
Base: 68.65% // Head: 67.52% // Decreases project coverage by
Additional details and impacted files
@@ Coverage Diff @@
## master #288 +/- ##
==========================================
- Coverage 68.65% 67.52% -1.13%
==========================================
Files 38 39 +1
Lines 2469 2522 +53
==========================================
+ Hits 1695 1703 +8
- Misses 641 685 +44
- Partials 133 134 +1
|
What's the CPU usage like when running on a single core? |
|
Depends on the traffic rate and the CPU, but in general a UDP listener can handle about 30-60 thousand packets/sec (roughly 100-200 Mbps depending on the packet size) per CPU core. For comparison, a pure-C TURN server can handle about 100-300 thousand pps and our optimized DPDK-based UDP proxy (no TURN) reaches 10-20 million(!) pps per CPU core. |
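For readers who want to relate the packet rates quoted above to bandwidth, a small sketch of the conversion follows. The 417-byte packet size is purely an assumption, picked because it makes 30-60 kpps land near the quoted 100-200 Mbps; the comment itself does not state a packet size, and the function name mbps is hypothetical.

```go
package main

import "fmt"

// mbps converts a packet rate and a packet size in bytes to megabits per
// second (1 Mbps = 1e6 bits/s). Header overhead is ignored.
func mbps(pps, packetBytes float64) float64 {
	return pps * packetBytes * 8 / 1e6
}

func main() {
	const assumedPacketBytes = 417 // assumption, not taken from the thread
	for _, pps := range []float64{30000, 60000} {
		fmt.Printf("%.0f pps at %d bytes/packet = %.1f Mbps\n",
			pps, assumedPacketBytes, mbps(pps, assumedPacketBytes))
	}
}
```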
|
I meant: how much does this patch reduce single-core performance? |
|
It doesn't. Theoretically, it shouldn't, and in practice my super-rudimentary iperf benchmarks confirm this (62500 pps with 20-30 usec mean delay on localhost both with and without the patch). |
|
Ah, cool. I'll do my best to find time to review your patch, then. |
|
You're going to hate me for this but I think I will double your review load: it seems that we found a much less intrusive solution, see #295. The performance is a bit lower but there are no changes in the server code and the required changes seem more secure. |
|
@rg0now Are you okay with it if we close this PR? |
Already wanted to suggest this... Please go ahead |
This is a first attempt to address #287.
The aim is to allow the server to maintain a specific per-client UDP readloop thread instead of one global readloop, which would allow it to drain each client connection on a separate CPU thread. See more on this here.
Problem: Currently pion/turn server-side performance is limited to about 40-50 kpps per TURN/UDP listener. This is because we allocate a single global net.PacketConn per UDP listener, which is then drained by a single CPU go-routine. This means that all client allocations made via that listener will share the same CPU thread and there is no way to load-balance client allocations across CPUs. This only affects TURN/UDP: for TCP, TLS and DTLS the TURN sockets are connected back to the client and therefore a separate CPU go-routine is created for each allocation.
Proposed solution: Create a separate net.PacketConn for each UDP client allocation, by (1) sharing the same listener server socket using SO_REUSEADDR, (2) connecting each per-allocation connection back to the client, and (3) firing up a separate read-loop/go-routine for each client socket.
Security: Care must be taken in implementing this plan: if we blindly create a new socket per received UDP packet then a simple UDP portscan will DoS the TURN listener. In the proposed solution the per-client socket is created only after a client has made a successful allocation, which rules out the blind port-scan problem since the attacker must at least have a set of valid TURN credentials. It is still not completely safe against DoS attacks though, in that an attacker in possession of valid credentials can still open thousands of server sockets, but this is at least the same level of security as provided by the TURN/TCP implementation.
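The three steps of the proposed solution can be sketched in a few lines of Go. This is a minimal illustration under stated assumptions, not the code in this patch: it assumes Linux SO_REUSEADDR semantics for UDP (other systems may refuse the second bind or deliver packets differently), and the helper names reuseAddr, listenShared, connectClient, readLoop and demo are all hypothetical. The "allocate" packet merely stands in for a successful TURN allocation.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"syscall"
	"time"
)

// reuseAddr sets SO_REUSEADDR before the socket is bound, so that several
// sockets can share one local address (Linux semantics assumed).
func reuseAddr(_, _ string, c syscall.RawConn) error {
	var sockErr error
	err := c.Control(func(fd uintptr) {
		sockErr = syscall.SetsockoptInt(int(fd), syscall.SOL_SOCKET, syscall.SO_REUSEADDR, 1)
	})
	if err != nil {
		return err
	}
	return sockErr
}

// listenShared opens the unconnected listener socket (step 1).
func listenShared(addr string) (net.PacketConn, error) {
	lc := net.ListenConfig{Control: reuseAddr}
	return lc.ListenPacket(context.Background(), "udp4", addr)
}

// connectClient binds a second socket to the listener's address and
// connects it back to a single client (step 2).
func connectClient(local, client net.Addr) (net.Conn, error) {
	d := net.Dialer{LocalAddr: local, Control: reuseAddr}
	return d.Dial("udp4", client.String())
}

// readLoop drains one per-client socket (step 3) until it fails or is closed.
func readLoop(conn net.Conn, out chan<- string) {
	defer close(out)
	buf := make([]byte, 1500)
	for {
		n, err := conn.Read(buf)
		if err != nil {
			return // socket closed or read deadline exceeded
		}
		out <- string(buf[:n])
	}
}

// demo runs the whole flow on loopback and returns the first payload that
// arrives on the per-client socket.
func demo() (string, error) {
	listener, err := listenShared("127.0.0.1:0")
	if err != nil {
		return "", err
	}
	defer listener.Close()

	client, err := net.Dial("udp4", listener.LocalAddr().String())
	if err != nil {
		return "", err
	}
	defer client.Close()

	// The first packet is received on the shared listener.
	if _, err = client.Write([]byte("allocate")); err != nil {
		return "", err
	}
	buf := make([]byte, 1500)
	listener.SetReadDeadline(time.Now().Add(2 * time.Second))
	_, clientAddr, err := listener.ReadFrom(buf)
	if err != nil {
		return "", err
	}

	// Only now create the per-client socket and its own read loop.
	perClient, err := connectClient(listener.LocalAddr(), clientAddr)
	if err != nil {
		return "", err
	}
	defer perClient.Close()
	perClient.SetReadDeadline(time.Now().Add(2 * time.Second))
	out := make(chan string, 1)
	go readLoop(perClient, out)

	// Later packets from this client should reach the connected socket.
	if _, err = client.Write([]byte("data")); err != nil {
		return "", err
	}
	payload, ok := <-out
	if !ok {
		return "", fmt.Errorf("per-client socket received nothing")
	}
	return payload, nil
}

func main() {
	payload, err := demo()
	fmt.Println(payload, err)
}
```

One caveat of this sketch: between bind and connect the new socket is briefly unconnected while sharing the listener's address, so a real implementation would need to consider where packets arriving in that window end up.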
Implementation: Some fairly intrusive changes are required to support this. The patch set in particular:
Performance: Attached is a super-simple script to benchmark a naive TURN server with (udp-connect) and without (simple) the patch. Requires the turncat TURN client (you can obtain from here) in the source dir, plus iperfv2 in the PATH. The script opens 4 client connections and runs a UDP iperf benchmark in each, then every 5 seconds it reports the cumulative packet rate through the tested server. On a 24-core Intel server, the built-in simple server (./test.sh simple 30000000) produced a packet rate of 34816 pps and used 141.1% CPU (recall, there are 4 clients connected to the same listener so theoretically the server could scale out to 4 CPUs), while the server with the patches (./test.sh udp-connect 30000000) produced 140194 pps with 472% CPU usage.
Feedback appreciated.
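As a quick way to read the benchmark figures above, the sketch below derives the throughput gain and the packet rate per fully used core from the reported numbers. The run type and the helper names speedup and ppsPerCore are hypothetical; the only inputs are the pps and CPU percentages quoted in the description.

```go
package main

import "fmt"

// run holds one benchmark result as reported in the PR description.
type run struct {
	name   string
	pps    float64 // cumulative packet rate
	cpuPct float64 // CPU usage in percent (100 = one full core)
}

// speedup is the ratio of the patched packet rate to the baseline rate.
func speedup(base, patched run) float64 {
	return patched.pps / base.pps
}

// ppsPerCore normalizes the packet rate by the number of cores consumed.
func ppsPerCore(r run) float64 {
	return r.pps / (r.cpuPct / 100)
}

func main() {
	simple := run{name: "simple", pps: 34816, cpuPct: 141.1}
	connect := run{name: "udp-connect", pps: 140194, cpuPct: 472}

	fmt.Printf("speedup: %.2fx\n", speedup(simple, connect))
	for _, r := range []run{simple, connect} {
		fmt.Printf("%s: %.0f pps per core\n", r.name, ppsPerCore(r))
	}
}
```

With these inputs the gain is roughly 4x, which matches the 4 benchmark clients, and the rate per core comes out at roughly 24.7 kpps for simple versus 29.7 kpps for udp-connect.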
test.sh.gz