Skip to content

Implementation of per-client UDP readloops, take 2#295

Merged
stv0g merged 1 commit intopion:masterfrom
l7mp:feature-udp-per-client-readloop-2
Feb 22, 2023
Merged

Implementation of per-client UDP readloops, take 2#295
stv0g merged 1 commit intopion:masterfrom
l7mp:feature-udp-per-client-readloop-2

Conversation

@rg0now
Copy link
Copy Markdown
Contributor

@rg0now rg0now commented Jan 30, 2023

This is the second attempt to address #287, see also #288

Problem: The specific problem we are trying to address is to scale TURN/UDP server listeners beyond a single CPU. Currently we allocate a single global net.PacketConn per TURN/UDP listener that is shared across all clients. Since there is only a single readLoop to drain the global PacketConn, TURN/UDP listeners are practically limited to a single CPU. See more on this here.

Prior attempts: The first draft PR #288 addresses this issue by forking off a separate readloop thread per each client, which allows the server to drain each client connection on a separate CPU thread. In order to implement this in a secure way #288 introduces some fairly intrusive changes to the main TURN server implementation: there is an API change where the user must provide a Connect callback that the server calls after a new TURN allocation is created to make the per-client connection, and there is some major refactoring in internal/server as well to make this possible. Meanwhile, a security problem is still lurking in the background: without the implementation of per-user allocation quotas it is still possible for a single high-bandwidth attacker to starve the rest of the clients by creating multiple TURN allocations and consuming excess CPU.

Goal: This PR takes a completely different approach: the aim is to scale UDP/TURN listeners beyond the single-CPU limit but meanwhile (1) minimize the changes required in the existing pion/turn implementation (the PR requires absolutely no change to the server code) and (2) allow the operator to tightly control the number of CPUs allocated to each TURN/UDP listener.

Implementation: The idea is that the caller creates a fixed number of UDP listener sockets that share the same local addr:port by using the SO_REUSEPORT socket option and then initializes a separate TURN listener per each socket. Note that when multiple sockets share the same local address with SO_REUSEPORT then the kernel will load-balance packets across the sockets by the IP five-tuple hash (see here), which makes sure that each packet of a TURN allocation will be processed by the same readLoop and so the allocation manager doesn't need to be shared across CPU threads. By controlling the number of sockets, the caller can control the number of CPUs allocated to each listener. Since this requires no change at all in the TURN server code (every socket will be handled by pion/turn as a separate UDP/TURN listener), the PR contains only a simple extension of the example server (see examples/turn-server/mult-threaded-udp/main.go) to implement the caller-side changes.

The pros are simplicity and control over CPU consumption. The cons are (1) that SO_REUSEPORT is not portable (we don't see this as a problem: recall that nothing in the server knows anything about whether SO_REUSEPORT is in use, all required changes are on the caller side) and (2) performance is slightly smaller than with #288. This is because hash-based load-balancing used by the kernel to distribute TURN allocations across the CPU threads is prone to an unequal distribution of load across worker threads: this can be taken care of by implementing an adaptive load-balancing algorithm with eBPF (say, using the power-of-two choices algorithm).

Performance: Attached is a script to benchmark the naive TURN server with (mult-threaded-udp) and without (simple) the patch. Requires the turncat TURN client (you can obtain from here) in the source dir, plus iperfv2 in the PATH. On a 24-core Intel server and 8 iperf threads, the built-in simple server (./test2.sh simple 16000000) produces roughly 41 kpps (thousand packets per second) packet rate and uses about 141% CPU, while the server using 4 SO_REUSEPORT sockets and so 4 readLoop threads (./test.sh mult-threaded-udp 16000000) produces 121 kpps with 390% CPU usage. Observe that both the performance and the CPU usage is smaller than with #288, this is because of the uneven load-distribution (the difference between the least-loaded thread and the most-loaded thread can be more than 50% in our benchmarks).

Feedback appreciated.
test2.sh.gz

@codecov
Copy link
Copy Markdown

codecov bot commented Jan 30, 2023

Codecov Report

Base: 68.24% // Head: 68.36% // Increases project coverage by +0.12% 🎉

Coverage data is based on head (68141f5) compared to base (e63897d).
Patch has no changes to coverable lines.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #295      +/-   ##
==========================================
+ Coverage   68.24%   68.36%   +0.12%     
==========================================
  Files          39       39              
  Lines        2500     2500              
==========================================
+ Hits         1706     1709       +3     
+ Misses        661      659       -2     
+ Partials      133      132       -1     
Flag Coverage Δ
go 68.36% <ø> (+0.12%) ⬆️
wasm 44.69% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
internal/server/turn.go 60.37% <0.00%> (+1.11%) ⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@stv0g
Copy link
Copy Markdown
Member

stv0g commented Feb 2, 2023

Sorry @rg0now, that I didnt had time to review the PR yet.
I will do so later today or tomorrow. But I am currious, which of the two approaches do you prefer?

@rg0now rg0now force-pushed the feature-udp-per-client-readloop-2 branch from e59a85a to 8f08e30 Compare February 2, 2023 19:38
@rg0now
Copy link
Copy Markdown
Contributor Author

rg0now commented Feb 2, 2023

But I am curious, which of the two approaches do you prefer?

Frankly, I would opt for this one, just because it is much less intrusive and more secure. The other PR requires some massive reorg all around the place plus it is still susceptible to eating up all CPUs. At some point we may consider implementing per-user quotas (see the note), which would fix the security issue at least, but I'm still uncomfortable with the massive change.

I've added a note to examples/README.md to point users to the new example and also applied some minor stylistic changes throughout.

@stv0g
Copy link
Copy Markdown
Member

stv0g commented Feb 4, 2023

Currently, the WASM pipeline fails as the new multihreaded examples as excluded all files from the package via build-tags.
I suggest you just add an empty file into the directory which is not excluded.

https://github.com/pion/turn/actions/runs/4078041150/jobs/7027770548

Copy link
Copy Markdown
Member

@stv0g stv0g left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with your assessment regarding the two readLoops PRs :)

Also great work on this one to keep it simple and concise 👍

I am just wondering, if we maybe want to include this not just as an example but into the package itself. Do you have an opinion on that?

I fear that this might kick-off a whole bunch of server variants. Each with its own turn.NewXXXServer() function and turn.XXXServerConfig type.

Thats something, I would like to avoid to keep it maintainable..

@stv0g stv0g added the enhancement New feature or request label Feb 4, 2023
@rg0now rg0now force-pushed the feature-udp-per-client-readloop-2 branch from 095a15d to 2f6db80 Compare February 21, 2023 17:04
@rg0now
Copy link
Copy Markdown
Contributor Author

rg0now commented Feb 21, 2023

Currently, the WASM pipeline fails as the new multihreaded examples as excluded all files from the package via build-tags.
I suggest you just add an empty file into the directory which is not excluded.

I wish I understood this but I can't. I guess the CI errs due to the new soft-dependency golang.org/x/sys, is this somehow incompatible with WASM (would make sense)? where should I put an empty file?

I've started to think about whether the whole hassle is worth the price: I mean, adding a new dependency (which is not even portable) just so that we can demonstrate a use case that may not be too interesting outside a small audience somehow seems wrong. Maybe an example is an overkill? Should I just post a gist and hope Google picks it up? Wdyt?

I am just wondering, if we maybe want to include this not just as an example but into the package itself. Do you have an opinion on that?
I fear that this might kick-off a whole bunch of server variants. Each with its own turn.NewXXXServer() function and turn.XXXServerConfig type.

I'm not sure we want to touch the API and/or turn.Server per se: it seems that we get reasonable performance without touching the code in any way, just by calling the existing API in a weird-ish way. Plus this is applicable only to UDP (TCP, TLS and DTLS spawn per-allocation readLoops), so I guess there's no need to add the specific turn.NewXXXServer() functions. Wdyt?

@stv0g
Copy link
Copy Markdown
Member

stv0g commented Feb 22, 2023

I'm not sure we want to touch the API and/or turn.Server per se

I agree. Lets merge it like this.

I wish I understood this but I can't. I guess the CI errs due to the new soft-dependency golang.org/x/sys, is this somehow incompatible with WASM (would make sense)?

Please try to add a //go:build !wasm to the main.go of your new example. I dont think we should attempt to build this for WASM

I've started to think about whether the whole hassle is worth the price: I mean, adding a new dependency (which is not even portable) just so that we can demonstrate a use case that may not be too interesting outside a small audience somehow seems wrong. Maybe an example is an overkill? Should I just post a gist and hope Google picks it up? Wdyt?

I think its worth the effort. 👍

@rg0now rg0now force-pushed the feature-udp-per-client-readloop-2 branch from 2f6db80 to 651120a Compare February 22, 2023 10:19
@rg0now
Copy link
Copy Markdown
Contributor Author

rg0now commented Feb 22, 2023

Please try to add a //go:build !wasm to the main.go of your new example. I dont think we should attempt to build this for WASM

Done. Thanks a lot for the review work, it's extremely useful, I learn a lot.

@stv0g
Copy link
Copy Markdown
Member

stv0g commented Feb 22, 2023

Seems like your last change is missing some entries in go.sum:

stv0g@ubuntu-linux-22-04-desktop:~/workspace/pion/turn$ go test ./...
go: golang.org/x/[email protected]: missing go.sum entry; to add it:
	go mod download golang.org/x/sys
go: downloading github.com/pion/transport/v2 v2.0.2
go: golang.org/x/[email protected]: missing go.sum entry; to add it:

Can you run go mod tidy and push again? Thanks a lot.

Add a demo to show how to use multiple sockets with the SO_REUSEPORT
socket option to scale TURN/UDP servers to multiple CPUs.
@rg0now rg0now force-pushed the feature-udp-per-client-readloop-2 branch from 651120a to 68141f5 Compare February 22, 2023 11:21
@rg0now
Copy link
Copy Markdown
Contributor Author

rg0now commented Feb 22, 2023

I'm terribly sorry, hope this version will be fine now.

@stv0g stv0g merged commit b0f8b53 into pion:master Feb 22, 2023
@stv0g
Copy link
Copy Markdown
Member

stv0g commented Feb 22, 2023

Thanks a lot :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Development

Successfully merging this pull request may close these issues.

2 participants