Improving Linux networking performance
The challenge
As network adapters get faster, the time between packets (i.e. the time the kernel has to process each packet) gets smaller. With current 10Gb adapters, there are 1,230ns between two 1538-byte packets. 40Gb networking cuts that time down significantly, to 307ns. Naturally, 100Gb exacerbates the problem, dropping the per-packet time to about 120ns; the interface, at this point, is processing 8.15 million packets per second. That does not leave a lot of time to figure out what to do with each packet.
So what do you do if you, like almost all of us, do not have a 100Gb adapter around to play with? You use a 10Gb adapter with small frames instead. The smallest Ethernet frame that can be sent is 84 bytes; on a 10Gb adapter, Jesper said, there are 67.2ns between minimally-sized packets. A system that can cope with that kind of load should be positioned to do something reasonable with 100Gb networking when it becomes available. But coping with that load is hard: on a 3GHz CPU, there are only about 200 CPU cycles available for the processing of each packet. That, Jesper noted, is not a lot.
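For readers who want to check the arithmetic, a small illustrative C program (not from the talk) reproduces the per-packet times and cycle budgets quoted above:

```c
#include <stdio.h>

/*
 * Per-packet time and CPU-cycle budget for a given link speed and frame
 * size.  The 1538- and 84-byte figures already include the Ethernet
 * preamble and inter-frame gap, as in the talk.
 */
static void budget(const char *label, double link_bps, double frame_bytes,
                   double cpu_hz)
{
    double ns_per_packet = frame_bytes * 8.0 / link_bps * 1e9;
    double cycles = ns_per_packet * cpu_hz / 1e9;
    double mpps = 1000.0 / ns_per_packet;   /* millions of packets/second */

    printf("%-24s %8.1f ns/packet %7.0f cycles %6.2f Mpps\n",
           label, ns_per_packet, cycles, mpps);
}

int main(void)
{
    double cpu_hz = 3e9;                    /* the talk's 3GHz CPU */

    budget("10Gb, 1538-byte frames", 10e9, 1538, cpu_hz);
    budget("40Gb, 1538-byte frames", 40e9, 1538, cpu_hz);
    budget("100Gb, 1538-byte frames", 100e9, 1538, cpu_hz);
    budget("10Gb, 84-byte frames", 10e9, 84, cpu_hz);
    return 0;
}
```

The 84-byte case works out to roughly 14.8M packets per second, a figure that comes up again below.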
The kernel has traditionally not done a great job with this kind of network-intensive workload. That has led to the existence of a number of out-of-tree networking implementations that bypass the kernel's network stack entirely. The demand for such systems indicates that the kernel is not using the hardware optimally; the out-of-tree implementations can drive adapters at full wire speed from a single CPU, which the mainline kernel is hard put to do.
The problem, Jesper said, is that the kernel developers have focused on scaling out to large numbers of cores. In the process, they have been able to hide regressions in per-core efficiency. The networking stack, as a result, works well for many workloads, but workloads that are especially latency-sensitive have suffered. The kernel, today, can only forward between 1M and 2M packets per core per second, while some of the bypass alternatives approach a rate of 15M packets per core per second.
Time budgets
If you are going to address this kind of problem, you have to take a hard look at the cost of every step in the processing of a packet. So, for example, a cache miss on Jesper's 3GHz processor takes about 32ns to resolve. It thus only takes two misses to wipe out the entire time budget for processing a packet. Given that a socket buffer ("SKB") occupies four cache lines on a 64-bit system and that much of the SKB is written during packet processing, the first part of the problem is apparent — four cache misses would consume far more than the time available.
Beyond that, using the x86 LOCK prefix for atomic operations takes about 8.25ns. In practice, that means that the shortest spinlock lock/unlock cycle takes a little over 16ns. So there is not room for a lot of locking within the time budget.
Then there is the cost of performing a system call. On a system with SELinux and auditing enabled, that cost is just over 75ns — over the time budget on its own. Disabling auditing and SELinux reduces the time required to just under 42ns, which is better, but that is still a big part of the time budget. There are ways of amortizing that cost over multiple packets; they include system calls like sendmmsg(), recvmmsg(), sendfile(), and splice(). In practice, he said, they do not work as well as he expected, but he did not get into why. From the audience, Christoph Lameter noted that latency-sensitive users tend to use the InfiniBand "IB verbs" mechanism.
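As an aside, the batching system calls are simple to use from user space; a minimal sendmmsg() example (not from the talk; the address and port are placeholders) that sends sixteen UDP packets with a single system call might look like this:

```c
#define _GNU_SOURCE             /* for sendmmsg() and struct mmsghdr */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#define BATCH 16                /* one system call covers sixteen packets */

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_in dst = {
        .sin_family = AF_INET,
        .sin_port = htons(9000),                     /* placeholder port */
    };
    inet_pton(AF_INET, "192.0.2.1", &dst.sin_addr);  /* placeholder address */

    char payload[BATCH][64];
    struct iovec iov[BATCH];
    struct mmsghdr msgs[BATCH];
    memset(msgs, 0, sizeof(msgs));

    for (int i = 0; i < BATCH; i++) {
        snprintf(payload[i], sizeof(payload[i]), "packet %d", i);
        iov[i].iov_base = payload[i];
        iov[i].iov_len = strlen(payload[i]);
        msgs[i].msg_hdr.msg_name = &dst;
        msgs[i].msg_hdr.msg_namelen = sizeof(dst);
        msgs[i].msg_hdr.msg_iov = &iov[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
    }

    /* One syscall submits the whole batch, so its fixed cost (42-75ns in
     * Jesper's measurements) is spread over BATCH packets. */
    int sent = sendmmsg(fd, msgs, BATCH, 0);
    if (sent < 0)
        perror("sendmmsg");
    else
        printf("sent %d packets in one system call\n", sent);

    close(fd);
    return 0;
}
```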
Given all of these costs, Jesper asked, how do the network-bypass solutions achieve higher performance? The key appears to be batching of operations, along with preallocation and prefetching of resources. These solutions keep work CPU-local and avoid locking. It is also important to shrink packet metadata and reduce the number of system calls. Faster, cache-optimal data structures also help. Of all of these techniques, batching of operations is the most important. A cost that is intolerable on a per-packet basis is easier to absorb if it is incurred once per dozens of packets. 16ns of locking per packet hurts; if sixteen packets are processed at once, that overhead drops to 1ns per packet.
Improving batching
So, unsurprisingly, Jesper's work has been focused on improving batching in the networking layer. It includes the TCP bulk transmission work that was covered here in October; see that article for details on how it works. In short, it is a mechanism for informing network drivers that there are more packets waiting for transmission, allowing the driver to delay expensive operations until all of those packets have been queued. With this work in place, his system can transmit 14.8M packets per second — at least if it's the same little packet sent over and over again.
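For the curious, the driver-visible half of this mechanism in kernels of this era is the skb->xmit_more flag. The sketch below is not taken from any real driver (the example_* names and types are invented), but it shows the intended pattern: defer the expensive doorbell write until the last packet of a batch.

```c
#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Sketch only: the example_* helpers are placeholders, not a real driver. */
static netdev_tx_t example_xmit(struct sk_buff *skb, struct net_device *dev)
{
    struct example_tx_ring *ring = example_select_ring(dev, skb);

    /* Queue the packet on the hardware descriptor ring as usual. */
    if (example_queue_descriptor(ring, skb))
        return NETDEV_TX_BUSY;

    /*
     * The expensive step is the MMIO "doorbell" write that tells the NIC
     * about new work.  If the stack has signaled that more packets are on
     * the way (skb->xmit_more), skip it and let the final packet of the
     * batch pay that cost once for everyone.
     */
    if (!skb->xmit_more ||
        netif_xmit_stopped(netdev_get_tx_queue(dev, skb->queue_mapping)))
        example_ring_doorbell(ring);

    return NETDEV_TX_OK;
}
```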
The tricky part, he said, is adding batching APIs to the networking stack without increasing the latency of the system. Latency and throughput must often be traded off against each other; here the objective is to optimize both. An especially hard trick to resist is speculative transmission delays — a bet that another packet is coming soon. Such tricks tend to improve benchmark results but are less useful for real-world workloads.
Batching can — and should — be done at multiple layers in the stack. So, for example, the queuing discipline ("qdisc") subsystem is a good place for batching; after all, delays are already happening as the result of queueing. In the best case, currently, the qdisc code requires six LOCK operations per packet — 48ns of pure locking overhead. The full cost of queuing a packet is 58-68ns, so locking accounts for the bulk of that time. Jesper has worked to add batching, spreading that cost over multiple packets, but that only works if there is actually a queue of packets.
The nominal fast path through the qdisc code happens when there is no queue; in such situations, packets can often be passed directly to the network interface and not queued at all. Currently, such packets incur the cost of all six LOCK operations. It should, he said, be possible to do better. A lockless qdisc subsystem could eliminate almost all the cost of queuing packets. Jesper has a test implementation to demonstrate what can be done; eliminating a minimum of 48ns of overhead, he said, is well worth doing.
While transmission performance now looks reasonably good, he said, receive processing could still use some improvement. A highly tuned setup can receive a maximum of about 6.5M packets per second — and that's when the packets are simply being dropped after reception. Some work on optimizing the receive path is underway, raising that maximum to just over 9M packets per second. But there is a problem with this benchmark: it doesn't show the costs of interaction with the memory-management subsystem.
Memory management
And that interaction, it turns out, is painful. The network stack's receive path, it seems, has some allocation patterns that do not bring out the best in the slab allocators. The receive code can allocate space for up to 64 packets at a time, while the transmit path can free packets in batches of up to 256. This pattern seems to put the SLUB allocator, in particular, into a relatively slow path. Jesper did some microbenchmarking and found that a single kmem_cache_alloc() call followed by kmem_cache_free() required about 19ns. But when 256 allocations and frees were done, that time increased to 40ns. In real-world use in the networking stack, though, where other things are being done as well, the allocation/free overhead grows even more, to 77ns — more than the time budget on its own.
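That allocation pattern is easy enough to reproduce; a rough kernel-module sketch along these lines (not Jesper's actual benchmark code) exercises the same batched allocate-then-free behavior:

```c
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/ktime.h>

#define BATCH 256                       /* transmit-path free batch size */

static void *objs[BATCH];

static int __init slab_pattern_init(void)
{
    struct kmem_cache *cache;
    ktime_t start, end;
    int i, n = 0;

    cache = kmem_cache_create("slab_pattern_obj", 256, 0, 0, NULL);
    if (!cache)
        return -ENOMEM;

    start = ktime_get();
    for (i = 0; i < BATCH; i++) {       /* allocate a whole batch... */
        objs[i] = kmem_cache_alloc(cache, GFP_ATOMIC);
        if (!objs[i])
            break;
        n++;
    }
    for (i = 0; i < n; i++)             /* ...then free it all at once */
        kmem_cache_free(cache, objs[i]);
    end = ktime_get();

    if (n)
        pr_info("slab_pattern: %lld ns per alloc+free pair (batch of %d)\n",
                ktime_to_ns(ktime_sub(end, start)) / n, n);

    kmem_cache_destroy(cache);
    return 0;
}

static void __exit slab_pattern_exit(void) { }

module_init(slab_pattern_init);
module_exit(slab_pattern_exit);
MODULE_LICENSE("GPL");
```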
Thus, Jesper concluded, there need to be either improvements to the memory-management code or some way of bypassing it altogether. To try the latter approach, he implemented a subsystem called qmempool; it does bulk allocation and free operations in a lockless manner. With qmempool, he was able to save 12ns in simple tests, and up to 40ns in packet forwarding tests. There are a number of techniques used in qmempool to make it faster, but the killer feature is the batching of operations.
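The qmempool code itself is not reproduced here, but the core idea of keeping work CPU-local and touching the underlying allocator only in batches can be illustrated with a sketch like the following (the pktpool names and layout are invented for illustration and are not the actual qmempool API):

```c
#include <linux/slab.h>
#include <linux/percpu.h>

#define PKTPOOL_BULK 64         /* objects moved per refill or flush */

struct pktpool_cpu {
    void *objs[PKTPOOL_BULK];   /* per-CPU cache: no locks needed */
    int count;
};

struct pktpool {
    struct kmem_cache *cache;
    struct pktpool_cpu __percpu *cpu;
};

static void *pktpool_alloc(struct pktpool *pool)
{
    struct pktpool_cpu *pc = get_cpu_ptr(pool->cpu);
    void *obj;

    if (!pc->count) {
        /* Cache empty: refill with a whole batch in one go. */
        while (pc->count < PKTPOOL_BULK) {
            obj = kmem_cache_alloc(pool->cache, GFP_ATOMIC);
            if (!obj)
                break;
            pc->objs[pc->count++] = obj;
        }
    }
    obj = pc->count ? pc->objs[--pc->count] : NULL;
    put_cpu_ptr(pool->cpu);
    return obj;
}

static void pktpool_free(struct pktpool *pool, void *obj)
{
    struct pktpool_cpu *pc = get_cpu_ptr(pool->cpu);

    if (pc->count == PKTPOOL_BULK) {
        /* Cache full: hand a whole batch back to the allocator at once. */
        while (pc->count)
            kmem_cache_free(pool->cache, pc->objs[--pc->count]);
    }
    pc->objs[pc->count++] = obj;
    put_cpu_ptr(pool->cpu);
}
```

Per-packet operations in this scheme only touch the per-CPU cache; the underlying allocator, with all of its locking, is visited once per PKTPOOL_BULK packets.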
Jesper wound down by saying that qmempool was implemented as a sort of provocation: he wanted to show what was possible and inspire the memory-management developers to do something about it. The response from the memory-management camp was covered in the next talk, which will be reported on separately.
[Your editor would like to thank linux.conf.au for funding his travel to the event.]
| Index entries for this article | |
|---|---|
| Kernel | Networking/Performance |
| Conference | linux.conf.au/2015 |
