The problem with prefetch
At the kernel level, performance often comes down to cache behavior. Memory references which must actually be satisfied by memory are extremely slow; good performance requires that needed data be in a CPU cache much of the time. The kernel goes out of its way to use cache-hot memory when possible; there has also been some significant work put into tasks like reordering structures so that fields that are commonly accessed together are found in the same cache line. As a general rule, these optimizations have helped performance in measurable ways.
Cache misses are often unavoidable, but it is sometimes possible to attempt to reduce their cost. If the kernel knows that it will be accessing memory at a particular location in the near future, it can use a CPU-specific prefetch instruction to begin the process of bringing the data into cache. This instruction is made available to kernel code via the generic prefetch() function; developers have made heavy use of it. Consider, for example, this commonly-used macro from <linux/list.h>:
#define list_for_each(pos, head) \
for (pos = (head)->next; prefetch(pos->next), pos != (head); \
pos = pos->next)
This macro (in a number of variants) is used to traverse a linked list. The idea behind the prefetch() call here is to begin the process of fetching the next entry in the list while the current entry is being processed. Hopefully by the time the next loop iteration starts, the data will have arrived - or, at least, it will be in transit. Linked lists are known to be cache-unfriendly data structures, so it makes sense that this type of optimization can help to speed things up.
Except that it doesn't - at least, not on x86 processors.
Andi Kleen may have been the first to question this optimization when he tried to remove the prefetches from list operations last September. His patch generated little discussion, though, and apparently fell through the cracks. Recently, Linus did some profiling on one of his favorite workloads (kernel builds) and found that the prefetch instructions were at the top of the ranking. Performing the prefetching cost time, and that time was not being repaid through better cache behavior; simply removing the prefetch() calls made the build go faster.
Ingo Molnar, being Ingo, jumped in and did a week's worth of research in an hour or so. Using perf and a slightly tweaked kernel, he was able to verify that using the prefetch instructions caused a performance loss of about 0.5%. That is not a headline-inspiring performance regression, certainly, but this is an optimization which was supposed to make things go faster. Clearly something is not working the way that people thought it was.
Linus pointed out one problem at the outset: his test involved a lot of traversals of singly-linked hlist hash table lists. Those lists tend to be short, so there is not much scope for prefetching; in fact, much of the time, the only prefetch attempted used the null pointer that indicates the end of the list. Prefetching with a null pointer seems silly, but it's also costly: evidently every such prefetch on x86 machines (and, seemingly, ARM as well) causes a translation lookaside buffer miss and a pipeline stall. Ingo measured this effect and came to the conclusion that each null prefetch cost about 20 processor cycles.
Clearly, null prefetches are a bad idea. It would be nice if the CPU would simply ignore attempts to prefetch using a null pointer, but that's not how things are, so, as is often the case, one ends up trying to solve the problem in software instead. Ingo did some testing with a version of prefetch() which would only issue prefetch instructions for non-null pointers; that version did, indeed, perform better. But it still performed measurably worse than simply skipping the prefetching altogether.
CPU designers are well aware of the cost of waiting for memory; they have put a great deal of effort into minimizing that cost whenever possible. Among other things, contemporary CPUs have their own memory prefetch units which attempt to predict which memory will be wanted next and start the process of retrieving it early. One thing Ingo noticed in his tests is that, even without any software prefetch operations, the number of prefetch operations run by the CPU was about the same. So the hardware prefetcher was busy during this time - and it was doing a better job than the software at deciding what to fetch. Throwing explicit prefetch operations into the mix, it seems, just had the effect of interfering with what the hardware was trying to do.
Ingo summarized his results this way:
One immediate outcome from this work is that, for 2.6.40 (or whatever it ends up being called), the prefetch() calls have been removed from linked list, hlist, and sk_buff list traversal operations - just like Andi Kleen tried to do in September. Chances are good that other prefetch operations will be removed as well. There will still be a place for prefetch() in the kernel, but only in specific situations where it can be clearly shown to help performance. As with other low-level optimizations (likely() comes to mind), tossing in a prefetch because it seems like it might help is often not the right thing to do.
One other lesson to be found in this experience is that numbers matter.
Andi was right when he wanted to remove these operations, but he did not
succeed in getting his patch merged. One could come up with a number of
reasons why things went differently this time, starting with the fact that
Linus took an interest in the problem. But it's also true that
performance-oriented patches really need to come with numbers to show that
they are achieving the desired effect; had Andi taken the time to quantify
the impact of his change, he would have had a stronger case for merging
it.
| Index entries for this article | |
|---|---|
| Kernel | Prefetch |
