The problem with prefetch

By Jonathan Corbet
May 24, 2011

Over time, software developers tend to learn that micro-optimization efforts are generally not worthwhile, especially in the absence of hard data pointing out a specific problem. Performance problems are often not where we think they are, so undirected attempts to tweak things to make them go faster can be entirely ineffective. Or, indeed, they can make things worse. That is a lesson that the kernel developers have just relearned.

At the kernel level, performance often comes down to cache behavior. Memory references which must actually be satisfied by memory are extremely slow; good performance requires that needed data be in a CPU cache much of the time. The kernel goes out of its way to use cache-hot memory when possible; there has also been some significant work put into tasks like reordering structures so that fields that are commonly accessed together are found in the same cache line. As a general rule, these optimizations have helped performance in measurable ways.

Cache misses are often unavoidable, but it is sometimes possible to attempt to reduce their cost. If the kernel knows that it will be accessing memory at a particular location in the near future, it can use a CPU-specific prefetch instruction to begin the process of bringing the data into cache. This instruction is made available to kernel code via the generic prefetch() function; developers have made heavy use of it. Consider, for example, this commonly-used macro from <linux/list.h>:

    #define list_for_each(pos, head) \
	for (pos = (head)->next; prefetch(pos->next), pos != (head); \
            pos = pos->next)

This macro (in a number of variants) is used to traverse a linked list. The idea behind the prefetch() call here is to begin the process of fetching the next entry in the list while the current entry is being processed. Hopefully by the time the next loop iteration starts, the data will have arrived - or, at least, it will be in transit. Linked lists are known to be cache-unfriendly data structures, so it makes sense that this type of optimization can help to speed things up.

Except that it doesn't - at least, not on x86 processors.

Andi Kleen may have been the first to question this optimization when he tried to remove the prefetches from list operations last September. His patch generated little discussion, though, and apparently fell through the cracks. Recently, Linus did some profiling on one of his favorite workloads (kernel builds) and found that the prefetch instructions were at the top of the ranking. Performing the prefetching cost time, and that time was not being repaid through better cache behavior; simply removing the prefetch() calls made the build go faster.

Ingo Molnar, being Ingo, jumped in and did a week's worth of research in an hour or so. Using perf and a slightly tweaked kernel, he was able to verify that using the prefetch instructions caused a performance loss of about 0.5%. That is not a headline-inspiring performance regression, certainly, but this is an optimization which was supposed to make things go faster. Clearly something is not working the way that people thought it was.

Linus pointed out one problem at the outset: his test involved a lot of traversals of singly-linked hlist hash table lists. Those lists tend to be short, so there is not much scope for prefetching; in fact, much of the time, the only prefetch attempted used the null pointer that indicates the end of the list. Prefetching with a null pointer seems silly, but it's also costly: evidently every such prefetch on x86 machines (and, seemingly, ARM as well) causes a translation lookaside buffer miss and a pipeline stall. Ingo measured this effect and came to the conclusion that each null prefetch cost about 20 processor cycles.

Clearly, null prefetches are a bad idea. It would be nice if the CPU would simply ignore attempts to prefetch using a null pointer, but that's not how things are, so, as is often the case, one ends up trying to solve the problem in software instead. Ingo did some testing with a version of prefetch() which would only issue prefetch instructions for non-null pointers; that version did, indeed, perform better. But it still performed measurably worse than simply skipping the prefetching altogether.

CPU designers are well aware of the cost of waiting for memory; they have put a great deal of effort into minimizing that cost whenever possible. Among other things, contemporary CPUs have their own memory prefetch units which attempt to predict which memory will be wanted next and start the process of retrieving it early. One thing Ingo noticed in his tests is that, even without any software prefetch operations, the number of prefetch operations run by the CPU was about the same. So the hardware prefetcher was busy during this time - and it was doing a better job than the software at deciding what to fetch. Throwing explicit prefetch operations into the mix, it seems, just had the effect of interfering with what the hardware was trying to do.

Ingo summarized his results this way:

So the conclusion is: prefetches are absolutely toxic, even if the NULL ones are excluded.

One immediate outcome from this work is that, for 2.6.40 (or whatever it ends up being called), the prefetch() calls have been removed from linked list, hlist, and sk_buff list traversal operations - just like Andi Kleen tried to do in September. Chances are good that other prefetch operations will be removed as well. There will still be a place for prefetch() in the kernel, but only in specific situations where it can be clearly shown to help performance. As with other low-level optimizations (likely() comes to mind), tossing in a prefetch because it seems like it might help is often not the right thing to do.

One other lesson to be found in this experience is that numbers matter. Andi was right when he wanted to remove these operations, but he did not succeed in getting his patch merged. One could come up with a number of reasons why things went differently this time, starting with the fact that Linus took an interest in the problem. But it's also true that performance-oriented patches really need to come with numbers to show that they are achieving the desired effect; had Andi taken the time to quantify the impact of his change, he would have had a stronger case for merging it.

Index entries for this article
Kernel	Prefetch

prefetch and buffer bloat

Posted May 24, 2011 16:15 UTC (Tue) by grunch (subscriber, #16603) [Link] (9 responses)

This seems like a microscopic analog to the buffer bloat situation: trying to improve performance in a way that ultimately negates improvements the underlying handlers (hardware in this case, TCP in the case of buffer bloat) can already handle.

I wonder if it's a similar cause.

Thanks for the details Jonathan!

prefetch and buffer bloat

Posted May 24, 2011 16:22 UTC (Tue) by martinfick (subscriber, #4455) [Link] (8 responses)

Not really. The buffer bloat problem is due to the tradeoff between a marketing performance improvement in throughput versus latency which is rarely marketed or even considered (until now).

prefetch and buffer bloat

Posted May 24, 2011 16:50 UTC (Tue) by nye (subscriber, #51576) [Link] (2 responses)

I think that's an overly cynical viewpoint. It's likely that the great majority of cases of too large buffers are simply due to developers thinking "I don't know what's an appropriate value; let's just choose something big because it can't hurt and might help sometimes".

prefetch and buffer bloat

Posted May 24, 2011 17:11 UTC (Tue) by martinfick (subscriber, #4455) [Link]

While the marketing part might be cynical, I believe the tradeoff is important to recognize. In the case of bad prefetch, there is no tradeoff that I can see, it simply degrades the performance.

prefetch and buffer bloat

Posted Jun 2, 2011 21:54 UTC (Thu) by jch (guest, #51929) [Link]

Sorry if this is off topic for this discussion, but bufferbloat is not just a political issue -- it's a difficult technical one. Just reducing the size of buffers won't do, since the right amount of buffering depends on a lot of factors such as throughput, RTT, the transport- and application-layer protocol being used, etc.

Bufferbloat is not about reducing the amount of buffers in routers; it is about designing algorithms to make sure that routers only use as much of their buffers as necessary, and getting the router vendors to deploy such algorithms.

--jch

prefetch and buffer bloat

Posted May 24, 2011 17:27 UTC (Tue) by nicooo (guest, #69134) [Link] (3 responses)

I doubt it's a marketing issue. Having their current clients telling their potential clients that the service sucks can't be good for business.

prefetch and buffer bloat

Posted May 24, 2011 22:58 UTC (Tue) by Lennie (subscriber, #49641) [Link] (2 responses)

The problem is with the numbers that come out of the tests the manufactures do to show how great their hardware performs. Those get better with bigger buffers.

So again you have manufactures doing non-real-world test (which might have been a good test a long time ago) for marketing purposes and optimising for that case.

prefetch and buffer bloat

Posted May 31, 2011 20:43 UTC (Tue) by marcH (subscriber, #57642) [Link] (1 responses)

> The problem is with the numbers that come out of the tests the manufactures do to show how great their hardware performs. Those get better with bigger buffers.

... up to a size after which throughput does not get better. Yet we can sometimes see buffer sizes way past this point (e.g. 1 second), which proves that some manufacturers do not bother trying to optimize anything at all.

prefetch and buffer bloat

Posted May 31, 2011 21:05 UTC (Tue) by dlang (guest, #313) [Link]

something to remember is that memory comes in standard sizes. it's not always possible/reasonable to put less ram in the device.

since they have the ram anyway, and buffers that are too small can cause problems. the logic then follows 'why not just use the ram in the device as a buffer'

this causes other problems, but these other problems were not well described until recently.

prefetch and buffer bloat

Posted May 24, 2011 17:53 UTC (Tue) by clugstj (subscriber, #4020) [Link]

Yes, it is marketing - whether from the marketing department or development engineering. You can't sell someone a faster ethernet card unless you can demonstrate higher throughput with no dropped packets, and the easiest way to do that is to increase the number of buffers.

The problem with prefetch on Old Processors (e.g. P3s)

Posted May 24, 2011 18:13 UTC (Tue) by ds2horner (subscriber, #13438) [Link] (1 responses)

Even on old Pentium IIIs removing the prefetch on the hash list is likely a win, however for these old machines that many users still have a need to continue to use there are likely few who will make the effort to determine if the prefetch is still a win for this old hardware.

Lack of hardware support for providing the metrics (like the counters used for the new hardware), and the long kernel compile times make it quite unlikely that those who would benefit from using prefetch will be able to provide the compelling statistics sufficient to have x86 family specific prefetch optimizations retained in the kernel.

And the majority of kernel developers with their powerful hardware have moved on...

I do appreciate that there are some (I did not record or remember names) who do regression testing on the old hardware. Perhaps one of these heroes will pick up the challenge.

As much as software prefetch has now fallen out of favour, I (perhaps naively) expected that, in the Pentium IIIs era, tests were run that demonstrated an improvement on that hardware.

hopefully Andi Kleen's approach will be accepted now.

Posted May 24, 2011 21:37 UTC (Tue) by ds2horner (subscriber, #13438) [Link]

I reread the article on Andi Kleen's patch.

His approach would leave the optimizations in place for those CPUs that could benefit; so K7 and others would have the option.

Thanks for the foresight Andi.

The problem with prefetch

Posted May 24, 2011 19:07 UTC (Tue) by alvieboy (guest, #51617) [Link] (12 responses)

Truth is we don't have faster computers due to economical reasons. If we did have proper RAM (SRAM) we would not need cache mostly (unless to use as FIFO, due to signal propagation delays).

Cache is a very complex thing - actually cache + MMU + branchpredictor + multimaster buses will eventually eat most of your chip, space-wise.

Now, to the subject:

Prefetching is useful when you actually know you'll be doing sequential accesses (still only if you do it on background). This is almost never true for the data cache, unless you're doing block copies. Unfortunately the way DMA engines are designed do not allow for easy off-processor copies, because set-up, check-end and tear down of the DMA engine cost too much (DMA can also work memory <-> memory, not only device <-> memory). Now, the cache controller cannot (not that I know of) fetch a cache line while other line is being fetched (DDR does not have multiple ports, and I don't think you can invalidate a cache line while it's being fetched). This means that you have to trust logic inside your cache manager, and *eventually* give it some hints of what you are trying to do. Technically speaking these "hints" may exist on some buses, like AMBA and Wishbone, so that latencies are mitigated by sequentially transferring large chunks. I don't know x86 internals enough to say if CPU<->cache has also a similar approach.

The biggest problem is the pipeline stalls while cache is filling up/writing back. I believe you can still use the pipeline + units while cache is being filled up, but again this is so slow (latency and speed-wise) compared to the processor speed that you'll eventually will need to access memory again, thus halting the processing.

CPU Chips could be 4x smaller if we had proper RAM (faster and multiport). This is the truth. Perhaps multidimensional chips will bring us this, but I sincerely doubt it.

Alvie

The problem with prefetch

Posted May 24, 2011 21:11 UTC (Tue) by dgm (subscriber, #49227) [Link]

Correct me if I'm wrong, but I think that fast SRAM is not only expensive, but also has to be very close to the processor core (ideally on the same die) to be effective.
Expensive and fast RAM close to the processor... hummm... isn't that the cache?

The problem with prefetch

Posted May 24, 2011 21:25 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (10 responses)

SRAM is not only expensive, it's also quite power-intensive. You'd have to cool it with liquid nitrogen.

That's why it's not used even on supercomputers.

The problem with prefetch

Posted May 24, 2011 23:48 UTC (Tue) by pphaneuf (guest, #23480) [Link] (5 responses)

The NEC SX-4 had 16 gigabytes of SRAM as its main memory. It didn't really have any CPU cache (only something like 32KB of instruction cache). That was a nice way of not having to deal with cache coherence issues in an SMP system.

It also had a 256 bytes wide memory bus (compared to the typical 64 bits).

Serious hardware, that. :-)

The problem with prefetch

Posted May 25, 2011 10:03 UTC (Wed) by kruemelmo (guest, #8279) [Link] (4 responses)

Now tell us something about power consumption as well please.

The problem with prefetch

Posted May 25, 2011 10:23 UTC (Wed) by dgm (subscriber, #49227) [Link]

And some performance numbers would also be appreciated.

The problem with prefetch

Posted May 25, 2011 11:36 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

http://www.ai.mit.edu/projects/aries/papers/vector/hammon...

Power consumption is 123KVA for 32-core version, performance is around 2GFLOPS per core for vector operations.

The problem with prefetch

Posted May 25, 2011 19:50 UTC (Wed) by wazoox (subscriber, #69624) [Link]

Impressive machine, not even puny by today's standard...

The problem with prefetch

Posted May 25, 2011 22:05 UTC (Wed) by pphaneuf (guest, #23480) [Link]

Yeah, I don't know about the power consumption, but while it would put out a good deal of heat, the SX-4 was the first of the SX series to be air-cooled. I'm not sure if the SX-3 used SRAM or DRAM (it was before my time), but that one was water-cooled.

While the vector performance was amazing, it was pretty sluggish for scalar stuff, so we didn't even use it for compiling, it was too slow.

Not saying that this is the end-all, be-all (the SX-5 traded in the SRAM for DRAM, and in exchange got 128 gigabytes of it, which was a whole lot in 1999 or 2000!), but just pointing out that it's been used on supercomputers (and I've used them), and it didn't require liquid nitrogen.

Here is a photo of one of the four SX-5 I helped look after.

The problem with prefetch

Posted May 25, 2011 11:49 UTC (Wed) by nye (subscriber, #51576) [Link] (3 responses)

>SRAM is not only expensive, it's also quite power-intensive. You'd have to cool it with liquid nitrogen

Wikipedia says "SRAM is more expensive, but faster and significantly less power hungry (especially idle) than DRAM. It is therefore used where either bandwidth or low power, or both, are principal considerations". This is also what they taught in my degree course; are you certain you're not confused?

The problem with prefetch

Posted May 25, 2011 12:58 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

No. SRAM uses 6 transistors, and in idle mode continuously seeps energy. That's OK if you have a small cache buffer but when you have gigabytes of SRAM it adds up very quickly.

DRAM uses capacitors and only needs to be refreshed from time to time.

Our embedded guys tell me that SRAM also is quite power-hungry during reads and writes, so high-frequency SRAMs consume significantly more power than DRAM. Sometimes very significantly more power.

The problem with prefetch

Posted May 25, 2011 13:49 UTC (Wed) by nye (subscriber, #51576) [Link]

>No. SRAM uses 6 transistors, and in idle mode continuously seeps energy. That's OK if you have a small cache buffer but when you have gigabytes of SRAM it adds up very quickly.

> DRAM uses capacitors and only needs to be refreshed from time to time.

All of the resources I can find describing SRAM state that the power used continuously while idle is trivial compared to the power needed to constantly refresh DRAM, and hence it's only during heavy utilisation that its power consumption can get *up to* that of DRAM.

>Our embedded guys tell me that SRAM also is quite power-hungry during reads and writes, so high-frequency SRAMs consume significantly more power than DRAM. Sometimes very significantly more power

I'm sure what you're saying is true in your case, but how do you reconcile that with the fact that every other comparison between the two disagrees?

The only idea I can come up with is that you're comparing SRAM running *much faster* than DRAM so it's an apples-to-oranges comparison(?)

The problem with prefetch

Posted May 25, 2011 16:38 UTC (Wed) by bronson (subscriber, #4806) [Link]

> SRAM uses 6 transistors

Yep.

> and in idle mode continuously seeps energy

Not if made out of CMOS (and these days everything is CMOS). At steady state the only power loss in SRAM is gate and substrate leakage. Negligible.

> DRAM uses capacitors and only needs to be refreshed from time to time.

Yeah, 50 times a second. It adds up to quite a bit of power. Plus DRAMs have all the substrate leakage of the SRAM.

I'm pretty sure you're comparing supercomputer SRAMs from the early 80s, clocked to the limits of their lives, with low power modern DRAMs. If you compare equivalent parts you'll see that DRAMs are smaller but burn a lot more power.

The problem with prefetch

Posted May 24, 2011 21:18 UTC (Tue) by oak (guest, #2786) [Link] (5 responses)

"As with other low-level optimizations (likely() comes to mind)"

-> Is there an article about downsides of likely() usage?

likely()

Posted May 24, 2011 21:55 UTC (Tue) by corbet (editor, #1) [Link] (4 responses)

A quick look in the LWN kernel index turns up an article from 2004, an article from 2006, and one from last December.

likely()

Posted May 25, 2011 13:10 UTC (Wed) by oak (guest, #2786) [Link] (3 responses)

Thanks! So the issue with likely() was just people putting them to wrong places instead of un/likely() itself potentially slowing (things down that was half the issue with the explicit prefetch usage).

I typically use unlikely() in my code just to annotate ifs in error check&logging macros. If errors aren't unlikely, I think the performance of un/likely() is the least of my issues...

likely()

Posted May 25, 2011 15:30 UTC (Wed) by SLi (subscriber, #53131) [Link] (2 responses)

I have also seen in user-space code that using the GCC equivalent of likely() in branches where I knew by profiling were much more likely to be either taken or not taken actually slows down the code. I don't quite understand the reason for this. Then in other cases it does indeed cause a speedup. When fine-tuning performance I have had to try it branch by branch to get the best results, and even then I'm not sure it generalizes to other processor models.

likely()

Posted May 30, 2011 15:21 UTC (Mon) by berkus (guest, #74327) [Link]

-fprofile-arcs first!

likely()

Posted Jun 1, 2011 4:20 UTC (Wed) by AdamRichter (guest, #11129) [Link]

Sometimes you want to minimize latency for the less commonly used but more important branch, such as in almost any polling loop.

The problem with branch prediction

Posted May 24, 2011 21:43 UTC (Tue) by davecb (subscriber, #1574) [Link] (4 responses)

I saw a similar issue with prefetch and branch prediction back when I was doing a lot of SPARC work.

Branch prediction gave us a bit of extra performance with a few code bases, but the older and better the code, they less we saw. My favorite example is Samba, so a Smarter Colleague[tm] and I looked at what was actually happening. Turns out most branches were around either short runs of legitimately conditional code or debug macros. In those cases it didn't matter if we set the prediction to correctly predict we'd branch around. The branch was very often far enough we hit a different i-cache line. Since we didn't have a way of hinting what line we'd hit, we'd slow down whenever it wasn't trivial-to-predict straight-line code.

The better and older the code, the less we would get the next i-cache line sitting waiting for us, and the slower we'd run. Grungy straight-line FORTRAN benefited fine.

I don't recollect ever seeing an actual slowdown, but we rarely could see the predicted benefits from branch prediction.

I'd venture as much as a five cent bet we'll see the same with the intel architecture.

--dave

I think it's about some particular case of branch prediction...

Posted May 25, 2011 6:10 UTC (Wed) by khim (subscriber, #9252) [Link] (1 responses)

To fill the pipeline on contemporary CPU you need 30-50 instructions in flight. Without branch prediction it's just impossible to do. If you disable branch prediction on contemporary CPU the slowdown is crippling. Sadly only Intel engineers can give you numbers (because there are no way to disable it on retail CPU) - and they are not telling.

I think it's about some particular case of branch prediction...

Posted May 25, 2011 10:58 UTC (Wed) by mingo (subscriber, #31122) [Link]

It's relatively easy to measure the cost of branch misses in certain cases, such as using 'perf stat --repeat N' (the branch miss rate will be measured by default) and a testcase that uses a pseudo-RNG so it can run the same workload random and non-random and comparing the two.

And yes, missing branches is crippling to performance: a 3% branch miss rate can cause a 5% total execution slowdown and a 20% percent miss rate can already double the runtime of a workload. (!)

The problem with outguessing the CPU

Posted May 25, 2011 7:16 UTC (Wed) by alex (subscriber, #1355) [Link]

I have seen prefetch help on some architectures. When doing DBT stuff we would often look for places we could arrange the code to make it as efficient as possible. In the case of Itanium prefetch was a definite win as the architecture was structured to leave the hard stuff to the compiler. On x86 our experiments with instruction re-ordering and prefetch generally didn't yield much at all. The main difference being the x86 expends an awful lot of silicon in logic that attempts to predict all this behaviour for you. It's pretty good at it's job as well given how hard it was for us to squeeze extra out despite having a much better view of how the code was running than a compiler usually has.

The problem with branch prediction

Posted May 29, 2011 21:50 UTC (Sun) by giraffedata (guest, #1954) [Link]

Turns out most branches were around either short runs of legitimately conditional code or debug macros. In those cases it didn't matter if we set the prediction to correctly predict we'd branch around.

Why not? I can see there might not be any prefetching advantage because you're branching to something that is already in cache, but you can still do a lot of other execution of the instructions while still working on a prior one.

The branch was very often far enough we hit a different i-cache line. Since we didn't have a way of hinting what line we'd hit, ...

The line you'd hit is completely determined by the target in the branch instruction, isn't it?

The problem with prefetch

Posted May 25, 2011 6:16 UTC (Wed) by bronson (subscriber, #4806) [Link] (3 responses)

Reminds me of the register keyword. For a few years programmers put 'register' on just about any int that was used more than once... Turns out that in almost every case that was slowing things down. Compilers tend to chose better register variables than programmers.

The problem with prefetch

Posted May 25, 2011 21:18 UTC (Wed) by vonbrand (subscriber, #4458) [Link]

Somebody told me there are three kinds of C compilers: Dumb ones (they just disregard register), smart ones (they heed register if possible) and very smart ones (they asign registers better than whatever the programmer could hint at, and just overlook register). Current compilers are mostly in the third group.

The problem with prefetch

Posted Jun 2, 2011 13:17 UTC (Thu) by marcH (subscriber, #57642) [Link] (1 responses)

I find a _reasonable_ use of "register" still useful for programmer to programmer communication anyway. It seldom hurts to express a (good) intent.

The problem with prefetch

Posted Jun 7, 2011 12:04 UTC (Tue) by etienne (guest, #25256) [Link]

Using "register" also means that the address of that number shall not be taken, so should have consequences on aliasing optimisation (you have a warning if modifying that code and take address of that variable, maybe this would have big difference in execution speed).

The problem with prefetch

Posted May 25, 2011 13:00 UTC (Wed) by rilder (guest, #59804) [Link] (1 responses)

Fair enough. However, I have a question regarding this. The whole change depends on hardware branch predictors which are not constant across hardware. I am not sure how well the branch predictors were pre-Nehalem/pre-Sandybridge, so if new kernels are used on slightly older hardware, won't they suffer from lack of both software/hardware prefetch hints ? I guess they could have made this a CONFIG_xxx option but may be that was infeasible/cluttering the code further.

Yes, will older processors be further penalized?

Posted May 25, 2011 16:26 UTC (Wed) by ds2horner (subscriber, #13438) [Link]

I believe your question more succinct than my attempt to raise the issue ( above).

In Andi Kleen's patch (LWN article linked to in the main article), he attempted to have the prefetch remain for list processing on CPUs (namely the K7 family) that would benefit from it.

His approach was to change the prefetch calls to list_prefetch calls and make the list_prefetch a no-op on most rchitectures (and mapping to prefetch via CONFIG_LIST_PREFETCH for MK7 only (presumably with more to be added if they would benefit).

But my main question was, other than the K7 where there is historical "evidence" that this CPU benefited, who would now do the justification for other older processors (like the P3s)?

The problem with prefetch

Posted May 25, 2011 17:57 UTC (Wed) by branden (guest, #7029) [Link]

> One immediate outcome from this work is that, for 2.6.40 (or whatever it ends up being called), the prefetch() calls have been removed from linked list, hlist, and sk_buff list traversal operations - just like Andi Kleen tried to do in September.

"Sometimes, getting your patch in is just a matter of waiting for somebody else to reimplement it." -- Jonathan Corbet, https://lwn.net/Articles/51615/

The problem with prefetch

Posted May 25, 2011 23:13 UTC (Wed) by dashesy (guest, #74652) [Link]

I started reading kernel code from the "list.h", where I saw some beautiful code, and constructs (i.e. container_of), and also the prefetch mechanism.
So it was useful for me at least :)

The problem with prefetch

Posted May 26, 2011 0:49 UTC (Thu) by cesarb (subscriber, #6266) [Link]

I think I am seeing a common thread between this article and the recent undefined behavior article.

The other article: the compiler will do things you did not expect.

This article: the hardware will do things you did not expect.

HW prefetcher is smarter than you think

Posted May 26, 2011 1:22 UTC (Thu) by csd (subscriber, #66784) [Link]

Kudos to the HW designers. A few years back when the first Opterons came out, I did a lot of perf comparisons of code using 'prefetch' vs not using it, using various strides of walks forwards and backwards in arrays. If I recall the results correctly, only backwards walks with varying offsets would benefit from a 'prefetch' being added (not greatly though), while in all other cases having manual prefetch either made no difference or was slower than the leave-it-to-hardware-prefetch case.
Needless to say, we discarded any notion of using prefetch at the time.

The problem with prefetch

Posted May 26, 2011 2:26 UTC (Thu) by darthscsi (guest, #8111) [Link]

Prefetch helps mainly if you can issue far enough in advance of a reference that the reference sees a noticeable reduction in miss penalty (Likewise, prefetching likely hits wastes issue slots). In this code, maybe 3 cpu cycles (generously, but could be measured) are spent between the prefetch and the reference. This is insignificant compared to an L2 cache miss.

There is a large literature on how far in advance prefetches need to be to do any good.

The problem with prefetch

Posted May 26, 2011 16:38 UTC (Thu) by RogerOdle (guest, #60791) [Link]

Is there information about why this is happening? I do a lot of embedded work and cache control is a big deal. Are there any metrics showing the rates of cache stalls and how these are effecting the measurements?

I have not used Linux extensively in embedded in the past but the recent changes in the 2.6.39 kernel that bring more real time support make linux even more attractive in this area. One thing that commercial RTOS's allow is for applications to lock portions of the cache to hold highly reused code like DSP algorithms. I have not looked into how this is done in Linux or if it can be done at all. But partitioning control of the cache is an important feature in a small set of performance sensitive applications.

It has not been possible to use Linux for some of the applications I have been involved with in the past because of latency issues but Linux is constantly changing and 2.6.39 opens up possibilities that were out of reach before.

The problem with prefetch

Posted Jun 6, 2011 6:52 UTC (Mon) by hyoshiok (guest, #75469) [Link]

prefetch makes cache pollution. it is well known issue.

The problem with prefetch

Posted Jun 16, 2011 8:32 UTC (Thu) by tcucinotta (guest, #69261) [Link]

> Ingo summarized his results this way:
>
> So the conclusion is: prefetches are absolutely toxic,
> even if the NULL ones are excluded.

If I understand well, the main source of the "toxicity" here is the lists being (most of the cases very) short in the case of their usage within hash tables. So, if one wanted to keep some highly reusable/useful helper macro(s) to iterate items, perhaps it could be worth to have 2 versions of them, one for likely long lists, the other one for likely short lists.
However, I also suspect that, whilst for relatively "empty" hash tables those lists will likely contain only 1 item, for highly "full" tables those lists can easily become more and more crowded, so the developer's hint (likely short vs likely long) would easily be workload/scenario-dependent.
So, the final question would be whether there's a scenario to be considered "more likely" (or worth to be optimized) by developers than others.
Just my 2 cents.

You just need to know and remember in your local brain cache this prefetch problem

Posted Jul 12, 2017 10:16 UTC (Wed) by kazan417 (guest, #117591) [Link]

Thanks for sharing this useful information!
Again unobvious tecchnology behavior, which you just need to know and remember in your local brain cache.
And we have many of them(unobvious behavior) every day in modern technology, because now they are super complex