Denial of service via hash collisions
Man pages (or perldoc pages in this case) are not the usual places to look for security holes, but that is apparently where two researchers found a whole class of problems for web application frameworks (and the languages underlying them). The problem concerns dictionaries (aka hashes, depending on the language) and was fixed in Perl in 2003, but the need for a fix evidently wasn't clear to other high-level/scripting language developers. So, at the recently held Chaos Communication Congress, Julian Wälde and Alexander Klink were able to report efficient denial-of-service attacks against PHP, Python, Java, Ruby, and the V8 JavaScript engine via some of the web frameworks that use those languages.
The "perlsec" perldoc page has a section on "Algorithmic Complexity Attacks" that explains the problem that was fixed in Perl and is (or was) present in other languages. Dictionaries are data structures offered by some languages to store key/value pairs; they are typically implemented as a hash table indexed by a hash of the key. Unlike a cryptographic hash, the hash functions used for hash tables are meant to be efficient in terms of CPU usage rather than collision-resistant. But if the same hash function is used for every installation and invocation of a given language, one can fairly easily calculate a set of dictionary keys that collide. That's where the problem starts.
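To see why such collisions are easy to precompute, consider a toy DJB2-style rolling hash (a stand-in for the fixed, unseeded string hashes these languages used; it is not the exact function any of them employs). Two equal-length blocks that contribute the same partial value can be concatenated in any order, yielding exponentially many colliding keys:

```python
import itertools

def toy_hash(s, h=5381):
    """DJB2-style rolling hash (toy stand-in for a language's fixed,
    unseeded string hash): h = h*33 + ord(ch) at each step."""
    for ch in s:
        h = (h * 33 + ord(ch)) & 0xFFFFFFFF
    return h

# The two-character blocks "Ez" and "FY" contribute the same partial
# value (69*33 + 122 == 70*33 + 89), so every same-length string built
# from these blocks hashes identically -- 2**8 = 256 colliding keys here.
keys = ["".join(blocks)
        for blocks in itertools.product(("Ez", "FY"), repeat=8)]
assert len(set(keys)) == 256
assert len({toy_hash(k) for k in keys}) == 1
```

Because the hash never changes, this computation can be done entirely offline, once, and the resulting keys reused against any installation.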
A hash table needs to have some way of dealing with collisions, where two different key values hash to the same value. One common way is to have each hash "bucket" be a linked list of all the dictionary entries that hash to the bucket index. In most cases, that works just fine as the number of collisions is typically small, so the slight inefficiency of traversing the list for insertion, deletion, and searching is in the noise. But if an attacker can control the keys that are being inserted, they can arrange that all of them hash to the same bucket. So, that turns the usual case of an insertion requiring the traversal of a few collisions, at worst, to an insertion requiring traversal of all entries added so far.
If all keys hash to the same bucket, the entire linked list for the bucket must be traversed in order to insert a new key. During the traversal, each stored key must be compared with the insertion key to see if it is already present (which it wouldn't be in the attack scenario, but the hash table insertion code doesn't know that). As more and more keys get added, that traversal takes longer and longer, to a point where it can chew up some fairly significant CPU time. But how can an attacker control the key values that get stored?
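To make that cost concrete, here is a minimal sketch of a chained hash table (illustrative only, not any interpreter's real implementation) whose insert reports how many stored keys it compared against. Forcing every key into one bucket makes n insertions cost roughly n²/2 comparisons in total:

```python
class ChainedDict:
    """Minimal hash table with separate chaining: each bucket is a
    list of (key, value) pairs, scanned linearly on insert."""

    def __init__(self, hashfn, nbuckets=1024):
        self.hashfn = hashfn
        self.buckets = [[] for _ in range(nbuckets)]

    def insert(self, key, value):
        """Insert key/value; return how many stored keys were compared."""
        bucket = self.buckets[self.hashfn(key) % len(self.buckets)]
        comparisons = 0
        for i, (k, _) in enumerate(bucket):
            comparisons += 1
            if k == key:               # key already present: replace it
                bucket[i] = (key, value)
                return comparisons
        bucket.append((key, value))    # not found: append to the chain
        return comparisons

# With attacker-chosen keys the hash degenerates to a single bucket
# (modeled here by a constant hash function): inserting n distinct keys
# costs 0 + 1 + ... + (n-1) comparisons.
d = ChainedDict(hashfn=lambda key: 0)
total = sum(d.insert("key%d" % i, None) for i in range(1000))
assert total == 1000 * 999 // 2   # 499500 comparisons for 1000 keys
```

At the tens of thousands of keys that fit in a single POST, that quadratic blowup is what turns a routine request into minutes of CPU time.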
That's where web frameworks come into play. Those frameworks helpfully collect up all of the POST form data that gets sent to a particular web page into a dictionary that gets handed off to the application for processing. The amount of data that the frameworks will allow is typically rather large (e.g. 1MB), so an attacker can create tens of thousands of form entries in a single HTTP request. Because they all collide, those entries can take tens of seconds to a few minutes to process on an i7 core according to the advisory [PDF] released by Wälde and Klink.
Because the languages do not randomize the hash function used in any way, the collisions that get calculated are usable for a wide variety of web applications. It is likely that collisions calculated for Python would work on most Python-based web applications for example. The presentation (slides [PDF]) mentions two different techniques to find collisions that are computationally feasible, but they are both "offline" calculations and cannot be used if the hash function changes between invocations and sites. Thus, the suggested fix for the problem is to choose a random seed that gets mixed into the hash so that collisions are not predictable. That seed would be chosen at language start-up time so that each invocation essentially had its own hash function.
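A sketch of that suggested fix, again using the toy DJB2-style hash as a stand-in (the affected languages seed their own, different hash functions; this is only illustrative): a random byte chosen once at start-up is mixed into every step, so keys precomputed against one seed scatter under another:

```python
import itertools
import os

_SEED = os.urandom(1)[0]   # chosen once, at interpreter start-up

def seeded_hash(s, seed=None):
    """Toy DJB2-style hash with a per-process random seed mixed into
    each step; collisions computed offline against one seed do not,
    in general, survive a different seed."""
    if seed is None:
        seed = _SEED
    h = 5381
    for ch in s:
        h = (h * 33 + (ord(ch) ^ seed)) & 0xFFFFFFFF
    return h

# Keys crafted offline against the unseeded hash (seed 0): the
# "Ez"/"FY" block construction makes them all collide there...
keys = ["".join(blocks)
        for blocks in itertools.product(("Ez", "FY"), repeat=8)]
assert len({seeded_hash(k, seed=0) for k in keys}) == 1
# ...but under a different seed they no longer share a single bucket.
assert len({seeded_hash(k, seed=1) for k in keys}) > 1
```

Note that the seed here is mixed into every step: merely changing a multiplicative hash's initial value would leave the equal-length block collisions intact, which hints at why a simple random seed alone may not be a complete fix.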
There are workarounds for the problem as well. Web frameworks could (probably should) limit the amount of form data that can be processed and/or the number of form elements that are allowed in a given POST. Another potential workaround is to limit the amount of CPU time that a web application is allowed to use. If a given POST can only soak up ten seconds of a core, for example, it will take more of them (i.e. more bandwidth) to fully saturate a given server or set of servers.
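A hedged sketch of the framework-side workaround (the function name and limit are hypothetical, not any real framework's API): reject a POST body with too many fields before ever building the dictionary:

```python
MAX_FIELDS = 1000   # hypothetical cap; real frameworks chose their own

def parse_post_body(body):
    """Parse 'a=1&b=2'-style form data, refusing suspiciously many
    fields before any dictionary insertions happen."""
    pairs = body.split("&")
    if len(pairs) > MAX_FIELDS:
        raise ValueError("too many form fields: %d" % len(pairs))
    form = {}
    for pair in pairs:
        key, _, value = pair.partition("=")
        form[key] = value
    return form

assert parse_post_body("user=alice&page=2") == {"user": "alice", "page": "2"}
```

The check costs one linear scan of the body, so legitimate requests are unaffected while the quadratic attack never gets started.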
While randomizing the hash function is suggested, it may not be a complete solution. Discussion on the python-dev mailing list and in Python bug 13703 indicates that more is needed to eradicate the problem. As Paul McMillan put it: "I've gotten confirmation from several other sources that the fix recommended by the presenters (just a random initialization seed) only prevents the most basic form of the attack." For long-running Python interpreters, it may be feasible for an attacker to brute force or otherwise determine the random seed that was used.
Determining the random seed will be easier to do with an "oracle function", which is some mechanism that an attacker can use to get information about the hash values of various strings. It does not require that the "raw" hash values be returned, necessarily, as other information (like JSON dictionary key ordering, for example) can be used to deduce the seed. Those are much harder attacks than the simple "POST with thousands of collisions" attack that is available today, however.
Other languages, including Perl and Ruby, have fixed the simpler problem, but Python is looking at going further. One way to do that would be to use a much larger seed and to choose pieces of the seed to mix into the hash based on properties of the input string, which essentially increases the search space for brute force and other means of deducing the seed.
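One hedged sketch of that idea, once more using the toy hash (the selection scheme below illustrates the approach only; it is not the actual proposal under discussion): keep a large block of random bytes and let properties of the input pick which bytes get mixed in:

```python
import os

_SECRET = os.urandom(256)   # a much larger per-process secret

def wide_seed_hash(s):
    """Toy hash mixing in secret bytes selected by properties of the
    input (its length, plus each character's value and position),
    enlarging the hidden state an attacker must recover before
    collisions become predictable."""
    h = 5381 ^ _SECRET[len(s) % 256]
    for i, ch in enumerate(s):
        h = (h * 33 + (ord(ch) ^ _SECRET[(ord(ch) + i) % 256])) & 0xFFFFFFFF
    return h

# Still a consistent hash within a single process run:
assert wide_seed_hash("spam") == wide_seed_hash("spam")
```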
PHP has also started to consider fixes. According to the advisory, Oracle has determined that no fix will be applied to Java itself, but that the GlassFish application server will be updated to fix or work around the problem.
The 2003 research [PDF] by Scott A. Crosby and Dan S. Wallach clearly showed the flaw, but didn't apply it to any other language beyond Perl. That led to the Perl fix, but even a heads-up message to python-dev by Crosby was not enough to cause a fix in Python at that time. Part of the problem was the inability to point to an easy way for an attacker to control the keys put into a dictionary. Evidently, the web frameworks were not considered at that time.
According to McMillan, though, the web frameworks are just the tip of the iceberg in terms of the scope of the problem. Various Python developers had suggested that the problem should be fixed in the web frameworks (as Java is evidently doing) or in a few standard libraries (like urllib, cgi, and json), but McMillan states that the problem goes much further than that.
Work on testing and benchmarking changes for Python is ongoing, but the plan is to provide a security release for 2.6, 2.7, and 3.3, while also providing backported patches for earlier Pythons. PHP is discussing options, including placing limits on the length of hash collision chains and limiting the number of form variables that will be processed, while still planning to add randomization into the hashing function.
It is interesting to see a nearly ten-year-old flaw pop back up, though it seems likely that the flaw hasn't been exploited widely (or fixes would have come sooner). It does serve as a reminder that what may seem like theoretical vulnerabilities will, over time, often become actual vulnerabilities. It's sometimes hard to consider making changes for "theoretical" attacks, which may well be the right choice at the time, but it is worth pondering the likelihood that eventually the change will need to be made. Like most security questions, as with the rest of software development, it's always a matter of trade-offs.
