Showing posts with label Randomness. Show all posts

Wednesday, September 14, 2011

Can you win the lottery too many times?

Last year I posted on The Fabled 25 Sigma Event, referring to a quote from David Viniar, then CFO of Goldman Sachs, who was attempting to describe the magnitude of the movements in the financial markets. Mr. Viniar probably did not fully appreciate the implications of what he was saying, since a 25 sigma event translates into a phenomenon occurring once every 10^{135} years - far longer than the current age of the universe. Several researchers at the business school of University College Dublin offered another interpretation of just how unlikely this event is: it equates to winning the UK lottery more than 20 times in a row.
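
As a sanity check on that 10^{135} figure, here is a short Python sketch (my own, not from the original post) computing the one-sided Gaussian tail probability at 25 sigma, assuming one market observation per trading day (about 260 per year - an assumption, since the quoted figure depends on the sampling frequency):

```python
from math import erfc, sqrt, log10

# One-sided tail probability of a 25-sigma move under a normal distribution:
# P(X > 25*sigma) = erfc(25 / sqrt(2)) / 2
p = erfc(25 / sqrt(2)) / 2

# With one observation per trading day (~260 per year), the expected wait
# for a single such move is 1 / (p * 260) years.
wait_years = 1 / (p * 260)
print(f"P(25-sigma) is about 10^{log10(p):.0f}, wait is about 10^{log10(wait_years):.0f} years")
```

Under these assumptions the wait comes out on the order of 10^{135} years, matching the figure quoted above.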

Winning the lottery 20 times does seem very unlikely. Recently a woman won the Texas lottery for the fourth time in about 10 years, accumulating prize money of just over 20 million USD, and she is being scrutinized by the press for potential fraud. There is a lot of suspicion about the luck of Joan Ginther and her winning streak: googling “4 time lottery winner” produces pages of articles on her supposed luck.


Nathaniel Rich ran an interesting 4-page story in the August issue of Harper’s magazine, where he visits the small Texas town of Bishop to look at the lone town store where three of the winning tickets were purchased. Rich spoke to enough mathematics professors beforehand to determine that the odds of an individual winning four times by pure luck are extremely low indeed, about 10^{-24} - a practical impossibility (though still far more likely than a 25 sigma event). The alternative explanations are (1) an inside job, potentially involving the state lotteries and their suppliers, (2) cracking the parameters of the pseudo-random number generator used to select the winners, and (3) dumb luck, helped along by buying many tickets to increase the odds of winning. The most likely answer seems to be a combination of (2) and (3).

The local townspeople are going with scenario (3), or just ascribing it all to pure luck, as there is a strong (American) belief that everyone can be a winner. Getting back to those 25 sigma events: it seems that no one could actually win the UK lottery over 20 times, as they would be suspected of foul play and likely find themselves arrested well before that many wins. Perhaps Mr. Viniar should have been arrested for his remarks.

Friday, September 9, 2011

Two victories for Randomness

I recently came across two smallish examples where randomness was the solution to a perplexing problem. That is, rolling the dice can get you out of a situation where a planned method was not giving you what you wanted.

The first is the problem of how to board passengers on a plane. Finding the best way to board people is actually a well-studied problem, both theoretically and in practice, and you can see some of the work here. At the top of the same page there is a nice simulation program which shows how different boarding strategies play out, and random boarding (just calling out people to board at random) turns out to be better than the usual front-to-back boarding that most of us are familiar with.


The reason is that random boarding gives better utilization of the space in the plane, whereas front-to-back boarding piles people into one part of the plane, eventually causing jams in the aisles. The full set of strategies examined is:

  • Back-to-front
  • Rotating-zone
  • Random
  • Block
  • Outside-in
  • Reverse-pyramid
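
As a toy illustration of why random boarding can beat front-to-back, here is a deliberately simplified single-aisle model of my own - one seat per row, a fixed stow time, all parameters invented - and not the simulation from the linked page:

```python
import random

def board(order, stow_ticks=4):
    """Simulate single-aisle boarding; `order` lists each passenger's row
    (0-based, one seat per row) in boarding sequence. Returns total ticks."""
    rows = max(order) + 1
    target = {p: r for p, r in enumerate(order)}
    stow_left = {}                  # passenger -> remaining stow ticks
    aisle = [None] * rows           # aisle[i]: passenger standing at row i
    next_p, seated, t = 0, 0, 0
    while seated < len(order):
        t += 1
        for i in range(rows - 1, -1, -1):   # frontmost passengers move first
            p = aisle[i]
            if p is None:
                continue
            if i == target[p]:
                stow_left[p] -= 1
                if stow_left[p] == 0:       # done stowing: sit, free the aisle
                    aisle[i] = None
                    seated += 1
            elif aisle[i + 1] is None:      # walk one row deeper if clear
                aisle[i + 1], aisle[i] = p, None
        if next_p < len(order) and aisle[0] is None:
            aisle[0] = next_p               # next passenger enters the aisle
            stow_left[next_p] = stow_ticks
            next_p += 1
    return t

front_to_back = list(range(20))     # front rows board first
shuffled = front_to_back[:]
random.seed(1)
random.shuffle(shuffled)            # random boarding order
```

In this model, strict front-to-back boarding serialises the stowing delays - each passenger waits in the aisle behind the one stowing ahead - while a shuffled order lets several passengers stow simultaneously.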

On another topic, a Freakonomics blog post describes how researchers in South Africa are using a randomness trick to get truthful answers from farmers suspected of illegally killing leopards and hyenas. The method is called randomized response surveying: when a farmer is asked a potentially incriminating question, he first flips a coin and then, based on the result, gives a yes or no answer either to the incriminating question (heads) or to a harmless question such as “do you think the Springboks will win the RWC?” (tails). The farmers in the study actually used a die, taking specific actions depending on which value from 1 to 6 was thrown, but the principle is the same as I have described it.

The trick here is that the person asking the question cannot tell which question the farmer is answering, even though the answer itself is recorded. Statistical methods can then be used to untangle the distribution of answers to the two questions, and to make inferences about the proportion of positive answers to the incriminating question. The method was devised in the 60’s, and by the early 80’s it was being taught at my undergraduate university as part of a first-year course.
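
A quick sketch of how the recovery works, for the coin-flip version described above (the rates below are invented for illustration): since half the respondents answer the sensitive question and half the harmless one, P(yes) = 0.5p + 0.5q, so the sensitive rate p can be recovered as 2·P(yes) − q.

```python
import random

random.seed(42)

TRUE_RATE = 0.3       # assumed fraction of farmers with a "yes" to the sensitive question
INNOCENT_YES = 0.5    # assumed chance of "yes" to the harmless question
N = 100_000

yes = 0
for _ in range(N):
    guilty = random.random() < TRUE_RATE
    if random.random() < 0.5:             # heads: answer the sensitive question
        yes += guilty
    else:                                 # tails: answer the harmless question
        yes += random.random() < INNOCENT_YES

# P(yes) = 0.5*p + 0.5*q  =>  p = 2*P(yes) - q
p_hat = 2 * (yes / N) - INNOCENT_YES
```

No individual answer reveals anything incriminating, yet the aggregate estimate p_hat converges on the true sensitive rate.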

Friday, February 12, 2010

Another source of USB Randomness

Back in December I posted about a new USB device with dedicated hardware for producing a continuous stream of high-entropy bits based on sampling P-N junctions. Another, lower tech randomness source with a USB interface is described at this site.

The device has a small hourglass, and as the sand falls from the upper to the lower chamber, the pattern of grains is sampled by a light-sensitive detector at a rate of 100 times per second. The site claims that each sample yields about 9 bits of entropy based on statistical tests. The device detects when all the sand has passed to the lower chamber and then rotates the hourglass 180 degrees, so the sampling process can continue. The samples can be accessed through a USB interface.

The device is a prototype and not yet for sale, but costs about $100 to produce. The claimed advantage of the hourglass method over more sophisticated, higher yielding devices is simplicity and transparency. Perhaps so - you can read more about the design here, and the entropy of the output here. Finally, from the site itself:

While the hourglass is not precise, accurate, or repeatable as a timekeeper, and has been almost completely supplanted by better devices, it is a good source of random entropy. It is still manufactured in quantity at low cost, and it is clean, compact, durable, and uses little energy. The source of the random entropy can be easily understood, and observed to be functioning correctly without instruments. An off-the-shelf photointerrupter can be employed to electronically observe the random entropy, and an open-source, standardized microcontroller can be used to control the process and interface it with a host computer.

Tuesday, March 31, 2009

Randomness Tests for Packed Malware

In my post Risk Factors for AV Scanning, I discussed a recent patent filing by Kaspersky related to reducing the amount of malware scanning based on various file properties. One of those properties was whether the file is packed or not, since packed files should be treated more suspiciously than plain unpacked files.

The impact of packed malware is discussed more thoroughly by Tzi-cker Chiueh of Symantec Research Labs in a 2007 presentation on the difficulties in detecting malware, where he devotes 5 slides to the "Packer Problem". Packing adds another dimension to the production of malware signatures, which must now account for potentially ten or more layers of packing obscuring the real code that needs to be scanned. Symantec knows of over 1000 different packing programs, all of which need to be recognised and stripped off before signature scanning can be applied.

But is it possible to distinguish merely suspicious packed files from actual packed malware using statistical properties of the files, without unpacking them?

Three authors (Tim Ebringer, Li Sun and Serdar Boztas) have done some interesting research into developing randomness tests specifically for detecting packed malware. As it turns out, packed malware does exhibit a definite randomness signal when the randomness measure accounts for local patterns and structure.

The paper begins by observing that common randomness measures, such as entropy and other statistical tests, compute a global measure of randomness over all the available data, and so tend to obscure local sections of code that may be highly random or highly structured. A figure in the paper shows the distribution of bytes in the program calc.exe before and after packing with UPX: the distribution becomes more uniform after packing, but is still far from uniform.
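
The point about global measures is easy to demonstrate. In the sketch below (my own construction, not from the paper), a file that is half completely predictable and half random still scores a high overall byte entropy, hiding the structured half:

```python
import math
import random
from collections import Counter

def byte_entropy(data):
    """Shannon entropy of the byte distribution, in bits per byte (0 to 8)."""
    counts = Counter(data)
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

random.seed(0)
structured = b"\x00" * 4096                                  # fully predictable
packed = bytes(random.randrange(256) for _ in range(4096))   # random-looking

# Concatenation is half structure, half noise, but the single global
# entropy number stays high and reveals nothing about the layout.
mixed = structured + packed
```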


The authors propose several randomness tests that preserve locality (structure, or the lack of it), based on constructing a Huffman tree at the byte level of the packed file. The Huffman algorithm assigns a code to each observed byte, with more frequent bytes receiving shorter codes. If the data is random, so that the byte frequencies are all similar, then the Huffman codes will all be of roughly similar length. Structured data, on the other hand, produces codes of quite different lengths.
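
A minimal sketch of the idea (not the authors' code): build Huffman code lengths from the byte frequencies, then compare uniform data, where every code comes out the same length, with skewed data, where the lengths differ:

```python
import heapq
import itertools
from collections import Counter

def huffman_code_lengths(data):
    """Huffman code length, in bits, for each byte value occurring in `data`."""
    freq = Counter(data)
    tie = itertools.count()                  # tie-breaker for equal weights
    heap = [(n, next(tie), {sym: 0}) for sym, n in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)      # merge the two lightest subtrees;
        w2, _, d2 = heapq.heappop(heap)      # every symbol inside them gets
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}  # 1 bit deeper
        heapq.heappush(heap, (w1 + w2, next(tie), merged))
    return heap[0][2]

uniform = bytes(range(256)) * 16         # every byte equally frequent
skewed = b"a" * 100 + b"b" * 10 + b"c"   # highly structured
```

On the uniform data every byte gets an 8-bit code; on the skewed data the frequent byte gets a 1-bit code and the rare bytes get longer ones.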



The authors then sample the data at various byte positions, collect the corresponding Huffman code lengths, and normalise them to between 0 and 1 (where a higher value means more random). The byte sampling strategies proposed are (1) sampling a fixed number of bytes equally spaced in the data, and (2) sliding a window of fixed size across all the data. The paper shows the sliding window test applied to the basename binary from UnxUtils before and after packing with FSG 2.0.
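
The sliding-window strategy can be sketched as follows. This is an approximation of the paper's test - it uses −log2 of each byte's global frequency as an ideal code length, rather than an actual Huffman tree - but the effect is the same in spirit: structured regions score low, random-looking regions score high.

```python
import math
import random
from collections import Counter

def sliding_randomness(data, window=64, step=32):
    """Normalised randomness score in [0, 1] for each window position.
    Per-byte "code length" is -log2 of the byte's global frequency
    (a stand-in for Huffman codes; rarer bytes get longer codes)."""
    n = len(data)
    freq = Counter(data)
    codelen = {b: -math.log2(freq[b] / n) for b in freq}
    longest = max(codelen.values()) or 1.0
    scores = []
    for start in range(0, n - window + 1, step):
        w = data[start:start + window]
        mean_len = sum(codelen[b] for b in w) / window
        scores.append(mean_len / longest)   # 1.0 = most random-looking
    return scores

random.seed(7)
# Structured first half, random-looking second half:
sample = b"\x00" * 256 + bytes(random.randrange(256) for _ in range(256))
scores = sliding_randomness(sample)
```

Plotting `scores` against window position reproduces the kind of profile the paper shows: low over the structured region, high over the packed-looking region.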


UnxUtils has 116 binaries ranging in size from a few KB to about 200 KB, and the authors performed a collection of experiments to validate their local randomness tests using 6 well-known packers. They observe that packed files show a definite characteristic signal, and they go on to propose additional tests that can be used to discriminate between packers.

Returning to Tzi-cker Chiueh and his packer problem: he suggests working around packing with a just-in-time scanning approach, which scans a file in its unpacked form just before it runs, and also considering whitelisting. Perhaps the local randomness tests of Ebringer, Sun and Boztas can help him out.



Sunday, July 13, 2008

A Blackish Swan for Debian Crypto

In May Luciano Bello discovered a flaw in the Debian distribution of Linux which, as described in CVE-2008-0166, yielded "a random number generator that generates predictable numbers, which makes it easier for remote attackers to conduct brute force guessing attacks against cryptographic keys". While this sentence is unlikely to win any awards for clarity, the graveness of the situation should nonetheless be apparent.

Since September 2006 the Debian Etch distribution has been producing weak (public) keys for well-known protocols such as SSL and SSH, as well as certificates for personal encryption and signing. A few days after the announcement SANS raised its INFOCon level to yellow, indicating an increased threat against the general internet infrastructure (the INFOCon level has since returned to green). While the offending Debian code is easily patched (only two lines need to be uncommented) it will take much longer to assess the impact of over 18 months of weak keys being handed out by Debian.

The Best Laid Plans

The root cause of the incident centres around memory allocation. Programs frequently request additional memory when they are running. The operating system (OS) satisfies such requests by finding an available block of memory, marking the block as allocated, and returning the starting address of the block to the requesting program for access.

The memory allocated to a program request will not be empty - the block will always contain some values in terms of zeros and ones. These values are typically what was left behind by the last program that used the allocated locations.

When a program is allocated memory, it is good programming style to assign known values to the locations, so that subsequent computations involving those locations begin from a known state. For example, if one of the allocated locations is used for a variable hits that counts the number of visitors to your web site, then hits should be initialised to zero. If the location previously held 192 or -517, then simply incrementing hits for each new visitor will not achieve the desired result.

Code scanning tools are able to detect instances of memory that are allocated to a program but are not initialised before being used. The scanning tools will return warnings and advise that initial values be assigned. In general this is good advice unless the programmer had an unconventional reason for using uninitialised memory.

But the programmers of the OpenSSL PRNG apparently had such a reason. They actually wanted the memory allocated for the seed of the PRNG to contribute its previous contents as part of the seed. Later, a Debian programmer, seeing that code scanning tools were complaining about uninitialised memory in the OpenSSL PRNG, took the well-intentioned step of commenting out the offending code, removing it from the resulting executable program.

The offending code consisted of changes to just two lines.

Commenting out these two lines (unfortunately) did not stop the PRNG from working, but had the effect of reducing its seed to just the process identifier (PID) of the program. The PID is a value used by the OS to distinguish one process from another for administration purposes. The PID is typically a 15-bit value, able to represent numbers in the range 0 to 32,767 - fewer than 33,000 distinct values in total. Cryptographic keys generated from a PRNG seeded this way are selected from only a small number of possible values and can therefore be easily broken. Instead of being based on hundreds or thousands of bits of entropy, keys were based on at most 15 bits of entropy.
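
To see why 15 bits of entropy is fatal, here is a toy model - not the actual OpenSSL code path - in which key material is derived from a PRNG seeded only with a PID. An attacker recovers the key by brute force over all 2^15 possibilities:

```python
import random

def derive_key(pid):
    """Toy stand-in for key generation whose only entropy is a 15-bit PID.
    This is an illustration, not the real OpenSSL derivation."""
    rng = random.Random(pid)          # the PID is the entire seed
    return rng.getrandbits(128)       # pretend this is the key material

victim_key = derive_key(12345)        # a victim generates a "key"

# The attacker simply tries every possible PID until the keys match:
recovered_pid = next(pid for pid in range(2 ** 15)
                     if derive_key(pid) == victim_key)
```

Enumerating 32,768 candidates takes a fraction of a second, where a properly seeded key would require searching an astronomically large space.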

So in summary: the seed generation process for OpenSSL PRNG was designed to rely on using previously assigned values to dynamically allocated memory. This unusual programming practice was not well-communicated, and a subsequent Debian programmer removed the two lines of code that seeded the PRNG through memory allocation. The effect was to produce a PRNG that was only able to produce 33,000 different values, which leads to a predictably small number of cryptographic keys.

The impact of changing two lines of code


The scope of the impact of Debian creating weak keys is multidimensional: the time scales, other code distributions, security protocols, security credentials and finally data. With respect to time, the PRNG programming flaw has been present in Debian since September 2006 (the Etch release), and was not publicly detected until May this year, meaning that weak keys have been generated on the Debian distribution for over 18 months now. Further, other Linux distributions derived from Debian, including versions of Knoppix and Ubuntu, have also inherited the flaw over this time period.

Security protocols whose underlying cryptography was weakened include VPN, SSH, SSL and its derivatives such as FTPS, as well as all X.509 key material (general public/private key pairs). The full list of impacted security protocols and applications is quite extensive. Further, weak SSL certificates bound to a domain name, such as www.mycompany.com, will need to be revoked and reissued by the company administering that domain. An article in the Register, called Debian's Epic SSL Blunder, states that the number of SSL certificates needing replacement could be in the hundreds of thousands or even millions. So while the OpenSSL PRNG code is easily patched, identifying and replacing all the weak keys generated by the flawed code is a big operational headache. It may be months or years before all the weak keys and their corresponding certificates are tracked down and upgraded.

Another serious threat is the potential exposure of data, including security credentials such as passwords, which were transported over public networks under the false assumption that they were protected by strong cryptography. It has already been suggested in a Heise Security article that secret service organisations have been exploiting the weaknesses in communication.

A Blackish Swan?

According to Wikipedia, a black swan "is a large-impact, hard-to-predict, and rare event beyond the realm of normal expectations". Are we dealing with a Debian Black Swan? It is too early to judge this incident an outright Black Swan, or the catalyst for an imminent one, since the consequences have not fully played out. Certainly the incident has highlighted potential weaknesses in the open source programming model - if not in general, then at least in the interaction between open source developers and vendor developers. Ben Laurie of OpenSSL originally blasted Debian for changing code that they "did not understand", but in a subsequent post his tone was more conciliatory, as evidence emerged that the change had been discussed on OpenSSL mailing lists. His final remark was "I welcome any suggestions to improve this situation".

Flaws related to encryption always make good copy, and on occasion, strike at the heart of our fundamental beliefs in security. When encryption falters the whole edifice of security seems shaken.

You can find the research used to produce this post as a FreeMind mindmap rendered into Flash here.