Showing posts with label Google.

Wednesday, May 26, 2010

Google could help enforce new German wireless protection law

A German court has ruled that home users are responsible for password-protecting their home wireless networks, and failing to do so could result in a fine of 100 euros. The ruling stems from a case in which a musician sued the owner of a home WiFi network for illegally downloading his music, but since the network owner was away on holiday, the open network was in fact being used by a third party.

The maximum fine of 100 euros (or about $120) is about the same as a hefty speeding ticket, and with 26 million WiFi enabled German households, that could add up. All those Google Street View spycars could help out with enforcement of the law, since they have been collecting wireless information anyway.



Tuesday, May 18, 2010

Two universities rethink Gmail migration plans

The University of California at Davis (UCD) and Yale University were considering moving their email systems onto Gmail, but both have put those plans on hold for the moment. The CIO of UCD, Peter Siegel, said that he was not prepared to risk the security or privacy of the school’s 30,000 faculty and staff.

Yale has delayed a more general migration to Google Apps, including Gmail, citing security and privacy concerns over cloud-based management of its data. Michael Fischer, a computing professor, said that
Google stores every piece of data in three centers randomly chosen from the many it operates worldwide in order to guard the company’s ability to recover lost information, but that also makes the data subject to the vagaries of foreign laws and governments, Fischer said. He added that Google was not willing to provide ITS with a list of countries to which the University’s data could be sent, but only a list of about 15 countries to which the data would not be sent.
So there is a concern that the personal data of students and faculty is being stored outside US jurisdiction. However, neither UCD nor Yale ruled out migrating to Google cloud applications once there is adequate transparency about the protection of data.

Sunday, December 6, 2009

Tidbits: A5/1, WhiteListing and Google

There is a nice write-up in the very respectable IEEE Spectrum on the A5/1 rainbow table generation project run by Karsten Nohl for cracking GSM encryption. Coverage in the IEEE is a signal that his project has mainstream acceptance and attention. The timeliness of Nohl’s project was heightened a few weeks ago when MasterCard announced that it would be using GSM as an additional authentication factor in its transactions. This MasterCard feature may end up being very popular, since IDC recently predicted that the mobile phone market will exceed one billion handsets by the end of 2010. Hopefully this huge demand will accelerate the deployment of the stronger A5/3 algorithm that is being phased in during upgrades to 3G networks.

The evidence continues to mount that whitelisting is an idea whose time has come. Roger Grimes at InfoWorld published a large list of reviews on whitelisting products, and he was pleasantly surprised that “whitelisting solutions are proving to be mature, capable, and manageable enough to provide significant protection while still giving trustworthy users room to breathe”. Given that the amount of malware introduced in 2008 exceeded all known malware from previous years, viable alternatives to the current blacklisting model are needed.

Finally, Google recently announced that it will be offering its own DNS service, nominally to increase performance and security. Sean Michael Kerner wonders whether Google DNS is resistant to the attacks reported by Dan Kaminsky last year, which exploited common DNS implementations that relied on poor random numbers. H. D. Moore has forwarded Kerner some graphs he made from sampling Google DNS for randomness: first a plot of the sampled source ports, and then a plot of source ports against transaction IDs.

In both cases the graphs look quite random, and so both Kerner and Moore conclude that Google seems to be on the right track here. But Moore concedes that more testing remains to be done.
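As a rough illustration of this kind of randomness testing (my own sketch, not Moore's actual methodology), a crude chi-square statistic can flag clustering in sampled source ports or transaction IDs. Python's random module stands in for real captured DNS traffic, which would require a packet capture to obtain:

```python
import random
from collections import Counter

def uniformity_score(samples, bins=16):
    """Crude chi-square statistic: near the number of bins for uniform
    data, far larger when values cluster (a Kaminsky-style weakness)."""
    counts = Counter(s % bins for s in samples)
    expected = len(samples) / bins
    return sum((counts.get(b, 0) - expected) ** 2 / expected
               for b in range(bins))

# Stand-in samples: a well-behaved resolver should draw source ports
# from the full ephemeral range and transaction IDs from all 16 bits.
ports = [random.randint(1024, 65535) for _ in range(10000)]
txids = [random.getrandbits(16) for _ in range(10000)]

print(uniformity_score(ports), uniformity_score(txids))
print(uniformity_score([53] * 10000))   # a fixed port scores terribly
```

A real test would of course sample the resolver over the wire; the scoring idea is the same.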


Wednesday, December 2, 2009

Google Maps and Crypto Laws Mashup

Simon Hunt, VP at McAfee, has a great Google map mashup application that visually maps crypto laws to countries around the world, including individual US states. The map was last updated in September.


Monday, March 2, 2009

The Wisdom of a Random Crowd of One

(This is a repost as the old link stopped working)

There was a recent excellent post on the RiskAnalysis.Is blog reviewing a debate between security gurus Bruce Schneier and Marcus Ranum on the topic of risk management. The post summarized the debate as something of a stalemate, ending in agreement that the lack of data is the root cause of the unsatisfactory state of IT risk management. The post goes on to make some excellent points about risk and data, which deserve a post of their own to describe and ponder. Here I will make a few points about data, models and analysis.

Data is a representation of the real world, observable but in general difficult to understand. The model is a simplification of reality that can be bootstrapped from the data. Finally, analysis is the tools and techniques that extract meaning from the model, which hopefully allows us to make material statements about the real world. Data, Model, Analysis, Meaning.

Let's take a look at how the famous PageRank algorithm creates meaning from data via analysis of a model. We can all really learn something here.

The Wisdom of a Random Crowd of One

The hero of the PageRank story is an anonymous and robotic random surfer. He selects a random (arbitrary) starting page on the internet, looks at links on that page, and then selects one to follow at random (each link is equally likely to be selected). On the new page, he again looks over the links and surfs along a random link. He happily continues following links in this fashion. However, every now and again, the random surfer decides to jump to a totally random page where he then follows random links once again. If we could stand back and watch our random surfer, we would see him follow a series of random links, then teleport to another part of the internet, follow another series of links, teleport, follow links, teleport, follow links, and so on, ad infinitum.

Let's assume that as our random surfer performs this mix of random linking and teleporting, he also takes the time to cast a vote of importance for each page he visits. So if he visits a page 10 times, then the random surfer allocates 10 votes to that page. Surprisingly, the PageRank metric is directly derived from the relative sizes of the page votes cast by this (infinite) random surfer process.
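This vote-counting process is easy to simulate on a toy web. The four-page graph and the 0.85 link-following probability below are my own illustrative choices:

```python
import random

def random_surfer_votes(links, steps=100_000, d=0.85, seed=1):
    """Follow a random outgoing link with probability d, otherwise
    teleport to a random page; tally one vote per page visit."""
    rng = random.Random(seed)
    pages = list(links)
    votes = dict.fromkeys(pages, 0)
    page = rng.choice(pages)
    for _ in range(steps):
        votes[page] += 1
        if links[page] and rng.random() < d:
            page = rng.choice(links[page])   # follow a link on this page
        else:
            page = rng.choice(pages)         # teleport anywhere
    return {p: v / steps for p, v in votes.items()}

# A toy four-page web where every page links to 'hub'.
toy_web = {'hub': ['a', 'b', 'c'], 'a': ['hub'], 'b': ['hub'], 'c': ['hub', 'a']}
print(random_surfer_votes(toy_web))   # 'hub' collects the most votes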

This seems deeply counterintuitive. Why would we expect the surfing habits of a random process to yield a useful guideline to the importance of pages on the internet? While the surfing habits of people may be time consuming, and sometimes downright wasteful, we probably all think of ourselves as more than random-clicking automatons. However the proof of the pudding is in the searching, and Google has 70% of the search market. So apparently when all of the erratic meanderings of the random surfer are aggregated over a sufficiently long period, they do in fact provide a practical measure of internet page importance. We cannot explain this phenomenon any better than by simply labelling it as the wisdom of a random crowd of one.

Breathing Life into the Random Surfer

The data that can be gathered relatively easily by web crawlers are the set of links on a given page and the set of pages they point to. Let's assume that there are M pages currently on the internet, where M is several billion or so. We can arrange the link information into an M x M matrix P = [ Pij ], where Pij is the probability that page Pi links to page Pj (Pij is just the number of links from Pi to Pj divided by the total number of links on Pi).

The matrix P is called stochastic since the sum of each row is 1, which simply means that any page must link to somewhere (if Pi has no links then it links to itself with probability 1). So P represents the probability of surfing (linking) from Pi to Pj in one click by a random surfer. The nice property of P is that P^2 = P*P gives the probability that Pj can be reached from Pi in two clicks by the random surfer. And in general P^N gives the probability that the random surfer ends up on page Pj after N clicks, starting from page Pi.
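A minimal sketch of this construction (the dict-of-lists input format is my own choice for illustration):

```python
def link_matrix(links):
    """Build the row-stochastic link matrix P from crawl data.
    links maps each page index to the list of page indices it points to;
    a page with no links is treated as linking to itself, as above."""
    m = len(links)
    P = [[0.0] * m for _ in range(m)]
    for i in range(m):
        out = links[i] if links[i] else [i]   # dangling page: self-link
        for j in out:
            P[i][j] += 1.0 / len(out)
    return P

# Three pages: page 0 links to 1 and 2, page 1 to 2, page 2 is dangling.
P = link_matrix({0: [1, 2], 1: [2], 2: []})
for row in P:
    assert abs(sum(row) - 1.0) < 1e-12        # each row sums to 1
```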

Can we say anything about P^N as N becomes large? Well if P is ergodic (defined below) then there will exist a probability vector

L = (p1, p2, ..., pM)

such that as N becomes large then

P^N = (L, L, ..., L)^t

This says that for large N, the rows of P^N are all tending to the common distribution L. So no matter what page Pi the random surfer starts surfing from, his long run page visiting behaviour is described by L. We learn quite a bit about the random surfer from L.

As we said above, the long run probability vector L only exists for matrices that are ergodic. Ergodic matrices are described by 3 properties: they are finite, irreducible, and aperiodic. Our matrix P is large but certainly finite. Two pages Pi and Pj are said to communicate if it is possible to reach Pj by following a series of links beginning at page Pi. The matrix P is irreducible if all pairs of pages communicate. But this is clearly not the case, since some pages have no links for example (so-called dangling pages). If our random surfer hits such a page then he gets stuck, and we don't get irreducibility and we don't get L.

To get the random surfer up and surfing again we make the following adjustment to P. Recall that we have M pages, and let R be the M x M matrix in which each entry is 1/M. That is, R models the ultimate random surfer, who can jump from any page to any other page in one click. Let d be a value between zero and one and create a new matrix G (the Google matrix) where

G = d*P + (1-d)*R

That is, G is a combination of P (real link data) and R (random link data). Our random surfer then follows links in P with probability d or jumps (teleports) to a totally random page with probability (1-d). The value of d will be something like 0.85.
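Sketching the blend directly from the formula (the three-page P below is just an illustrative example):

```python
# A simple three-page row-stochastic matrix: 0 -> {1,2}, 1 -> 2, 2 -> 0.
P = [[0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0],
     [1.0, 0.0, 0.0]]

def google_matrix(P, d=0.85):
    """G = d*P + (1-d)*R, where R is the uniform teleport matrix."""
    m = len(P)
    teleport = (1.0 - d) / m
    return [[d * P[i][j] + teleport for j in range(m)] for i in range(m)]

G = google_matrix(P)
assert all(abs(sum(row) - 1.0) < 1e-9 for row in G)   # G is still stochastic
assert all(entry > 0 for row in G for entry in row)   # so G is irreducible
```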

It should be easy to see that G is irreducible, since R enables any two pages to communicate in one click. Without going into details, G is also aperiodic since it is possible for a page to link to itself (as is also possible in P). So G is ergodic and we can in theory compute the long run page distribution L of the random surfer.

So now that we know L exists for G, it remains to compute it. We have not as yet considered that the number of pages M is several billion and growing. A direct representation of G as an M x M matrix would require storage on the order of 10^(18) bytes, or in the exabyte range (giga, tera, peta, exa). Luckily most pages have only a few links (say fewer than 20), so we can represent the link data P using sparse lists, which brings us back into the gigabyte range; G itself never needs to be stored explicitly, since the R component is uniform.

Computing L from G is a large but tractable computation. L is an eigenvector of G and there is an iterative algorithm for computing L from G called the power method. The power method begins with an approximation for L and improves on each iteration. The rate of convergence to the true value of L is geometrically fast in terms of the parameter d. Therefore we can compute the long run behaviour of our random surfer.
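A sketch of the power method on a small three-page Google matrix (a toy example; real implementations exploit the sparsity of P rather than storing a dense G):

```python
# Google matrix (d = 0.85) for three pages: 0 -> {1,2}, 1 -> 2, 2 -> 0.
G = [[0.05, 0.475, 0.475],
     [0.05, 0.05,  0.90],
     [0.90, 0.05,  0.05]]

def power_method(G, iterations=100):
    """Repeatedly apply L <- L*G; convergence is geometric in d."""
    m = len(G)
    L = [1.0 / m] * m                        # start from the uniform guess
    for _ in range(iterations):
        L = [sum(L[i] * G[i][j] for i in range(m)) for j in range(m)]
    return L

L = power_method(G)
assert abs(sum(L) - 1.0) < 1e-9              # L is a probability vector
# L is (numerically) stationary: one more click leaves it unchanged.
step = [sum(L[i] * G[i][j] for i in range(3)) for j in range(3)]
assert all(abs(L[j] - step[j]) < 1e-6 for j in range(3))
```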

The diagram below (available from Wikipedia) shows the PageRank analysis for a simple collection of pages. The arrows represent the link data and the pages are drawn in size relative to their PageRank importance. If we divided the number on each page by 100 this would be our long run probability vector L.

[image: PageRank example graph]

What did we learn?

Recall that at the beginning of the post I stated that we need to get beyond pining for data and start to think in terms of data, models and analysis (then meaning). If we now look at the PageRank algorithm we can break it down into

  • Data: Raw page linking represented in P
  • Model: The Google matrix G = d*P + (1-d)*R, a random perturbation of P
  • Analysis: The long run probabilities L of the random surfer.

It could be said that PageRank is one part brilliance and two parts daring. It is not obvious at all that L would produce informative page rankings. However the point is now moot and the wisdom of a random crowd of one has prevailed.

The absence of data has been a scapegoat for years in IT Risk. The real issue is that we don't know what data we want, we don't know what we would do with the data, and we don't know what results the data should produce. These are 3 serious strikes. It seems that everybody would just feel better if we had more data, but few people seem to know (or say) why. We are suffering from a severe and prolonged case of data envy.

In IT Risk we are stubbornly waiting for a data set that is self-modelling, self-analysing and self-explaining. We wish to bypass the modelling and analysis steps, hoping that meaning can be simply read off from the data itself. As if data were like one of those "Advance to Go" cards in Monopoly where we can skip over most of the board and just collect our $200. The problem is that we keep drawing cards that direct us to "Go Back Three spaces" or "Go to Jail".

Saturday, January 17, 2009

Google's Storm in a Teacup


The UK Times recently ran a story on the environmental impact of Google searches. The article asserted that two Google desktop searches (at 7g of CO2 each) had about the same carbon footprint as boiling a kettle for a cup of tea (at 15g of CO2). Following the link to the original story now shows that the article has been clarified after a flurry of criticism - I would like to say corrected, but it's not that simple.

Google quickly replied that its estimated cost of a search query was 0.2g of CO2, considerably less than the 7g given by the Times. The authors state that driving an average car for one kilometre produces as many greenhouse gases as a thousand Google searches, under the current EU standard for tailpipe emissions. The Times later clarified that by search it meant an activity lasting several minutes, and not simply a single Google query. Other people have also produced CO2 estimates for search in the 1g - 10g range, but upon inspection these estimates also assume search to be an activity that the user performs for 10 to 15 minutes, rather than a service provided by Google. Your computer has a 40g - 80g footprint per hour simply from being turned on and surfing around a bit.

Gartner has remarked that in 2007, for the first time, the greenhouse gases produced to power the Internet surpassed the total emissions of the global airline industry. This is quite ominous, since we normally don't think of airlines as being particularly clean or the Internet as being particularly dirty. But it's all about power consumption produced by burning fossil fuels. Power is mainly used by personal computers, the network and data centres. The largest power consumer is personal devices, and there is some debate over the order of second and third places. I think the network.

Part of the controversy of the original story was that the 7g search cost was attributed to young Harvard physicist Alex Wissner-Gross. The clarification to the Times article now includes a link to a new article by Wissner-Gross where he gives some opinions on CO2 emissions, but actually provides few details. Apparently his main work on Internet CO2 consumption is being reviewed by academic referees before formal publication, so we must wait for his detailed analysis. Oddly enough Wissner-Gross, who took umbrage at being attributed as the source for the 7g Google estimate, now openly states that the correct figure is 5g - 10g.

The original statements of the Times, and perhaps even those of Wissner-Gross, can be traced back to the work of Rolf Kersten, who presented a talk called "Your CO2 Footprint when using the Internet" at a German eBay conference in 2007. His findings are summarized in the table below.

Web Service                                        CO2 Emission
One Google search                                  6.8 g
One EBAY auction                                   55 g
One blog post on blogs.sun.com                     850 g
A SecondLife avatar, 24hrs "alive" for one year    332 kg

Yes, that is kilograms for the SecondLife avatar! Hr. Kersten has recently revised his estimates in light of recent events and technology improvements, and stated that he overestimated by a factor of 35:

I was wrong. Very wrong. Wrong by a factor of 35. Wrong even when you take into account that Moore's Law and Google engineers had 20 months to increase efficiency since my first guesstimate.

So now we have it: One Google Search produces as much CO2 as 10 seconds of breathing!

You can review the details of the calculation in a later article. Hr. Kersten is now in agreement with the 0.2g figure given by Google.
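The correction is easy to sanity-check with back-of-the-envelope arithmetic:

```python
# Kersten's original 2007 guesstimate, corrected by his factor of 35.
original = 6.8                    # grams of CO2 per Google search
corrected = original / 35
print(round(corrected, 2))        # close to Google's own 0.2 g figure

# At 0.2 g per search, searches equivalent to the Times' 15 g kettle:
print(15 / 0.2)
```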

So in the end it takes just more than 70 Google queries to consume the same energy as boiling a kettle.