Using large data sets

Peter Norvig (the Director of Research at Google) started off his ETech presentation with a diagram showing how things used to be (back in the old, old days … like 1994):


At the core, in the past, was the algorithm. Inputs were pretty simple (mouse clicks, keyboard entry). Outputs were equally simple (text, user interface). Data was used simply as a store of input and output. All of the effort and focus went into creating smart algorithms.

However, the massive data sets that Google now has access to allow them to flip this model around. Rather than creating complex, elaborate (and probably inaccurate) algorithms by hand, they instead use a simple statistical model and let the data do the work.

He gave several examples. The most obvious is the Google spell checker, which uses this approach to guess what you might have meant, even where the words you’re looking for don’t appear in any dictionary (e.g.

Another is their translation tool, which can be trained to translate any text for which there are enough examples to “learn” from. Ironically, the limiting factor with this approach is now not the software but the quality of the human translators used for training.

In each case being able to do this simply comes down to having enough data.
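As a rough illustration of "simple statistical model plus data", here is a minimal sketch of a frequency-based spelling corrector. It is not Google's implementation, and the talk summary above does not spell out the algorithm. The toy corpus, the one-edit limit, and the names `WORD_COUNTS`, `edits1` and `correct` are all assumptions made for the example. The only "knowledge" in it is a word count table, so feeding it more text is what makes it better.

```python
import re
from collections import Counter

# Toy corpus standing in for the "massive data set" (illustrative assumption).
CORPUS = """
the quick brown fox jumps over the lazy dog
the data does the work when there is enough data
spelling correction works better with more data
"""

# The whole "model" is just how often each word was seen.
WORD_COUNTS = Counter(re.findall(r"[a-z]+", CORPUS.lower()))
LETTERS = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """Return every string one delete, swap, replace or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    swaps = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in LETTERS]
    inserts = [a + c + b for a, b in splits for c in LETTERS]
    return set(deletes + swaps + replaces + inserts)


def correct(word):
    """Return the most frequent known word within one edit, else the word itself."""
    word = word.lower()
    if word in WORD_COUNTS:
        return word
    # Sort so that ties between equally frequent candidates resolve the same way every run.
    candidates = sorted(w for w in edits1(word) if w in WORD_COUNTS)
    if not candidates:
        return word
    return max(candidates, key=lambda w: WORD_COUNTS[w])
```

With this toy corpus, `correct("speling")` returns `"spelling"` and `correct("dtaa")` returns `"data"`. A word the corpus has never seen anything close to, such as `"xyzzy"`, comes back unchanged. Nothing in the code knows anything about English; swap in a bigger or different corpus and the behaviour changes accordingly.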

This is one of those ideas which is so obvious after you’ve seen it:

If you have lots of data the way you think about algorithms changes completely

9 thoughts on “Using large data sets”

  1. Nice.

    From what I can tell, Bayesian statistics is at the heart of much of Google’s success. They’ll be using it for fraud/spam detection (what’s the probability this page is spam given it has the following attributes?), AdSense rankings (the dreaded quality score), behavioural targeting, spelling correction, etc.

    It’s both scalable and adaptive – you just need to feed it lots of new data and have your engineers tweak the algo every now and then as they learn more about attributes that are good discriminators… and even that process is informed mainly by data mining.

    If you want to see something crazy cool the folks at Accenture labs have done with Bayesian classifiers and public auction data, have a look at the paper linked here:

  2. I reckon one of the most interesting data sets Google are compiling is based on the free IVR directory service they’ve built for the US (Goog411).

    They’re compiling a huge set of voice samples which must be a powerful way to train their speech recognition… which in turn could power the next generation of contextual advertising. AdSense for video is kinda hit and miss at the moment, but imagine if it was as good as the targeting for regular site ads.

  3. Ben said…
    Bayesian statistics is at the heart of much of Google’s success.

    Umm, I doubt it. If they use it, then it is being applied to low-level applications. Also, if they use it, then it is very likely to be a hybrid Bayes (with neural networks, support vector machines, rough sets, etc.) that they developed. Hybridising Bayes with other algorithms decreases its classification error rate. That means the hybrid version is more robust than, and outperforms, the stand-alone version.

    The latest state-of-the-art techniques to emerge from research are skewing heavily towards numerical algebra algorithms. Some of these algorithms were highlighted at the Workshop on Algorithms for Modern Massive Data Sets (MMDS 2006), held at Stanford in 2006, which reps from Google, Microsoft, Yahoo, IBM and others attended.

    The interesting thing I have noted about some of the emerging algorithms highlighted at the MMDS 2006 workshop is the resurrection of 100-year-old tensor calculus, which is multi-dimensional algebra. Einstein developed his theory of general relativity using tensor calculus in 1916, but tensors haven’t been used much outside theoretical physics since then. Data analysts and mathematicians have only recently discovered their huge potential for the analysis of massive datasets.

    Current search engines, be they link-based (Google PageRank) or content-based (word document search), use a 2-dimensional matrix (rows and columns of numerical frequency data). In PageRank, the input data is a 2D matrix of outward links by inward links; similarly, in content-based search, the input is a 2D matrix of words-by-documents frequencies. With the emergence of multi-linear algebra techniques such as tensor calculus, the input can be a 3D, 4D, 5D, etc. matrix. A 4D matrix could be in-links by out-links by words by documents (pages). Tensor methods can solve this at once, instead of the current techniques that solve two 2D matrices with 2 different algorithms. Solving for different metrics (dimensions or attributes) at once is more robust than solving for each separately.

    I wouldn’t be surprised if Google is already working on bringing tensors to its search engine (or perhaps they already have). I am sure that others, such as Microsoft, Yahoo, etc., must be on to it as well. An interesting future for web search is coming.

  4. Hey Falafulu.

    You may well be right, considering I’ve never even heard of Tensor calculus! (thanks for the links, by the way). I guess the only folks who really know Google’s secret sauce are the Google engineers, and I doubt they’ll be disclosing their methods in detail any time soon.

    Most of the applications I was thinking of when I wrote the comment above were fraud or quality detection ones, and it seems to me that this is where they would be applying Bayesian stats approaches (with or without hybridization). IMO, the fact that Google is able to weed out a good deal of the crap from its results despite the constant gaming that is going on is a significant contributor to its success.

    Returning results to a search on the fly (i.e., picking best matches, assigning rank based on quality score etc.) is a different kettle of fish – I have no idea what they would be doing there. But if there is any opportunity to represent the universe of web pages in a similar way to how Einstein represented the physical universe, that’s likely to be too tempting for the geeks at Google to ignore!

  5. Hello! I want to ask if you know of any fraud dataset, such as credit card transactions or medical health fraud, because I want to test my fraud detection algorithm and I can’t find a dataset that I could use.
    Could you help me?

