Peter Norvig (the Director of Research at Google) started off his ETech presentation with a diagram showing how things used to be (back in the old, old days … like 1994):
In the past, the algorithm sat at the core. Inputs were pretty simple (mouse clicks, keyboard entry), outputs were equally simple (text, a user interface), and data was just a store of input and output. All of the effort and focus went into creating smart algorithms.
However, the massive data sets that Google now has access to allow them to flip this model around. Rather than hand-crafting complex, elaborate (and probably inaccurate) algorithms, they use a simple statistical model and let the data do the work.
He gave several examples. The most obvious is the Google spell checker which, using this approach, can guess what you might have meant, even when the words you're looking for don't appear in any dictionary (e.g. http://www.google.com/search?q=rowan+simson).
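To get a feel for how little "algorithm" is needed once you have the data, here's a minimal sketch in Python, in the spirit of Norvig's well-known toy spelling corrector. The corpus file name and the one-edit-away assumption are mine, not anything from the talk: no grammar rules, just word counts from a big text file and a preference for the most common nearby word.

```python
from collections import Counter
import re

# Word frequencies from a large text corpus (file name is illustrative).
def words(text):
    return re.findall(r"[a-z]+", text.lower())

WORDS = Counter(words(open("big_corpus.txt").read()))

def edits1(word):
    """All strings one edit (delete, swap, replace, insert) away from `word`."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    swaps = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + swaps + replaces + inserts)

def correct(word):
    """Pick the most frequent known word among the candidates."""
    candidates = ({word} & WORDS.keys()) or (edits1(word) & WORDS.keys()) or {word}
    return max(candidates, key=lambda w: WORDS[w])

print(correct("speling"))  # -> "spelling", if the corpus contains it
```

Feed it more text and it gets better, without touching the code at all.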
Another is their translation tool, which can be trained to translate any text for which there are enough examples to "learn" from. Ironically, the limiting factor with this approach is now not the software but the quality of the human translators who produce the training examples.
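The same "let the data do the work" idea is easy to sketch for translation too. The toy below just counts which target-language words co-occur with each source word across aligned sentence pairs; real systems (including Google's) use proper statistical alignment models, so treat this purely as an illustration of how the example data, rather than the code, carries the knowledge.

```python
from collections import Counter, defaultdict

# Tiny aligned corpus (illustrative sentence pairs, not real training data).
parallel = [
    ("the house is small", "das haus ist klein"),
    ("the house is big", "das haus ist gross"),
    ("the book is small", "das buch ist klein"),
]

# Count how often each target word appears alongside each source word.
cooc = defaultdict(Counter)
for src, tgt in parallel:
    for s in src.split():
        for t in tgt.split():
            cooc[s][t] += 1

def translate_word(word):
    """Crude 'translation': the target word most often seen with this source word."""
    return cooc[word].most_common(1)[0][0] if word in cooc else word

# With only three sentence pairs, common words like "the" are still ambiguous;
# the statistics only become useful with lots of data, which is Norvig's point.
print([(w, translate_word(w)) for w in "the house is small".split()])
```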
In each case, being able to do this comes down simply to having enough data.
This is one of those ideas that is so obvious once you've seen it:

If you have lots of data, the way you think about algorithms changes completely.