Using large data sets

Peter Norvig (the Director of Research at Google) started off his ETech presentation with a diagram showing how things used to be (back in the old, old days … like 1994):

Data

At the core, in the past, was the algorithm. Inputs were pretty simple (mouse clicks, keyboard entry). Outputs were equally simple (text, user interface). Data was used simply as a store of input and output. All of the effort and focus went into creating smart algorithms.

However, the massive data sets that Google now has access to allow them to flip this model around. Rather than creating complex, elaborate (and probably inaccurate) algorithms by hand, they instead use a simple statistical model and let the data do the work.

He gave several examples. The most obvious is the Google spell checker, which uses this approach to guess what you might have meant, even when the words you’re looking for don’t appear in any dictionary (e.g. http://www.google.com/search?q=rowan+simson).
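To make the spell checker idea a bit more concrete, here is a minimal sketch in Python of a correction routine driven by word counts rather than hand-written rules. It is not Google's implementation. The tiny corpus, the names `CORPUS`, `WORD_COUNTS`, `edits1` and `correct`, and the restriction to a single edit are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy corpus. A real system would count words across
# billions of pages or queries; that volume is what makes this work.
CORPUS = "the quick brown fox jumps over the lazy dog the fox spelling spelling spelled"

WORD_COUNTS = Counter(CORPUS.split())
ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """Return every string that is one edit away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def correct(word):
    """Pick the most frequently seen word within one edit of the input."""
    if word in WORD_COUNTS:
        return word  # already a word we have seen
    candidates = [w for w in edits1(word) if w in WORD_COUNTS]
    if not candidates:
        return word  # nothing similar in the data, so leave it alone
    return max(candidates, key=WORD_COUNTS.__getitem__)


print(correct("speling"))
print(correct("teh"))
```

Notice that there is no dictionary and no grammar in there, only counts. Feed it a bigger corpus and it will start suggesting names and new words without any change to the code.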

Another is their translation tool, which can be trained to translate any text for which there are enough examples to “learn” from. Ironically, the limiting factor with this approach is now not the software but the quality of the human translators used for training.

In each case, being able to do this simply comes down to having enough data.

This is one of those ideas which is so obvious after you’ve seen it:

If you have lots of data the way you think about algorithms changes completely

ETech

I had a blast at ETech in San Diego.

At every turn there was somebody interesting to meet, chat with or listen to.

I took a ton of notes, with good intentions of turning them into a blog post, but that’s not going to work out.

Instead here are four random/great quotes from the week:

  • Steve Cousins, responding to a question about his proposed open source platform for personal robots: “In the civilian robotics area we don’t really use the words ‘killer application’”
  • “Exercise – the poor man’s plastic surgery” – from Kathy Sierra’s slideware.
  • Nat Torkington, via Twitter, in response to an Ignite presentation by Noel Dickover from the US Department of Defense: “General rule for ETech speakers: ‘decreasing the kill chain’ tends not to be the goal of the average attendee.”
  • Tim Ferriss, explaining why his PDA doesn’t have an internet connection: “I don’t trust an inbox in my pocket any more than I trust dark chocolate in my house.”

And, one quirky website:

Finally, if you have six and a half minutes to spare, you should check out Saul Griffith’s Ignite presentation on Howtoons (which I can recommend if you have some young kids to entertain) from the first night of the conference:

http://youtube.com/watch?v=wyLHOTwvzf4

Saul is a super smart but friendly Aussie, who also gave an excellent keynote on Energy Literacy later in the conference. I really hope this will be available online sometime soon so I can link to it. In the meantime, this interview gives you a bit of the flavour:

http://youtube.com/watch?v=wbwxF47x5ss