<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>
<channel>
	<title>Comments on: Using large data sets</title>
	<atom:link href="http://rowansimpson.com/2008/04/16/using-large-data-sets/feed/" rel="self" type="application/rss+xml" />
	<link>http://rowansimpson.com/2008/04/16/using-large-data-sets/</link>
	<description>... is pining for the fjords</description>
	<pubDate>Sun, 06 Jul 2008 06:20:44 +0000</pubDate>
	<generator>http://wordpress.org/?v=MU</generator>
		<item>
		<title>By: Ten years, or less &#171; Rowan Simpson</title>
		<link>http://rowansimpson.com/2008/04/16/using-large-data-sets/#comment-7990</link>
		<dc:creator>Ten years, or less &#171; Rowan Simpson</dc:creator>
		<pubDate>Mon, 30 Jun 2008 21:22:56 +0000</pubDate>
		<guid isPermaLink="false">http://rowan.wordpress.com/?p=373#comment-7990</guid>
		<description>[...]   Published July 1, 2008   Software Development , Technology       Peter Norvig, who I&#8217;ve written about here before, has a number of really interesting articles on his [...]</description>
		<content:encoded><![CDATA[<p>[...]   Published July 1, 2008   Software Development , Technology       Peter Norvig, who I&#8217;ve written about here before, has a number of really interesting articles on his [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Human-Like Memory Capabilities &#171; Sri Spot</title>
		<link>http://rowansimpson.com/2008/04/16/using-large-data-sets/#comment-7965</link>
		<dc:creator>Human-Like Memory Capabilities &#171; Sri Spot</dc:creator>
		<pubDate>Fri, 20 Jun 2008 23:53:24 +0000</pubDate>
		<guid isPermaLink="false">http://rowan.wordpress.com/?p=373#comment-7965</guid>
		<description>[...] Second is the Bayesian association of &#8220;related&#8221; keywords (ex. &#8220;nuclear&#8221; is related to &#8220;radioactive&#8221;) based on mining human-generated content. See http://rowansimpson.com/2008/04/16/using-large-data-sets/ [...]</description>
		<content:encoded><![CDATA[<p>[...] Second is the Bayesian association of &#8220;related&#8221; keywords (ex. &#8220;nuclear&#8221; is related to &#8220;radioactive&#8221;) based on mining human-generated content. See <a href="http://rowansimpson.com/2008/04/16/using-large-data-sets/" rel="nofollow">http://rowansimpson.com/2008/04/16/using-large-data-sets/</a> [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Got Your Number &#171; Rowan Simpson</title>
		<link>http://rowansimpson.com/2008/04/16/using-large-data-sets/#comment-7920</link>
		<dc:creator>Got Your Number &#171; Rowan Simpson</dc:creator>
		<pubDate>Thu, 05 Jun 2008 20:40:29 +0000</pubDate>
		<guid isPermaLink="false">http://rowan.wordpress.com/?p=373#comment-7920</guid>
		<description>[...] this comment from Charles on my &#8220;Using large data sets&#8221; post: &#8220;I reckon one of the most interesting data sets Google are compiling is based on the free [...]</description>
		<content:encoded><![CDATA[<p>[...] this comment from Charles on my &#8220;Using large data sets&#8221; post: &#8220;I reckon one of the most interesting data sets Google are compiling is based on the free [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Ben</title>
		<link>http://rowansimpson.com/2008/04/16/using-large-data-sets/#comment-7690</link>
		<dc:creator>Ben</dc:creator>
		<pubDate>Thu, 17 Apr 2008 20:33:56 +0000</pubDate>
		<guid isPermaLink="false">http://rowan.wordpress.com/?p=373#comment-7690</guid>
		<description>Hey Falafulu.

You may well be right, considering I've never even heard of Tensor calculus! (thanks for the links, by the way).  I guess the only folks who really know Google's secret sauce are the Google engineers, and I doubt they'll be disclosing their methods in detail any time soon.

Most of the applications I was thinking of when I wrote the comment above were fraud or quality detection ones, and it seems to me that this is where they would be applying Bayesian stats approaches (with or without hybridization).  IMO, the fact that Google is able to weed out a good deal of the crap from its results despite the constant gaming that is going on is a significant contributor to its success.  

Returning results to a search on the fly (i.e., picking best matches, assigning rank based on quality score etc.) is a different kettle of fish - I have no idea what they would be doing there.  But if there is any opportunity to represent the universe of web pages in a similar way to how Einstein represented the physical universe, that's likely to be too tempting for the geeks at Google to ignore!</description>
		<content:encoded><![CDATA[<p>Hey Falafulu.</p>
<p>You may well be right, considering I&#8217;ve never even heard of Tensor calculus! (thanks for the links, by the way).  I guess the only folks who really know Google&#8217;s secret sauce are the Google engineers, and I doubt they&#8217;ll be disclosing their methods in detail any time soon.</p>
<p>Most of the applications I was thinking of when I wrote the comment above were fraud or quality detection ones, and it seems to me that this is where they would be applying Bayesian stats approaches (with or without hybridization).  IMO, the fact that Google is able to weed out a good deal of the crap from its results despite the constant gaming that is going on is a significant contributor to its success.  </p>
<p>Returning results to a search on the fly (i.e., picking best matches, assigning rank based on quality score etc.) is a different kettle of fish - I have no idea what they would be doing there.  But if there is any opportunity to represent the universe of web pages in a similar way to how Einstein represented the physical universe, that&#8217;s likely to be too tempting for the geeks at Google to ignore!</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Falafulu Fisi</title>
		<link>http://rowansimpson.com/2008/04/16/using-large-data-sets/#comment-7689</link>
		<dc:creator>Falafulu Fisi</dc:creator>
		<pubDate>Thu, 17 Apr 2008 09:22:07 +0000</pubDate>
		<guid isPermaLink="false">http://rowan.wordpress.com/?p=373#comment-7689</guid>
		<description>Ben said...
&lt;i&gt;Bayesian statistics is at the heart of much of Google’s success.&lt;/i&gt;

Umm, I doubt it. If they use it, then it is being applied to low-level applications. Also, if they use it, then it is very likely to be a hybrid Bayes (with neural networks, support vector machines, rough sets, etc...) that they developed.  Hybridising Bayes with other algorithms decreases its classification error rate. That means the hybrid version is more robust than, and outperforms, the stand-alone version.

The latest state-of-the-art techniques to emerge from research are skewing heavily towards numerical algebra algorithms. Some of these algorithms were highlighted at the &lt;a href="http://www.stanford.edu/group/mmds/mmds2006.html" rel="nofollow"&gt;Workshop on Algorithms for Modern Massive Data Sets&lt;/a&gt; (MMDS 2006), held at Stanford in 2006 and attended by reps from Google, Microsoft, Yahoo, IBM and others.

The interesting thing I have noted about some of the emerging algorithms highlighted at the MMDS 2006 workshop is the resurrection of the 100 year old &lt;a href="http://en.wikipedia.org/wiki/Tensor" rel="nofollow"&gt;Tensor Calculus&lt;/a&gt;, which is multi-dimensional algebra.  Einstein developed his theory of general relativity using &lt;i&gt;Tensor Calculus&lt;/i&gt; in 1916, but since then tensors haven't been much used in any field other than theoretical physics. Data analysts and mathematicians have only recently discovered their huge potential in applications to the analysis of massive datasets.

Current search engines, whether link-based (Google PageRank) or content-based (word-document search), use a 2-dimensional matrix (rows &#38; columns of numerical frequency data). In PageRank, the input data is a 2D matrix of outward links by inward links; content-based search is similar, with a 2D matrix of words-by-documents frequencies as the input. With the emergence of multi-linear algebra techniques such as Tensor calculus, the input can be a 3D, 4D, 5D matrix, etc...  A 4D matrix could be in-links by out-links by words by documents (pages). Tensor methods can solve this at once, instead of the current techniques that solve two 2D matrices with 2 different algorithms. Solving different metrics (dimensions or attributes) at once is more robust than solving each separately.

I wouldn't be surprised if Google is already working on bringing Tensor to its search engine (or perhaps they already have). I am sure that others, such as Microsoft, Yahoo, etc., must be on to it as well. An interesting future for web search is coming.</description>
		<content:encoded><![CDATA[<p>Ben said&#8230;<br />
<i>Bayesian statistics is at the heart of much of Google’s success.</i></p>
<p>Umm, I doubt it. If they use it, then it is being applied to low-level applications. Also, if they use it, then it is very likely to be a hybrid Bayes (with neural networks, support vector machines, rough sets, etc&#8230;) that they developed.  Hybridising Bayes with other algorithms decreases its classification error rate. That means the hybrid version is more robust than, and outperforms, the stand-alone version.</p>
<p>The latest state-of-the-art techniques to emerge from research are skewing heavily towards numerical algebra algorithms. Some of these algorithms were highlighted at the <a href="http://www.stanford.edu/group/mmds/mmds2006.html" rel="nofollow">Workshop on Algorithms for Modern Massive Data Sets</a> (MMDS 2006), held at Stanford in 2006 and attended by reps from Google, Microsoft, Yahoo, IBM and others.</p>
<p>The interesting thing I have noted about some of the emerging algorithms highlighted at the MMDS 2006 workshop is the resurrection of the 100 year old <a href="http://en.wikipedia.org/wiki/Tensor" rel="nofollow">Tensor Calculus</a>, which is multi-dimensional algebra.  Einstein developed his theory of general relativity using <i>Tensor Calculus</i> in 1916, but since then tensors haven&#8217;t been much used in any field other than theoretical physics. Data analysts and mathematicians have only recently discovered their huge potential in applications to the analysis of massive datasets.</p>
<p>Current search engines, whether link-based (Google PageRank) or content-based (word-document search), use a 2-dimensional matrix (rows &amp; columns of numerical frequency data). In PageRank, the input data is a 2D matrix of outward links by inward links; content-based search is similar, with a 2D matrix of words-by-documents frequencies as the input. With the emergence of multi-linear algebra techniques such as Tensor calculus, the input can be a 3D, 4D, 5D matrix, etc&#8230;  A 4D matrix could be in-links by out-links by words by documents (pages). Tensor methods can solve this at once, instead of the current techniques that solve two 2D matrices with 2 different algorithms. Solving different metrics (dimensions or attributes) at once is more robust than solving each separately.</p>
<p>I wouldn&#8217;t be surprised if Google is already working on bringing Tensor to its search engine (or perhaps they already have). I am sure that others, such as Microsoft, Yahoo, etc., must be on to it as well. An interesting future for web search is coming.</p>
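<p>To make the 2D link matrix mentioned above concrete, here is a minimal sketch of PageRank-style power iteration over a tiny pages-by-pages matrix. The four-page graph, the function name pagerank, the damping value of 0.85 and the iteration count are all assumptions chosen for illustration; this is not a description of what Google actually runs. A tensor-based method would work on a higher-order array rather than this 2D list, and is not shown here.</p>

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a pages-by-pages link matrix.

    links[i][j] == 1 means page i links out to page j.
    The damping factor and iteration count are illustrative assumptions.
    """
    n = len(links)
    rank = [1.0 / n] * n  # start with rank spread evenly
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = [(1.0 - damping) / n] * n
        for i, row in enumerate(links):
            out_degree = sum(row)
            if out_degree == 0:
                # Dangling page: spread its rank evenly over every page.
                for j in range(n):
                    new_rank[j] += damping * rank[i] / n
            else:
                # Split this page's rank equally among the pages it links to.
                for j, linked in enumerate(row):
                    if linked:
                        new_rank[j] += damping * rank[i] / out_degree
        rank = new_rank
    return rank


# Hypothetical four-page web, used only for this example.
LINKS = [
    [0, 1, 1, 0],  # page 0 links to pages 1 and 2
    [0, 0, 1, 0],  # page 1 links to page 2
    [1, 0, 0, 0],  # page 2 links to page 0
    [0, 0, 1, 0],  # page 3 links to page 2
]

ranks = pagerank(LINKS)
for page, score in enumerate(ranks):
    print(page, round(score, 4))
```

<p>In this toy graph the page with the most in-links (page 2) ends up with the highest score, and the page nobody links to (page 3) ends up with the lowest.</p>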
]]></content:encoded>
	</item>
	<item>
		<title>By: Charles</title>
		<link>http://rowansimpson.com/2008/04/16/using-large-data-sets/#comment-7686</link>
		<dc:creator>Charles</dc:creator>
		<pubDate>Wed, 16 Apr 2008 00:42:17 +0000</pubDate>
		<guid isPermaLink="false">http://rowan.wordpress.com/?p=373#comment-7686</guid>
		<description>I reckon one of the most interesting data sets Google are compiling is based on the free IVR directory service they've built for the US (Goog411).

They're compiling a huge set of voice samples which must be a powerful way to train their speech recognition...which in turn could power the next generation of contextual advertising. AdSense for video is kinda hit and miss at the moment, but imagine if it was as good as the targeting for regular site ads.</description>
		<content:encoded><![CDATA[<p>I reckon one of the most interesting data sets Google are compiling is based on the free IVR directory service they&#8217;ve built for the US (Goog411).</p>
<p>They&#8217;re compiling a huge set of voice samples which must be a powerful way to train their speech recognition&#8230;which in turn could power the next generation of contextual advertising. AdSense for video is kinda hit and miss at the moment, but imagine if it was as good as the targeting for regular site ads.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Kristin</title>
		<link>http://rowansimpson.com/2008/04/16/using-large-data-sets/#comment-7685</link>
		<dc:creator>Kristin</dc:creator>
		<pubDate>Wed, 16 Apr 2008 00:18:27 +0000</pubDate>
		<guid isPermaLink="false">http://rowan.wordpress.com/?p=373#comment-7685</guid>
		<description>Interesting post! Thanks.

"...they instead use a simple statistical model and let the data do the work."

Another tool needing lots of data is Google Trends:
http://www.google.com/trends?q=trademe%2C+trade+and+exchange&#38;ctab=0&#38;geo=NZ&#38;geor=all&#38;date=all&#38;sort=0</description>
		<content:encoded><![CDATA[<p>Interesting post! Thanks.</p>
<p>&#8220;&#8230;they instead use a simple statistical model and let the data do the work.&#8221;</p>
<p>Another tool needing lots of data is Google Trends:<br />
<a href="http://www.google.com/trends?q=trademe%2C+trade+and+exchange&amp;ctab=0&amp;geo=NZ&amp;geor=all&amp;date=all&amp;sort=0" rel="nofollow">http://www.google.com/trends?q=trademe%2C+trade+and+exchange&amp;ctab=0&amp;geo=NZ&amp;geor=all&amp;date=all&amp;sort=0</a></p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Ben</title>
		<link>http://rowansimpson.com/2008/04/16/using-large-data-sets/#comment-7684</link>
		<dc:creator>Ben</dc:creator>
		<pubDate>Tue, 15 Apr 2008 23:58:37 +0000</pubDate>
		<guid isPermaLink="false">http://rowan.wordpress.com/?p=373#comment-7684</guid>
		<description>Nice.  

From what I can tell, Bayesian statistics is at the heart of much of Google's success.  They'll be using it for fraud/spam detection (what's the probability this page is spam given it has the following attributes?), AdSense rankings (the dreaded quality score), behavioural targeting, spelling correction, etc etc.  

It's both scalable and adaptive - you just need to feed it lots of new data and have your engineers tweak the algo every now and then as they learn more about attributes that are good discriminators... and even that process is informed mainly by data mining.

If you want to see something crazy cool the folks at Accenture labs  have done with Bayesian classifiers and public auction data, have a look at the paper linked here:
http://www.accenture.com/NR/exeres/F0469E82-E904-4419-B34F-88D4BA53E88E.htm</description>
		<content:encoded><![CDATA[<p>Nice.  </p>
<p>From what I can tell, Bayesian statistics is at the heart of much of Google&#8217;s success.  They&#8217;ll be using it for fraud/spam detection (what&#8217;s the probability this page is spam given it has the following attributes?), AdSense rankings (the dreaded quality score), behavioural targeting, spelling correction, etc etc.  </p>
<p>It&#8217;s both scalable and adaptive - you just need to feed it lots of new data and have your engineers tweak the algo every now and then as they learn more about attributes that are good discriminators&#8230; and even that process is informed mainly by data mining.</p>
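<p>For anyone wondering what "the probability this page is spam given its attributes" looks like in practice, here is a minimal naive Bayes sketch. Everything in it is an assumption made for illustration: the attribute names, the six training pages, and the function names train and spam_probability. It only scores attributes that are present, assumes they are independent given the class, and assumes both classes appear in the training data. It says nothing about how Google really does it.</p>

```python
import math
from collections import Counter


def train(pages):
    """Count attribute occurrences per class from (attributes, is_spam) pairs."""
    counts = {True: Counter(), False: Counter()}
    totals = {True: 0, False: 0}
    for attributes, is_spam in pages:
        totals[is_spam] += 1
        counts[is_spam].update(set(attributes))
    return counts, totals


def spam_probability(attributes, counts, totals):
    """P(spam | attributes) via Bayes' rule with a naive independence assumption."""
    n = totals[True] + totals[False]
    log_score = {}
    for label in (True, False):
        # Prior: fraction of training pages with this label.
        score = math.log(totals[label] / n)
        for attr in set(attributes):
            # Laplace smoothing (+1 / +2) so unseen attributes never give zero.
            p = (counts[label][attr] + 1) / (totals[label] + 2)
            score += math.log(p)
        log_score[label] = score
    # Normalise the two joint scores into a posterior probability.
    top = max(log_score.values())
    spam = math.exp(log_score[True] - top)
    ham = math.exp(log_score[False] - top)
    return spam / (spam + ham)


# Hypothetical labelled pages: (attributes seen on the page, is it spam?)
TRAINING = [
    (["hidden_text", "keyword_stuffing"], True),
    (["keyword_stuffing", "link_farm"], True),
    (["hidden_text", "link_farm"], True),
    (["original_content", "inbound_links"], False),
    (["original_content"], False),
    (["inbound_links", "keyword_stuffing"], False),
]

counts, totals = train(TRAINING)
suspicious = spam_probability(["hidden_text", "link_farm"], counts, totals)
ordinary = spam_probability(["original_content"], counts, totals)
print(round(suspicious, 3), round(ordinary, 3))
```

<p>Feeding it more labelled pages just means calling train again on the larger list, which is the "scalable and adaptive" part: the model itself stays simple and the data does the work.</p>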
<p>If you want to see something crazy cool the folks at Accenture labs  have done with Bayesian classifiers and public auction data, have a look at the paper linked here:<br />
<a href="http://www.accenture.com/NR/exeres/F0469E82-E904-4419-B34F-88D4BA53E88E.htm" rel="nofollow">http://www.accenture.com/NR/exeres/F0469E82-E904-4419-B34F-88D4BA53E88E.htm</a></p>
]]></content:encoded>
	</item>
</channel>
</rss>
