Comments on Mattishness: Why Gnip Will Displace Google

I'm skeptical. Dave Winer's weblogs.com launched ...

2008-09-15T08:48:00.000-07:00

I'm skeptical. Dave Winer's weblogs.com launched in 1999 and started centralized ping-based aggregation back in 2001. Dave Sifry's Technorati.com has five years of experience doing ping-based search, and it's still going strong - but it shows no signs of displacing Google.

I see several obstacles to taking over the search market in this way:

1. For the majority of searches, I want the most relevant content, not the most recent content. Indexing something 30 minutes sooner than Google doesn't add any value to me as a user.

2. Too many sites to ping. Is my weblog even set up to ping Gnip? With a dozen major ping aggregators, it's hard for a weblog owner to get full coverage. There are multiplexers that promise to keep track of all the ping sites for you... but now there are a dozen of those to choose from, too.

3. No matter how easy you make it to ping, most sites just won't do it.

4. If your adoption rate ever did skyrocket, you'd have a trickier scalability problem than passive indexers do.

5. As you noted, Google already has its own ping-based systems, plus intelligent prediction of polling intervals, plus opt-in "sitemaps" for finer-grained control. Google has no problem indexing content less than five minutes after it's published. And they have plenty more engineering muscle to throw at this problem if they decide it's important. (Trivia: One of my friends here at Google Kirkland works on the "fresh search" system.) So I'm not sure how this will give Gnip - or anyone else - a unique advantage over the incumbent.
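As a rough illustration of what "intelligent prediction of polling intervals" in point 5 could mean, here is a minimal sketch of an adaptive poller. It is not Google's method, which the comment does not describe. The function name `next_interval`, the halve/double rule, and the bounds are all assumptions made up for this example.

```python
def next_interval(current, changed, lo=60, hi=86400):
    """Return the next polling interval in seconds.

    Hypothetical rule: poll twice as often after finding new content,
    half as often after finding none, clamped to [lo, hi].
    """
    proposed = current / 2 if changed else current * 2
    return max(lo, min(hi, proposed))


# A site that keeps publishing gets polled more and more often...
busy = 3600
for _ in range(3):
    busy = next_interval(busy, changed=True)

# ...while a dormant site drifts toward the daily ceiling.
quiet = 3600
for _ in range(10):
    quiet = next_interval(quiet, changed=False)
```

With a rule like this, a frequently updated site converges on short intervals without any ping from the publisher, which is one way a passive indexer can narrow the freshness gap.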

"Contrastingly, if you subscribe to a blog, you ge...

2008-09-14T23:34:00.000-07:00

"Contrastingly, if you subscribe to a blog, you get pushed a notification whenever that blog is updated"

This is incorrect: your RSS reader, whether it is a web reader or a client reader, pulls updates by repeatedly polling the RSS feed and checking for changes. This is enough of a problem for big blog aggregators like FriendFeed that they actually proposed a way to do a "mass check". Note, though, that this would still be a pull, not a push from the blog to FriendFeed or anything of that sort.
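To make the pull model concrete, here is a minimal sketch of what a feed reader does on each polling cycle. The `FeedPoller` class and the in-memory `feed` dictionary are hypothetical stand-ins for a real reader and a real HTTP request. Comparing content hashes is just one simple way to detect a change.

```python
import hashlib


class FeedPoller:
    """Pull model: the reader fetches the feed and compares it to last time."""

    def __init__(self, fetch):
        self.fetch = fetch        # callable returning the feed body as a str
        self.last_digest = None   # fingerprint of the last body we saw

    def poll(self):
        """Fetch once; return True if the feed changed since the last poll."""
        body = self.fetch()
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        changed = digest != self.last_digest
        self.last_digest = digest
        return changed


# Hypothetical in-memory feed standing in for an HTTP request.
feed = {"body": "<rss><item>post 1</item></rss>"}
poller = FeedPoller(lambda: feed["body"])

results = [poller.poll()]      # first poll: everything is new
results.append(poller.poll())  # nothing was published: a wasted request
feed["body"] = "<rss><item>post 2</item><item>post 1</item></rss>"
results.append(poller.poll())  # the reader notices the update on its own schedule
```

The blog never contacts the reader. The reader only learns about the second post the next time it happens to ask.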

I do not know how much of a background you have in distributed systems, but the pull model, while somewhat inefficient, is far more resilient to faults than a push. The only place where pushes are much better is where you can afford to drop a few messages but it's really important (latency-wise) to get updates as fast as you can. Look at implementations of publish/subscribe systems, which do exactly that; these, too, are enormously complicated because they try very hard to never drop a message.
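The fault-tolerance point can be shown with a toy simulation. This is a sketch under simplifying assumptions: the push is fire-and-forget with no retries or queues, and every class name here is invented for illustration.

```python
class Publisher:
    def __init__(self):
        self.posts = []        # the full history stays on the publisher
        self.subscribers = []

    def publish(self, post):
        self.posts.append(post)
        for sub in self.subscribers:
            sub.receive(post)  # naive push: one attempt, no retry


class PushSubscriber:
    def __init__(self):
        self.online = True
        self.seen = []

    def receive(self, post):
        if self.online:
            self.seen.append(post)  # a push that arrives while offline is lost


class PullSubscriber:
    def __init__(self, publisher):
        self.publisher = publisher
        self.online = True
        self.seen = []

    def poll(self):
        if self.online:
            self.seen = list(self.publisher.posts)  # re-read the whole history


pub = Publisher()
pusher = PushSubscriber()
puller = PullSubscriber(pub)
pub.subscribers.append(pusher)

pub.publish("a")
puller.poll()
pusher.online = puller.online = False  # both subscribers go down
pub.publish("b")
puller.poll()                          # no-op while offline
pusher.online = puller.online = True   # both recover
pub.publish("c")
puller.poll()
```

After the outage the pull subscriber has all three posts, while the push subscriber is missing "b". Closing that gap is what forces real publish/subscribe systems to add acknowledgements, retries, and durable queues.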

Finally, the best pusher is not necessarily the answer you're looking for. All it helps you do is reduce the latency from publishing to indexing / serving. Usually, the push/pull problem is not really the bottleneck - so your proposal, or conjecture, in effect solves the wrong problem.

Rubbish.

2008-09-14T17:00:00.000-07:00

Rubbish.

I was just thinking yesterday about what push mean...

2008-09-14T16:01:00.000-07:00

I was just thinking yesterday about what push means for search. Like the first commenter, though, I don't see an obvious argument that push will do anything to displace Google.

Here are a few areas push may improve:

1. Efficiency. As you mention, it's much more efficient to receive notifications than to poll/spider. If Gnip or a similar service could do search at Google scale for a fraction of the cost you might be able to make an economic argument that they could undercut Google in pricing. But first you'll have to convince me that anyone will use Gnip for search.

2. Latency. Timely results are important in search, and push can certainly lead to fast updates. Google seems to do fine here today with a combination of frequent polling for popular sites and pings from blogs and smaller sites.

3. Relevance. Aggregation puts the data that are important to a user in one place. This could imply these data are more relevant to a user (they may be in some cases), but the needle in the haystack is not necessarily in the aggregated set.

Fundamentally, push is a delivery mechanism, so I think efficiency and latency are the obvious benefits. If push can help relevance, that's great, but I don't know enough about what the data flowing through Gnip will look like to know whether it can really help.

Once the search engine has the data, it still has ...

2008-09-14T15:07:00.000-07:00

Once the search engine has the data, it still has to decide what about the page is important, and how to index it, and rank it, and allow searches, and present the results in a usable way, and show cached/highlighted versions of the pages, and allow lawyers to take down copyright violators, and ...

Out of the 20 things a search engine needs to do, "Gnip" removes the simplest one. If we became "a Gnip world", Google would have 4 guys write a Gnip receiver and call it a day, and a new search startup would have to write 19 other pieces of a search engine.

Gnip might make it marginally easier to displace Google, but you haven't shown why Google would be unable to profit from this, or why I would switch from Google because of it.

Matt -- Thanks a ton for such a positive post. I'...

2008-09-14T14:03:00.000-07:00

Matt --

Thanks a ton for such a positive post. I'll leave the grandiose predictions for other people and just remain happy if we continue to grow at our current pace. With the release of our data platform in a few weeks, which alleviates the need to build any sort of backend poller, I'm hoping that people will be able to wake up with killer ideas, use Gnip to aggregate relevant data and immediately begin focusing on the front end.

Cheers, amigo!