Sunday, September 14, 2008

Why Gnip Will Displace Google

Whether search is 90% solved or the last 10% will take 90% of the effort (either or both, according to Marissa Mayer), there is a lot of improvement to be had in search. If you honestly want to find authoritative information about a topic that’s been over-SEOed like ‘ring tones’ or ‘mortgages’, or you’re searching for a semantically challenging term like ‘bush’, or you’re looking for something particularly esoteric, Google leaves you with a lot of cruft to wade through.

Particularly in technology and the Internet, there is no such thing as a permanent monopoly. Eventually someone will challenge Google in core search and start taking its market share.
It sure won’t be Cuil, now that they are in self-destruct mode. It won’t be Powerset, now that their assets have been assimilated. But it just may be Gnip, Eric Marcoullier’s ping server to rule them all. In this recent interview with Om Malik, Eric humbly calls Gnip’s service ‘commodity work’ that takes away some logistical headaches for web service developers. But what Gnip is really doing is fundamentally changing the nature of aggregation and indexing on the web from a pull model to a push model.

Search engines today send out spiders to actively crawl the web and pull content into the index – at so great a cost in overhead that crawling is generally considered to be a powerful barrier to entry in the search market.

Contrastingly, if you subscribe to a blog, you get pushed a notification whenever that blog is updated. With a push model, there is no need to scan through every blog you like to read, as a spider would, to find out which ones have been updated. This push model provides enough of an advantage that webmasters will use Google AdWords & AdSense as a way to ping Google and get new content or sites indexed more quickly than by simply waiting for the spider.

This is not a new idea. Around 2000, visionary Seattle startup 360powered had the same idea, and even managed to perfect an interesting patent enumerating this architecture. Unfortunately, 360powered fell victim to the dot-com crash and their IP ended up auctioned off in bankruptcy.

In a Gnip world, every website would have a feed, and whenever content changes, the index gets pinged. The overhead of crawling is distributed out to the edge, removing the burden from the search engine, and the delay for new content to get indexed drops toward zero.
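
To make the cost difference concrete, here is a minimal sketch of the two models. It is a toy simulation, not Gnip's or Google's actual architecture: the `Site`, `PullIndex`, and `PushIndex` classes are hypothetical, and it assumes exactly one site publishes per round.

```python
from dataclasses import dataclass, field


@dataclass
class Site:
    """A toy website whose content is just a version counter (hypothetical)."""
    url: str
    version: int = 0

    def publish(self) -> None:
        self.version += 1


@dataclass
class PullIndex:
    """Crawler-style index: re-fetches every known site on each pass."""
    seen: dict = field(default_factory=dict)
    fetches: int = 0

    def crawl(self, sites) -> None:
        for site in sites:
            self.fetches += 1  # one request per site, changed or not
            self.seen[site.url] = site.version


@dataclass
class PushIndex:
    """Ping-style index: fetches a site only when that site announces a change."""
    seen: dict = field(default_factory=dict)
    fetches: int = 0

    def ping(self, site) -> None:
        self.fetches += 1
        self.seen[site.url] = site.version


def simulate(num_sites: int = 100, rounds: int = 10):
    sites = [Site(f"site-{i}.example") for i in range(num_sites)]
    pull, push = PullIndex(), PushIndex()
    for r in range(rounds):
        # Assumption: exactly one site publishes per round.
        changed = sites[r % num_sites]
        changed.publish()
        push.ping(changed)  # the publisher notifies the index right away
        pull.crawl(sites)   # the crawler must visit everyone to find the change
    return pull, push


pull, push = simulate()
print(pull.fetches, push.fetches)
```

Under these assumptions the crawler makes one fetch per site per round, while the ping-driven index makes one fetch per actual change. The sketch also shows the catch: the push index only ever learns about sites that ping it.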
Gnip is nothing less than a fundamental cornerstone of the next generation of dominant search.


Eric Marcoullier said...

Matt --

Thanks a ton for such a positive post. I'll leave the grandiose predictions for other people and just remain happy if we continue to grow at our current pace. With the release of our data platform in a few weeks, which alleviates the need to build any sort of backend poller, I'm hoping that people will be able to wake up with killer ideas, use Gnip to aggregate relevant data and immediately begin focusing on the front end.

Cheers, amigo!

Anonymous said...

Once the search engine has the data, it still has to decide what about the page is important, and how to index it, and rank it, and allow searches, and present the results in a usable way, and show cached/highlighted versions of the pages, and allow lawyers to take down copyright violators, and ...

Out of the 20 things a search engine needs to do, "Gnip" removes the simplest one. If we became "a Gnip world", Google would have 4 guys write a Gnip receiver and call it a day, and a new search startup would have to write 19 other pieces of a search engine.

Gnip might make it marginally easier to displace Google, but you haven't shown why Google would be unable to profit from this, or why I would switch from Google because of it.

Adam MacBeth said...

I was just thinking yesterday about what push means for search. Along with the first commenter though, I don't see an obvious argument that push will do anything to displace Google.

Here are a few areas push may improve:

1. Efficiency. As you mention, it's much more efficient to receive notifications than to poll/spider. If Gnip or a similar service could do search at Google scale for a fraction of the cost you might be able to make an economic argument that they could undercut Google in pricing. But first you'll have to convince me that anyone will use Gnip for search.

2. Latency. Timely results are important in search, and push can certainly lead to fast updates. Google seems to do fine here today with a combination of frequent polling for popular sites and pings from blogs and smaller sites.

3. Relevance. Aggregation puts the data that are important to a user in one place. This could imply these data are more relevant to a user (they may be in some cases), but the needle in the haystack is not necessarily in the aggregated set.

Fundamentally push is a delivery mechanism so I think efficiency and latency are the obvious benefits. If push can help relevance that's great, but I don't know enough about what the data flowing through Gnip will look like to know whether it can really help.

::rushabh:: said...

"Contrastingly, if you subscribe to a blog, you get pushed a notification whenever that blog is updated"

This is incorrect - your RSS reader, be it a web reader or a client reader, pulls this by constantly polling the said RSS feed and checking for updates. This is so much of a problem for big blog aggregators like FriendFeed that those guys actually proposed a way to do a "mass check". Still, note that this would be a pull, not a push from the blog to FF or anything of that sort.
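
As a rough illustration of what that polling looks like, here is a small sketch with no real network I/O. The `Feed` and `PollingReader` classes are hypothetical stand-ins, and `fetch` is only loosely modeled on an HTTP conditional GET; it is not how any particular reader is implemented.

```python
class Feed:
    """Stand-in for an RSS feed on a server (no real network I/O)."""

    def __init__(self):
        self.entries = []

    def publish(self, title):
        self.entries.append(title)

    def fetch(self, known_count):
        # Loosely modeled on a conditional GET: return only what is new,
        # or an empty list when nothing changed since the last poll.
        return self.entries[known_count:]


class PollingReader:
    """Reader that discovers updates only by asking the feed."""

    def __init__(self, feed):
        self.feed = feed
        self.items = []
        self.polls = 0
        self.empty_polls = 0

    def poll(self):
        self.polls += 1
        new = self.feed.fetch(len(self.items))
        if not new:
            self.empty_polls += 1
        self.items.extend(new)
        return new


feed = Feed()
reader = PollingReader(feed)
for tick in range(12):      # the reader polls on every tick
    if tick in (3, 9):      # the blog publishes only twice
        feed.publish(f"post at tick {tick}")
    reader.poll()

print(reader.polls, reader.empty_polls, reader.items)
```

Most of the polls in this toy run come back empty, which is the overhead an aggregator following many feeds has to pay.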

I do not know how much of a background you have in distributed systems, but the pull model, while somewhat inefficient, is far more resilient to faults than a push. The only place where pushes are much better is where you can afford to drop a few messages but it's really important (latency-wise) to get updates as fast as you can. Look at implementations of publish/subscribe systems which do exactly that, but these too are enormously complicated because they try very hard to never drop a message.
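
A toy sketch of that resilience argument, assuming a fire-and-forget push with no acknowledgements or retries and a network that is down for a few ticks. The `run` function and its parameters are made up for illustration.

```python
def run(drop_ticks, total_ticks=6):
    """Publish one post per tick; the network is down during drop_ticks."""
    posts = []
    push_view = []  # subscriber fed by fire-and-forget notifications
    pull_view = []  # subscriber that polls the publisher's full state
    for tick in range(total_ticks):
        post = f"post-{tick}"
        posts.append(post)
        if tick in drop_ticks:
            continue                 # both the ping and the poll are lost
        push_view.append(post)       # push delivers only this one message
        pull_view = list(posts)      # pull re-reads state, so it catches up
    return posts, push_view, pull_view


posts, push_view, pull_view = run(drop_ticks={1, 2})
print(push_view)
print(pull_view)
```

In this run the push subscriber permanently misses the posts published during the outage, while the poller recovers them on its next successful poll. Closing that gap on the push side means adding acknowledgements, retries, and replay, which is where the complexity comes from.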

Finally, the best pusher is not necessarily the answer you're looking for. All it helps you do is reduce the latency from publishing to indexing / serving. Usually, the push/pull problem is not really the bottleneck - so your proposal, or conjecture, in effect solves the wrong problem.

Matt Brubeck said...

I'm skeptical. Dave Winer's Weblogs.com launched in 1999 and started centralized ping-based aggregation back in 2001. Dave Sifry's Technorati has five years of experience doing ping-based search, and it's still going strong - but it shows no signs of displacing Google.

I see several obstacles to taking over the search market in this way:

1. For the majority of searches, I want the most relevant content, not the most recent content. Indexing something 30 minutes sooner than Google doesn't add any value to me as a user.

2. Too many sites to ping. Is my weblog even set up to ping Gnip? With a dozen major ping aggregators, it's hard for a weblog owner to get full coverage. There are multiplexers that promise to keep track of all the ping sites for you... but now there are a dozen of those to choose from, too.

3. No matter how easy you make it to ping, most sites just won't do it.

4. If your adoption rate ever did skyrocket, you'd have a trickier scalability problem than passive indexers do.

5. As you noted, Google already has its own ping-based systems, plus intelligent prediction of polling intervals, plus opt-in "sitemaps" for finer-grained control. Google has no problem indexing content less than five minutes after it's published. And they have plenty more engineering muscle to throw at this problem if they decide it's important. (Trivia: One of my friends here at Google Kirkland works on the "fresh search" system.) So I'm not sure how this will give Gnip - or anyone else - a unique advantage over the incumbent.
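
For readers wondering what "prediction of polling intervals" could look like, here is one simple heuristic, sketched as an assumption rather than a description of Google's actual system: poll a site more often after finding new content and less often after an empty poll. The `next_interval` function and its bounds are hypothetical.

```python
def next_interval(current, changed, minimum=60.0, maximum=86400.0):
    """Return the next polling interval in seconds.

    Halve the interval when the last poll found new content, double it when
    the poll came back empty, and clamp the result to [minimum, maximum].
    """
    proposed = current / 2 if changed else current * 2
    return max(minimum, min(maximum, proposed))


interval = 3600.0
history = []
# Two empty polls followed by a burst of activity on the site.
for changed in [False, False, True, True, True, True, True, True, True]:
    interval = next_interval(interval, changed)
    history.append(interval)

print(history)
```

A quiet site drifts toward the daily maximum, and a busy one converges on the minimum, so the crawler spends most of its fetch budget where content actually changes.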