Cuil, the search engine company that has been in stealth mode, launched today. The big news is that Cuil claims to index three times as many web pages as Google and ten times as many as Microsoft. Unlike other search startups that have tried to build a differentiated search experience on top of small indices (the extreme example being Powerset, which launched on a puny Wikipedia-only index), Cuil has chosen to start big and make that its USP.
Clearly, a claim to the biggest index on the web wasn't going to go unexamined, and sure enough TechCrunch jumped right into the debate with this evaluation. Unfortunately, the analysis in that post is not very insightful.
First up, the TC post compares searches on keywords like "dog" and "apple" between Google and Cuil. Their argument is that Google seems to have the bigger index because the number of results Google reports is seemingly much higher than Cuil's. This argument is wrong because:
- Google and Cuil both produce an estimate of the number of search results, and that estimate is often completely off the mark (the sketch after this list shows why). As one of the comments on the TC post pointed out, Google seems to "lie" about the number: even though Google claims 1.5 billion results for the query "france", it doesn't appear possible to retrieve more than 90,000 of them. Check out this SERP to confirm that.
- Even if Google's and Cuil's numbers were accurate, they would not be an indication of the size of their indices. The search results you see are produced by a search algorithm implemented on TOP of the index. Cuil could have the biggest index but the stupidest search algorithm (which they don't, but just for argument's sake...), and end up fetching very few good results from it. Or, the complete opposite: they could have the biggest index and CHOOSE (for good or bad reasons) to limit the number of results based on some relevance criterion, and end up producing fewer results.
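To see why reported counts are so loose, here's a toy sketch of count estimation by sampling. This is my own illustration, not either engine's actual method (which isn't public), but it captures the spirit of how an engine can produce an "about N results" figure without evaluating the query against every document:

```python
import random

def estimate_result_count(doc_texts, term, sample_fraction=0.01):
    """Extrapolate the number of docs containing `term` from a small
    random sample of the corpus, instead of scanning everything.
    `doc_texts` is a list of page texts (illustrative stand-in for a
    real index)."""
    sample_size = max(1, int(len(doc_texts) * sample_fraction))
    sample = random.sample(doc_texts, sample_size)
    hits = sum(1 for text in sample if term in text.lower().split())
    # Scale the sample hit rate up to the full corpus.
    return int(hits / sample_size * len(doc_texts))
```

The estimate is cheap but noisy, and nothing forces the engine to let you actually page through all the documents it counted, which is how a claimed 1.5 billion can coexist with 90,000 retrievable results.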
So then how does one compare the size of the index? Well, as a user you simply can't. The index constitutes what search engineers call the "backend" of search, i.e., all the infrastructure that doesn't deal with the user query. The "frontend" of the search engine is the part that queries the index with the user's query, then ranks the results and decides how many of the ranked results to show the user. Essentially, the frontend is the ranking and relevance algo; the backend is the data (the index and other data derived from a web crawl by running algorithms that have no notion of a query). If the data is bad, the search results will be bad irrespective of what algo you use. But if the data is great, the search results might still be bad depending on the quality of the ranking/relevance algos.
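To make the split concrete, here's a minimal sketch of a backend that builds an inverted index and a frontend that ranks and truncates on top of it. The data structures are invented for illustration; neither engine's internals are public:

```python
from collections import defaultdict

# --- Backend: query-independent work done over the crawl ---
def build_index(crawled_pages):
    """Inverted index mapping each term to the set of doc ids containing it.
    `crawled_pages` maps doc id -> page text."""
    index = defaultdict(set)
    for doc_id, text in crawled_pages.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

# --- Frontend: query-time retrieval, ranking, and truncation ---
def search(index, relevance, query, min_score=0.5, limit=10):
    """Fetch matching docs, rank by a per-doc relevance score, and drop
    anything below `min_score`. The length of the returned list reflects
    these cutoffs, not the size of the underlying index."""
    terms = query.lower().split()
    if not terms:
        return []
    candidates = set(index.get(terms[0], set()))
    for term in terms[1:]:
        candidates &= index.get(term, set())
    ranked = sorted(candidates, key=lambda d: relevance.get(d, 0.0), reverse=True)
    return [d for d in ranked if relevance.get(d, 0.0) >= min_score][:limit]
```

Two frontends running over the exact same index but with different `min_score` or `limit` values will report very different result counts, which is exactly why those counts say nothing about index size.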
If Cuil claims to have indexed 120 billion docs, I'd simply believe that number; after all, they have no good reason to lie. However, to make the comparison between Google's and Cuil's indices truly meaningful, we'd need to understand a few more things:
- First, we'd need to know the rules these search engines use to canonicalize their URLs and identify duplicate content (a sketch of both steps follows this list): a 120-billion-doc index might hold no more information than a 40-billion-doc index if the former is riddled with duplicate content, whether due to bad canonicalization or an inability to recognize duplicates when the URLs differ.
- Secondly, it would be useful to know how many URLs Google and Cuil looked at before picking the top 40 or 120 billion they decided to index. Google released its number a few days back in a blog post: it looks at close to a trillion unique URLs before picking the top 40 billion. Clearly, if Cuil is picking its 120 billion from a smaller pool, it might be picking the "wrong" 120 billion, and the quality of its index could still be poorer even though it's bigger.
- Finally, it would also help to know the methods by which Google and Cuil pick their top 40 or 120 billion URLs (the second sketch after this list shows what such a selection step might look like). Knowing Google, you'd suspect that this selection involves PageRank as well as many other "signals" that Google has built over time to detect spam, porn, and other junk. What is Cuil using? Do they have their own version of popularity? Perhaps not, as this snippet from their about page indicates: "Rather than rely on superficial popularity metrics, Cuil searches for and ranks pages based on their content and relevance." Basically, if Cuil doesn't have the right signals to pick the most index-worthy documents from among those it has crawled, its index might end up being of poorer quality than Google's despite being bigger.
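To make the canonicalization and duplicate-detection point concrete, here's a minimal sketch. The rules below are my own simplified assumptions; the real rules at Google or Cuil are far richer and proprietary:

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed query parameters that don't change page content (illustrative).
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid"}

def canonicalize(url):
    """Collapse trivially different spellings of the same page to one URL."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    netloc = netloc.lower()
    if netloc.startswith("www."):        # treat www/non-www as the same host
        netloc = netloc[4:]
    params = [(k, v) for k, v in sorted(parse_qsl(query))
              if k not in TRACKING_PARAMS]
    path = path.rstrip("/") or "/"       # ignore trailing slashes
    return urlunsplit((scheme.lower(), netloc, path, urlencode(params), ""))

def content_fingerprint(page_text):
    """Exact-duplicate detection: identical text at different URLs hashes
    to the same value. (Real systems also use fuzzy fingerprints such as
    shingling or SimHash to catch near-duplicates.)"""
    normalized = " ".join(page_text.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()
```

An index that skips these steps can inflate its document count dramatically while adding no new information.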
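And here's a hypothetical version of the selection step from the last two bullets. Every signal name and weight below is invented for illustration, since neither engine discloses its actual criteria:

```python
def admission_score(page):
    """Blend query-independent signals into a single keep-or-drop score.
    `page` is a dict of hypothetical precomputed signals."""
    return (3.0 * page.get("link_popularity", 0.0)    # PageRank-like signal
            - 5.0 * page.get("spam_likelihood", 0.0)  # spam classifier output
            - 5.0 * page.get("porn_likelihood", 0.0)  # content filters
            + 1.0 * page.get("content_quality", 0.0)) # text/markup heuristics

def select_for_index(candidate_pages, capacity):
    """From the full crawl pool (Google: ~1 trillion URLs), keep only the
    top `capacity` pages (Google: ~40 billion) by score."""
    return sorted(candidate_pages, key=admission_score, reverse=True)[:capacity]
```

If Cuil's crawl pool is smaller or its signals weaker, the extra 80 billion documents in its index could be exactly the pages a step like this would have dropped.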
Unfortunately, information on these aspects is unlikely to be forthcoming from either Cuil or Google. And in the absence of such information, it's difficult (if not impossible) for an outsider to compare the raw indices of Google and Cuil; the only comparison that makes sense is over who has the better search engine, and Google clearly wins that battle.
All I can say today is that:
- Cuil has not yet built a better search engine than Google (and here I agree with all the comments on TC).
- Cuil might have built a bigger index than Google, but its ranking and relevance algos aren't surfacing much content from that index, which creates the impression that the index isn't all that big.
Crawling the whole web, building a very large-scale index (even if it's not bigger than Google's), and putting a reasonable (though not yet good enough) search algo on top of it are problems no other search startup has even attempted to tackle, and it's encouraging to see Cuil going after them.
I am also interested to learn more about their claim about lowering the cost of building a search engine: "Much of the secret sauce of Cuil is in the way they index the web and handle actual queries by users. Both are costly to scale, and Cuil claims to have found a way to massively reduce those costs. That allows them to run the search engine a lot cheaper, even at Google-scale should it ever reach that point". That could be a game changer by itself.
All in all, Cuil comes across as a startup worth watching.
This would all seem fine and dandy except they seem to have indexed everything except their own site. I mean, searching for cuil should return something about the search engine, right? Then there's the problem that for tons of search queries you simply get a nothing-found page. WTF? The only thing these people are good at is PR.
Posted by: Bob Vader III | July 28, 2008 at 06:06 PM