Invisible Web Gets Deeper
I've written before about the "invisible web," information that search engines cannot or refuse to index because it is locked up within databases. Now a new survey has attempted to measure how much information exists outside the search engines' reach. The company behind the survey is also offering a solution for those who want to tap into this "hidden" material.
The study, conducted by search company BrightPlanet, estimates that the inaccessible part of the web is about 500 times larger than what search engines already provide access to. To put that another way, Google currently claims to have indexed or know about 1 billion web pages, making it the largest crawler-based search engine, based on reported numbers. Using Google as a benchmark, that means BrightPlanet would estimate there are about 500 billion pages of information available on the web, and only 1/500 of that information can be reached via traditional search engines.
That sounds terrible, but as I've commented numerous times before, the size of a search engine does not necessarily equate to its relevancy or usefulness. Nevertheless, having the ability to search across some of this hidden content in focused ways would be very useful.
For example, assume you wanted to do a trademark search against databases in various parts of the world. It's possible to do this via a web site in the US, and let's assume the same were true for many other countries. A focused search of the invisible web would let you send a query to all these specialized search engines. They would check their own databases, then send back a response to you.
To date, meta search tools like this have been few and far between. Instead, efforts have focused on identifying invisible web content and then directing you to those particular places. For example, at Intelliseek's InvisibleWeb.com web site, you could search for "trademarks," then be shown all the relevant sites that might have hidden information of use. You then have to search these sites individually.
BrightPlanet now has its own resource locator tool called CompletePlanet. Similar to InvisibleWeb.com, you can search there to locate databases offering hidden information. However, BrightPlanet is also going a step beyond by offering its LexiBot search tool. At the moment, this tool looks at your search, then selects the most relevant of 600 different invisible web resources and forwards your query to them. After that, as with traditional meta search tools, the results are returned to you. The plan is for this tool to ultimately query all 40,000 sources shortly, then expand to all 100,000 significant invisible web sites that BrightPlanet estimates exist.
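The select-then-forward approach described above can be sketched in a few lines. This is only an illustration of the general meta search idea, not how LexiBot itself works: the `Source` class, the keyword-overlap scoring, and the fake search functions are all assumptions made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Source:
    """A hypothetical invisible web database with its own search form."""
    name: str
    keywords: frozenset                  # assumed topic terms describing the database
    search: Callable[[str], List[str]]   # stand-in for querying the real site


def relevance(query: str, source: Source) -> int:
    """Score a source by how many query words match its topic keywords."""
    return len(set(query.lower().split()) & source.keywords)


def select_sources(query: str, sources: List[Source], limit: int) -> List[Source]:
    """Keep only sources with some overlap, best matches first, up to `limit`."""
    scored = [(relevance(query, s), s) for s in sources]
    matching = [pair for pair in scored if pair[0] > 0]
    matching.sort(key=lambda pair: pair[0], reverse=True)
    return [source for _, source in matching[:limit]]


def meta_search(query: str, sources: List[Source], limit: int = 3) -> Dict[str, List[str]]:
    """Forward the query to the selected sources in parallel and gather the answers."""
    chosen = select_sources(query, sources, limit)
    if not chosen:
        return {}
    with ThreadPoolExecutor(max_workers=len(chosen)) as pool:
        result_lists = list(pool.map(lambda s: s.search(query), chosen))
    return {s.name: results for s, results in zip(chosen, result_lists)}


def make_fake_source(name: str, keywords: set) -> Source:
    """Build a source whose 'search' just echoes the query (no network access)."""
    return Source(name, frozenset(keywords), lambda q: [f"{name}: result for '{q}'"])
```

Calling `meta_search("trademark search", sources, limit=2)` with a handful of fake sources would query only those whose keywords include "trademark" or "search". A real tool would replace the fake search functions with code that fills in each site's own search form and parses what comes back.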
Don't expect a web-based version of LexiBot to be coming. The time to process the queries from many people, send them out to thousands of search engines and bring back responses would be massive, BrightPlanet says.
"It would be impossible to handle this as a server side," said Mike Bergman, chairman of BrightPlanet. "We're providing the translation [using LexiBot to talk to deep web sources], but the actual work and heavy lifting is being done on the client side."
An Associated Press article (listed below) about the survey circulated widely last week. It makes mention of the "deep web" rather than the invisible web, so I thought it worthwhile to briefly provide some definitions.
I don't know who initially coined the term invisible web, but it has been used for well over a year to describe content that search engines can't access. Intelliseek has embraced the term and incorporated it as the name of its InvisibleWeb.com site.
BrightPlanet is using the term "deep web" as a synonym for "invisible web." The two terms refer to the same concept. BrightPlanet also describes the content currently accessible to search engines as the "surface web."
Now let me make one final distinction. There is also what I'd describe as a "shallow web." These are ordinary web pages that are served out of database systems such as Cold Fusion or Lotus Domino. They are essentially normal, static pages and belong as part of the surface web. However, search engines are generally fearful of indexing these because it's easy for them to accidentally index the same page over and over: the URL might be slightly different each time due to features of the dynamic delivery system.
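To make the duplicate problem concrete, here is a minimal sketch of how a crawler might reduce slightly different URLs to one canonical form before deciding whether it has already seen a page. It is not any search engine's actual method. The list of ignored parameters is an assumption for illustration, as are the function names.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed examples of parameters that vary per visit without changing the page.
IGNORED_PARAMS = {"sessionid", "sid", "cfid", "cftoken"}


def canonical_url(url: str) -> str:
    """Return a normalized form of `url` suitable for duplicate detection."""
    parts = urlsplit(url)
    # Drop per-visit parameters and sort the rest so ordering does not matter.
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    )
    # Lowercase the host and discard any fragment.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, urlencode(kept), "")
    )


def unique_pages(urls):
    """Collapse a list of URLs to one canonical entry per distinct page."""
    seen = set()
    unique = []
    for url in urls:
        key = canonical_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique
```

With this approach, `http://Example.com/page.cfm?id=7&CFID=123&CFTOKEN=abc` and `http://example.com/page.cfm?CFID=999&id=7` reduce to the same address. The hard part in practice is knowing which parameters are safe to ignore, since that differs from one delivery system to the next.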
More and more of the surface web is slipping into the shallow web, as webmasters make use of more advanced document delivery tools. BrightPlanet's solution won't cover this problem, nor have the search engines themselves been very responsive to it. So while the surface is being better covered, and there's now hope to plumb the depths of the deep web, the growing shallow web remains a worrisome issue.
BrightPlanet
The company behind CompletePlanet and LexiBot. The company also aims to offer search solutions to vertical portals.
The Deep Web: Surfacing Hidden Value
BrightPlanet, July 2000
BrightPlanet's analysis of the size of the deep web.
CompletePlanet
Currently lists about 20,000 deep web resources.
InvisibleWeb.com
Intelliseek's catalog of deep web resources.
Search Engine Sizes
A look at the current coverage of the surface web, based on reported sizes, along with links to many articles about size issues. Also has links to my recent size test and a more comprehensive one by Search Engine Showdown.
Study: Web Bigger Than We Think
Associated Press, July 27, 2000
Another take on the deep web survey, with various comments.
Invisible Web & Database Search Engines
A few more links to invisible web resources and information.
Searching the Invisible Web
About.com Web Search Guide, July 26, 2000
Researcher Gary Price and search commentator Chris Sherman recently addressed the annual conference of the Association of Professional Researchers for Advancement on the issue of the invisible web. The link above provides an outline of their points and tips on navigating the deep web more easily.
Internet Exceeds 2 Billion Pages
Cyveillance, July 10, 2000
A new study estimating the size of the surface web to be 2 billion pages.
Search Engines and Dynamic Web Pages
If you have a shallow web problem like I've described above, where search engines can't access your web site, this page discusses the issue in more depth and provides some workarounds. It's available to paid Search Engine Watch "site subscribers."