Automating the Mining of the Deep Web

Mike Bazeley’s article, “Diving deep into the Web,” profiles a Bay Area start-up named Glenbrook Networks that is developing technology to crawl material that’s hidden in deep/invisible web databases. Specifically, their technology is designed to automatically complete online forms and then extract the data that’s returned on a results page.

Komissarchik and her father, Edward Komissarchik, say they have figured out how to analyze the forms on Web pages and understand the type of information the sites are looking for. Then, Glenbrook’s Web crawlers use artificial intelligence to walk themselves through sometimes complex Web forms, answering questions, such as the location of their desired job, in the same way a human would.
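
To make the general idea a bit more concrete, here is a rough sketch in Python of what "reading a form and filling it in" can look like at its very simplest. To be clear, this is not Glenbrook's method (the article doesn't describe how it works internally); it's only a keyword-matching toy. The sample form, the `HINTS` table, the `profile` dictionary, and the function names are all made up for illustration, and a real crawler would need far more than this to cope with complex forms.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin

# Hypothetical sample page; real forms are usually much messier.
SAMPLE_HTML = """
<form action="/jobs/search" method="get">
  <input type="text" name="job_title">
  <input type="text" name="city">
  <input type="hidden" name="lang" value="en">
  <input type="submit" value="Search">
</form>
"""

# Assumed mapping from "what the crawler wants" to substrings that
# tend to show up in field names. Purely illustrative.
HINTS = {
    "location": ("city", "location", "zip"),
    "keywords": ("keyword", "query", "title"),
}


class FormFieldParser(HTMLParser):
    """Collect the action URL and named fields of the first <form>."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = []  # field names, in document order
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = attrs.get("action", "")
        elif self._in_form and tag in ("input", "select", "textarea"):
            name = attrs.get("name")
            # Skip buttons; they carry no query information.
            if name and attrs.get("type") not in ("submit", "button", "reset"):
                self.fields.append(name)

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True


def guess_values(fields, profile):
    """Fill each field whose name contains a known hint substring."""
    values = {}
    for name in fields:
        lowered = name.lower()
        for key, hints in HINTS.items():
            if key in profile and any(h in lowered for h in hints):
                values[name] = profile[key]
                break
    return values


def build_query_url(base_url, html, profile):
    """Return the GET URL a crawler could request for this form."""
    parser = FormFieldParser()
    parser.feed(html)
    values = guess_values(parser.fields, profile)
    return urljoin(base_url, parser.action) + "?" + urlencode(values)


if __name__ == "__main__":
    profile = {"location": "San Jose", "keywords": "software engineer"}
    print(build_query_url("http://example.com/careers/", SAMPLE_HTML, profile))
```

Fields the sketch can't interpret (like the hidden `lang` field here) are simply left out, which is exactly the kind of gap that more sophisticated form analysis would have to close.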

I haven’t personally seen the Glenbrook technology in action, but I’ve been reading about similar types of automated deep web database extraction for many years.

One company that was doing work in this area was WhizBang Labs, whose technology was acquired by InXight in 2002.

More Reading
Here are two research papers that might be of interest:
+ On the Automatic Extraction of Data from the Hidden Web
PDF; 14 pages.
+ Testbed for Information Extraction from Deep Web

Some issues that quickly come to mind about the use of this type of technology are:
+ Legal. Is it legal to extract and repurpose all of the raw info from a database?
Lots of data remains hidden in public record databases. Is it legal to crawl and repurpose this data?
+ Server load.
+ Updates and Recrawl. Info in these databases can be very dynamic.
+ Will the data have the same usability/searchability?

Btw, on a somewhat related note: the article mentions that this technology could be used to find job listings on company web sites. It’s worth noting that although and don’t extract job listings from corporate “databases,” THEY DO crawl job listings posted directly to company web sites. For example, recently announced that they’re now crawling the career pages of most of the Fortune 500 companies. At you’re able to limit your search to job postings from companies that appear on several business ranking lists.

Postscript: The article notes that Google provides access to bibliographic records from OCLC’s WorldCat. Yahoo also offers access to this material. In fact, they even offer a special, co-branded version of the Yahoo Toolbar that allows you to get at this data quickly.

PPS: Other companies doing work in mining and providing access to the deep web include long-time player BrightPlanet and Deep Web Technologies, whose technology powers the portal.
