Mike Bazeley’s article, “Diving deep into the Web,” profiles a Bay Area start-up named Glenbrook Networks that is developing technology to crawl material that’s hidden in deep/invisible web databases. Specifically, their technology will be able to automatically complete online forms and then extract the data that’s returned on a results page.
Komissarchik and her father, Edward Komissarchik, say they have figured out how to analyze the forms on Web pages and understand the type of information the sites are looking for. Then, Glenbrook’s Web crawlers use artificial intelligence to walk themselves through sometimes complex Web forms, answering questions, such as the location of their desired job, in the same way a human would.
I haven’t personally seen the Glenbrook technology in action, but I’ve been reading about similar types of automated deep web database extraction for many years.
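To make the general idea concrete, here is a minimal Python sketch of form analysis. It is not Glenbrook’s method (the article doesn’t describe their implementation); it only shows the basic steps of reading a form’s fields and generating the queries a crawler could submit. The sample form, the field names, the `guesses` dictionary, and the assumption that the form submits via GET are all hypothetical.

```python
from html.parser import HTMLParser
from itertools import product
from urllib.parse import urlencode


class FormFieldParser(HTMLParser):
    """Collect the action URL and the fields of the first <form> on a page."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = {}       # field name -> allowed values ([] means free text)
        self._in_form = False
        self._done = False     # True once the first form has been closed
        self._select = None    # name of the <select> currently being read

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = a.get("action", "")
        elif not self._in_form:
            return
        elif tag == "input" and a.get("name") and a.get("type", "text") == "text":
            self.fields[a["name"]] = []
        elif tag == "select" and a.get("name"):
            self._select = a["name"]
            self.fields[self._select] = []
        elif tag == "option" and self._select and a.get("value"):
            self.fields[self._select].append(a["value"])

    def handle_endtag(self, tag):
        if tag == "form":
            self._in_form = False
            self._done = True
        elif tag == "select":
            self._select = None


def candidate_queries(parser, guesses):
    """Build one GET URL per combination of field values.

    Drop-downs use the values listed in the form; free-text fields use
    the caller's guesses (an assumption: a real crawler has to infer these).
    """
    names = list(parser.fields)
    choices = [parser.fields[n] or guesses.get(n, [""]) for n in names]
    return [
        parser.action + "?" + urlencode(dict(zip(names, combo)))
        for combo in product(*choices)
    ]


# Hypothetical job-search form, invented for illustration.
SAMPLE_FORM = """
<form action="/jobs/search">
  <input type="text" name="location">
  <select name="category">
    <option value="eng">Engineering</option>
    <option value="sales">Sales</option>
  </select>
</form>
"""

parser = FormFieldParser()
parser.feed(SAMPLE_FORM)
urls = candidate_queries(parser, {"location": ["San Jose"]})
for url in urls:
    print(url)
```

The hard part, and presumably where the “artificial intelligence” comes in, is choosing sensible values for the free-text fields. Here they are simply handed in by the caller.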
Here are two research papers that might be of interest:
+ On the Automatic Extraction of Data from the Hidden Web
PDF; 14 pages.
+ Testbed for Information Extraction from Deep Web
Some issues that quickly come to mind about the use of this type of technology are:
+ Legal. Is it legal to extract and repurpose all of the raw info from a database?
Lots of data remains hidden in public record databases. Is it legal to crawl and repurpose this data?
+ Server load.
+ Updates and Recrawl. Info in these databases can be very dynamic.
+ Will the data have the same usability/searchability?
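The server load and recrawl concerns above are the sort of thing a well-behaved crawler can at least partly address in code. Below is a rough Python sketch, using only the standard library, of two common courtesies: checking a site’s robots.txt rules before requesting a URL, and hashing a results page so that a recrawl can tell whether anything changed. The robots.txt content, the `demo-bot` user agent, and the example.com URLs are invented for illustration, and honoring robots.txt does not by itself settle the legal question.

```python
import hashlib
from urllib.robotparser import RobotFileParser


def check_robots(robots_txt, user_agent, url):
    """Return (allowed, crawl_delay) for a URL under the given robots.txt text."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url), rp.crawl_delay(user_agent)


def page_changed(previous_hash, page_bytes):
    """Compare a page against the hash stored from the last crawl.

    Returns (changed, new_hash) so the caller can store new_hash for next time.
    """
    new_hash = hashlib.sha256(page_bytes).hexdigest()
    return new_hash != previous_hash, new_hash


# Hypothetical robots.txt that blocks the search form and asks for a 5-second delay.
SAMPLE_ROBOTS = """
User-agent: *
Crawl-delay: 5
Disallow: /jobs/search
"""

search_ok, delay = check_robots(
    SAMPLE_ROBOTS, "demo-bot", "http://example.com/jobs/search?location=San+Jose"
)
about_ok, _ = check_robots(SAMPLE_ROBOTS, "demo-bot", "http://example.com/about")

first_changed, stored = page_changed(None, b"<html>3 jobs found</html>")
same_changed, _ = page_changed(stored, b"<html>3 jobs found</html>")
new_changed, _ = page_changed(stored, b"<html>4 jobs found</html>")
```

In this sketch the form’s results URL is off limits to the crawler while the ordinary page is not, and only the recrawl that returns different content is flagged as changed.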
Btw, on a somewhat related note: the article mentions that this technology could be used to find job listings on company web sites. It’s worth noting that although Indeed.com and SimplyHired.com don’t extract job listings from corporate “databases,” THEY DO crawl job listings posted directly to company web sites. For example, Indeed.com recently announced that they’re now crawling the career pages of most of the Fortune 500 companies. At SimplyHired.com you’re able to limit your search to job postings from companies that appear on several business ranking lists.
Postscript: The article notes that Google provides access to bibliographic records from OCLC’s WorldCat. Yahoo also offers access to this material. In fact, they even offer a special, co-branded version of the Yahoo Toolbar that allows you to get at this data quickly.