Mining the Deep Web
Last Updated Jan 2009
By: Mark Bennett & Abe Lederman and Sol Lederman - Deep Web Technologies, LLC - Issue 6 - January / February 2004
[ In this first article in a series, we introduce the deep web and tell you why, as a business or scientific professional, you should care about mining its content. In later articles we will discuss in more depth some of the technical challenges to mining the deep web and how Deep Web Technologies and other companies are meeting those challenges. ]
The Internet is vast and growing - that's not news. Google does a great job of finding good information within it - that's not news either. What is news, and one of the dirty little secrets of Internet search engines, is that there's a huge collection of really useful content on the Internet that Google will never find - nor will any of its competitors, or any single search engine for that matter. We like to think that Google knows all, that if we click through enough of its search results we'll find whatever we need. This just isn't so. Beyond the 'surface web' of content that's continuously mined is the 'deep web'.
So, you're wondering, 'What is the deep web?' and 'Why haven't I ever heard of it?' In reality you've probably searched the deep web, maybe even surfed it, and never even realized it. The deep web is the collection of content that lives inside databases and document repositories, is not available to web crawlers, and is typically accessed by filling out and submitting a search form. If you've ever researched a medical condition at the National Library of Medicine's PubMed database (http://www.ncbi.nlm.nih.gov/PubMed/) or checked the weather forecast at weather.com then you've been to the deep web.
Three nice properties of deep web content are that it is usually of high quality, very specific in nature, and well managed. Consider the PubMed example. Documents cited in PubMed are authored by professional writers and published in professional journals. They focus on very specific medical conditions. The National Library of Medicine spends money to manage its content and make it available. Weather.com provides timely and specific reports of weather conditions for all of the United States and much of the rest of the world as well. Both collections share the three properties.
The deep web is everywhere, and it has much more content than the surface web. Online TV guides, price comparison web-sites, services to find out-of-print books, driving direction sites, services that track the value of your stocks and report news about companies within your holdings - these are just a few examples of valuable services built around searching deep web content.
So, why doesn't Google find me this stuff? The answer is that Google isn't programmed to fill out search forms and click on the submit button. The problem is that there are no standards to guide software, like the smarts behind Google, in how to fill out arbitrary forms. In fact, computers don't 'fill out' and submit forms; they instead interact with the web server that's presenting the form, and send it the information that specifies the query plus other data the web server needs. Each web form is different, and there are too many of them, so Google can't know how to search them all. Plus, it currently takes a human to 'reverse engineer' a web form to determine what information a particular web server wants. Standards are emerging to help with the content access problem, and software will certainly get better at filling out unfamiliar forms, but we have a long way to go before most of the deep web is accessible to the next generation of web crawlers.
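To make the point about 'filling out' a form concrete, here is a minimal Python sketch of what software actually sends to a web server in place of a human typing into a search box. The URL and the field names (term, max) are made up for illustration; every real form has its own names, and discovering them is the 'reverse engineering' step described above. The sketch only builds the request and never contacts a server.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_form_request(action_url, fields, method="GET"):
    """Build the HTTP request a browser would send when a form is submitted.

    action_url and fields are hypothetical here: a real form defines its own
    target URL and field names, which must be discovered form by form.
    """
    encoded = urlencode(fields)  # e.g. "term=deep+web&max=10"
    if method.upper() == "GET":
        # GET forms carry the query in the URL itself.
        return Request(f"{action_url}?{encoded}", method="GET")
    # POST forms carry the same encoded fields in the request body.
    return Request(
        action_url,
        data=encoded.encode("ascii"),
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


# Hypothetical search form with two fields.
get_request = build_form_request(
    "https://example.org/search", {"term": "deep web", "max": 10}
)
post_request = build_form_request(
    "https://example.org/search", {"term": "deep web", "max": 10}, method="POST"
)
print(get_request.full_url)
print(post_request.data)
```

Building the request is the easy part. Knowing which fields a particular server expects, and which hidden values it needs alongside the query, is the part that differs for every form.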
While filling out that web form is non-trivial, it isn't the only barrier to accessing the deep web and it isn't even the hardest problem. Finding the best, or most relevant, content is harder. Within the deep web it means searching multiple sources, collating the results, removing duplicates and sorting the remaining results by some criterion that is meaningful to the person doing the searching. The problem of finding, aggregating, sorting and presenting relevant content is an involved one that we don't want to just gloss over, so we will dedicate an entire article to discussing the issues.
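The collate, de-duplicate and sort steps can be sketched in a few lines of Python. This is a toy illustration, not a description of any particular product: the Result record, the sample data, and the idea that every source reports a comparable relevance score between 0 and 1 are all assumptions made for the example. In practice scores from different sources are rarely comparable, and duplicates are rarely as easy to spot as matching titles, which is part of what makes the real problem hard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    source: str   # which collection returned the hit (hypothetical names)
    title: str
    url: str
    score: float  # assumed relevance in [0, 1]; higher is better


def normalize(title):
    """Crude duplicate key: lowercase the title and collapse whitespace."""
    return " ".join(title.lower().split())


def merge_results(result_lists):
    """Collate hits from several sources, drop duplicates, sort by score."""
    best = {}
    for results in result_lists:
        for result in results:
            key = normalize(result.title)
            # Keep only the highest-scoring copy of each duplicate.
            if key not in best or result.score > best[key].score:
                best[key] = result
    return sorted(best.values(), key=lambda r: r.score, reverse=True)


# Made-up hits from two hypothetical sources.
source_a = [
    Result("A", "Deep Web Mining", "https://a.example/1", 0.9),
    Result("A", "Weather Forecasting", "https://a.example/2", 0.4),
]
source_b = [
    Result("B", "deep  web mining", "https://b.example/9", 0.7),
    Result("B", "Drug Discovery", "https://b.example/3", 0.8),
]

merged = merge_results([source_a, source_b])
for result in merged:
    print(f"{result.score:.1f}  {result.title}  ({result.source})")
```

Here the two 'deep web mining' hits collapse into one, and the three surviving results are ordered by score. Replacing the sort key is where a criterion meaningful to the individual searcher would come in.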
As a professional you should care about what's in the deep web and about how to mine it effectively and efficiently. 'Why is that?' you ask. It's simple. In the worlds of business, science and other professional endeavors, time is money. The slow and steady tortoise may win the race in fairy tales but it's going to get run over or left in the dust in today's competitive marketplace. The race to bring a new product to market, whether it be a new computer chip or a new drug, will be won by the company that can most quickly gather the most relevant information and intelligence and execute on it before its competitors do. A tool that can fill out forms on a number of web-sites with that high quality, specific and well managed content -- whether it be purchased, internal, or publicly available content -- then do the heavy duty processing to deliver the best of the best documents is worth its weight in gold. Such a tool will save you time and money and will make the best use of the content that you pay to acquire.
Imagine taking all of the intellectual property you possess or to which you have access and integrating its access into one simple to use form. Imagine further a system that knows what makes a certain document relevant to you as an individual. This system would be customized to scour your content plus all sorts of knowledge bases relevant to your needs and sift and sort information to present you with the very best of the deep web on demand. It would save you time. It would help you make money. This is the promise of deep web mining.
[ Abe Lederman is founder and president of Deep Web Technologies, LLC, (DWT), a Los Alamos, New Mexico company that develops custom deep web mining solutions. His brother Sol supports Abe in a variety of ways. Visit http://deepwebtech.com for more information including links to sites successfully deploying DWT's sophisticated search applications. ]