Enterprise Search Blog

Challenges of the Deep Web Explorers

Last Updated Jan 2009

By: Mark Bennett, with Abe Lederman and Sol Lederman - Deep Web Technologies, LLC

Web spiders these days, it seems, are a dime a dozen. Not to minimize the tremendous value that Google and other search engines provide, but the technology that gathers up or “spiders” web pages is pretty straightforward. Spidering the surface web, consisting mostly of static content that doesn’t change frequently, is mostly a matter of throwing lots of network bandwidth, compute power, storage and time at a huge number of web sites. Merely throwing lots of resources at the deep web, the vast set of content that lives inside of databases and is typically accessed by filling out and submitting search forms, doesn’t work well. Different strategies and a new kind of “deep web explorer” are needed to mine the deep web.

Surface web spiders work from a large list, or catalog, of known and discovered web sites. They load each web site's home page and note its links to other web pages. They then follow these new links, and all subsequent links, recursively. Successful web crawling relies on the fact that site owners want their content to be found and that most of a site's content can be accessed directly, or by following links from the home page. We can say that surface web content is organized by an association of links, or in HTML jargon, an association of <A HREF> tags. We should note that spidering is not without its hazards. Spiders have to be careful not to recrawl links that they've previously visited lest they get tangled up in their own webs!
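To make the idea concrete, here is a minimal sketch of that link-following approach in Python. The seed URL is hypothetical, and a production spider would also respect robots.txt, throttle its requests, and distribute the work across many machines; the point here is just the frontier of discovered links and the "visited" set that keeps the spider out of its own web.

```python
# Minimal surface-web crawler sketch: follow <A HREF> links recursively,
# keeping a "visited" set so the spider doesn't recrawl pages it has seen.
# The seed URL below is hypothetical.
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=50):
    visited = set()        # pages we've already fetched
    frontier = [seed]      # catalog of known and discovered pages
    while frontier and len(visited) < max_pages:
        url, _ = urldefrag(frontier.pop())
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "ignore")
        except Exception:
            continue       # unreachable or non-HTML page: skip it
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            frontier.append(urljoin(url, link))   # follow discovered links
    return visited

if __name__ == "__main__":
    for page in crawl("https://example.com/"):
        print(page)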

If spidering the surface web is not an impressive achievement, then what makes Google's technology so highly touted? In the case of Google and other good search engines, what's impressive is not the ability to harvest lots of web pages (although Google currently searches over four billion pages) but what the engine does with the content once it finds and indexes it. Because the surface web has no structure to it, good search technology has to make relevant content easy to find. In other words, a good search engine creates the illusion of structure, presenting related and hopefully relevant web pages to the user. Google's claim to fame is its popularity-based ranking: it structures content by presenting first the web pages that are most referenced by other web pages.

The deep web is a completely different beast. A web spider trying to harvest content from the deep web will quickly learn that there are none of those <A HREF> links to content and no association of links to follow. It will realize that most deep web collections don't give away their content as readily as surface web collections do. It will quickly find itself faced with the need to speak a foreign language to extract documents from the collection. This need is definitely worth meeting, since the quantity and quality of deep web content is so much greater than that of the surface web.
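A toy illustration of popularity-based ranking, for readers who like to see it in code: order pages by how many other pages link to them. The link graph below is made up, and counting raw inbound links is only a crude proxy; Google's actual PageRank also weights each link by the importance of the page it comes from.

```python
# Toy popularity ranking: pages most referenced by other pages come first.
from collections import Counter

# Hypothetical link graph: page -> pages it links to
link_graph = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html"],
    "d.html": ["c.html"],
}

inbound = Counter(target for targets in link_graph.values() for target in targets)
ranked = sorted(link_graph, key=lambda page: inbound[page], reverse=True)
print(ranked)   # c.html (three inbound links) ranks first
```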

Deep web explorers approach content searching in one of two ways: they either harvest documents or they search collections on the fly. A deep web explorer may attempt to harvest content from a collection that doesn't support harvesting but, for reasons cited below, the effort will likely not be very fruitful. Dipsie and BrightPlanet are harvesters: they build large local repositories of remote content. Deep Web Technologies and Intelliseek search remote collections in real time.

Harvesting and real-time search approaches each have their pluses and minuses. Harvesting is great if you have adequate infrastructure to make the content you've collected available to your users, and if you have a sufficiently fat network pipe plus enough processing and storage resources to get, index and save that content; without that bandwidth and those resources, harvesting is impractical. Harvesting also isn't practical if the search interface doesn't make it easy to retrieve lots of documents, or if it's not easy to determine how to search a particular collection. If the collection doesn't support a harvesting protocol, then harvesting will not retrieve all of its documents. And if a collection is constantly adding documents, then either the collection has to somehow identify its new content or you're going to waste lots of resources retrieving documents already in your local repository just to get a few new ones.

OAI, the Open Archives Initiative, provides an example of a harvesting protocol: OAI-PMH, the Protocol for Metadata Harvesting. It describes a client-server model useful for aggregating multiple collections into a single centralized collection. The server tells the client, among other things, what documents are new in its collection, and the client updates its repository with them.
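A minimal sketch of incremental harvesting in the OAI-PMH style is shown below: the client asks the server for records added or changed since its last harvest and folds them into its local repository. The base URL is hypothetical, and a real harvester would also handle resumption tokens, deleted records, and error responses.

```python
# OAI-PMH-style incremental harvest sketch (hypothetical endpoint).
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def harvest_since(base_url, last_harvest_date):
    params = urlencode({
        "verb": "ListRecords",        # "give me records..."
        "metadataPrefix": "oai_dc",   # ...as Dublin Core metadata...
        "from": last_harvest_date,    # ...added or changed since this date
    })
    xml = urlopen(f"{base_url}?{params}", timeout=30).read()
    root = ET.fromstring(xml)
    for record in root.iter(f"{OAI_NS}record"):
        identifier = record.findtext(f"{OAI_NS}header/{OAI_NS}identifier")
        yield identifier, record      # caller stores these in its repository

if __name__ == "__main__":
    # Hypothetical OAI-PMH endpoint and last-harvest date
    for oai_id, rec in harvest_since("https://repository.example.org/oai", "2009-01-01"):
        print(oai_id)
```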

Deep Web Technologies' (DWT) Distributed Explorit application implements the other approach, real-time search, which also has its pluses and minuses. A tremendous plus is that most deep web collections lend themselves to real-time searching even if they don't lend themselves to harvesting. This is because real-time searching doesn't require the content owner to implement a harvesting protocol: the owner doesn't have to do anything to its documents to allow them to be searched, and it doesn't need to generate metadata or otherwise structure its content. An on-the-fly search client uses the simple HTTP protocol to fill out and submit a web form that initiates a query against the content database. The client then processes (parses) the content returned and displays search results to the user. DWT's Distributed Explorit runs multiple simultaneous real-time searches against different collections, then aggregates the results and displays them to the user. The minuses of the harvesting approach become pluses in real-time searching: the entire infrastructure you needed to retrieve, store, refresh and index remote content, and then to provide access to it, disappears.
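Here is a rough sketch of that federated, on-the-fly pattern: submit the same query to several collections' search forms over plain HTTP, parse each result page, and aggregate the hits. The collection endpoints, form field names, and the naive parsing rule are all hypothetical; this is not Explorit's code, just an illustration of the general approach under those assumptions.

```python
# Federated real-time search sketch: query several hypothetical deep web
# collections simultaneously over HTTP, parse results, aggregate them.
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical collections and the names of their query form fields
COLLECTIONS = {
    "physics-db": ("https://physics.example.org/search", "query"),
    "med-db": ("https://med.example.org/find", "term"),
}

class ResultLinkParser(HTMLParser):
    """Naive parse rule: treat every <a href> on the result page as a hit."""
    def __init__(self):
        super().__init__()
        self.hits = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hits.append(href)

def search_one(name, query):
    url, field = COLLECTIONS[name]
    page = urlopen(f"{url}?{urlencode({field: query})}", timeout=15).read()
    parser = ResultLinkParser()
    parser.feed(page.decode("utf-8", "ignore"))
    return [(name, hit) for hit in parser.hits]

def federated_search(query):
    # Query all collections at the same time, then merge the result lists
    with ThreadPoolExecutor(max_workers=len(COLLECTIONS)) as pool:
        result_lists = pool.map(lambda n: search_one(n, query), COLLECTIONS)
    return [hit for hits in result_lists for hit in hits]

if __name__ == "__main__":
    for source, link in federated_search("deep web"):
        print(source, link)
```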

Minuses of real-time searching are the ongoing demands placed on the remote collection, the reliance on the availability of the remote content, the vulnerability of depending on search forms that change or break, and the inability to rank documents in a homogeneous and effective way. (Search engines are notorious for ranking poorly or not at all, and even collections that do rank documents in a relevant way can't deal with the fact that their well-ranked documents will likely be aggregated with poorly ranked documents from other collections.)
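A tiny illustration of that last problem, with made-up scores: relevance scores from different collections aren't on a comparable scale, so naively sorting the merged list by raw score lets one collection's documents crowd out everything else regardless of actual relevance.

```python
# Why heterogeneous ranking is hard: the scores below are invented, and the
# two collections score on completely different scales.
hits_a = [("a-doc1", 0.92), ("a-doc2", 0.87)]    # collection A scores in 0..1
hits_b = [("b-doc1", 412.0), ("b-doc2", 305.0)]  # collection B uses raw, unbounded scores

merged = sorted(hits_a + hits_b, key=lambda hit: hit[1], reverse=True)
print(merged)   # collection B always "wins", whatever the documents are about
```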

Now that we've tapped into the vast content of the deep web, we quickly discover that we're drowning in content, and not all of it is so relevant. What's a deep web explorer to do with so many documents? We'll explore this question next time.

[ Abe Lederman is founder and president of Deep Web Technologies, LLC, (DWT), a Los Alamos, New Mexico company that develops custom deep web mining solutions. His brother Sol supports Abe in a variety of ways. Visit http://deepwebtech.com for more information including links to sites successfully deploying DWT's sophisticated search applications. ]