
Beyond Information Clutter

Last Updated Jan 2009

By: Mark Bennett, Abe Lederman, and Sol Lederman (Deep Web Technologies, LLC) - Volume 2, Issue 1 - June 2004

Let’s face it, there’s way too much content on the Internet. While finding enough of it was once a big deal, that’s not the case anymore. Huge armies of web crawlers now make their home on the Internet, continuously examining and cataloging documents from all corners of the Web and making them accessible through their search engines. Today’s problem is sorting through all that content, finding what you want, and keeping the clutter down.

Google is my favorite search engine, but it has some serious limitations. In 0.23 seconds Google can identify more documents than I can read in a thousand lifetimes! Fine, you say, Google ranks them for you, so what you’re likely to be most interested in will show up in the first 10 hits, right? Not always. Google mostly searches the surface web. That’s the collection of documents that web crawlers can easily catalog. It doesn’t include the deep web, which is the collection of content (much, much larger than the surface web collection) that typically lives inside databases and holds much of the higher-quality scientific and technical information. Web crawlers don’t know how to search deep web collections, so they miss much of the content that serious researchers are seeking. So the millions of documents that Google finds might not include the ones you’re looking for, either in its first page of hits or in its first thousand pages of hits.

A more serious problem that Google faces (and I’m not picking on Google; all web crawlers have limitations) is that it doesn’t necessarily rank documents the way you would. Google ranks a web page highly if it’s popular, i.e. if lots of other web pages reference it. Popular documents are not necessarily the ones most relevant to you. To its credit, Google does rank in a way that is very useful to many people in many situations. Other search engines rank very poorly, and some don’t rank in any useful way at all. This is especially true of deep web search engines.

Let’s consider one more problem that keeps you from finding what you really want, especially if you’re looking for technical or scientific content. For this type of content you’ll likely use a deep web search engine, like the one Deep Web Technologies developed for http://science.gov, which searches a number of technical and scientific databases and aggregates the results into a single results page. Since each source ranks the documents it returns in its own way, we face the problem of how to rank the documents within the aggregate result set. In other words, if one source ranks documents more highly when they’ve been published recently, a second source ranks documents alphabetically by title, and a third source ranks documents by the frequency of search terms within them, how does one rank the aggregate set of documents returned from the three sources?
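To make the mismatch concrete, here is a minimal sketch, with entirely hypothetical sources, fields, and titles (not the actual science.gov connectors), of three result lists whose positions cannot be meaningfully compared:

```python
# Hypothetical results from three sources, each sorted by its own rule.
# "Position 1" means something different in each list, so interleaving
# results by position mixes incomparable rankings.

source_a = [  # sorted by publication date, newest first
    {"title": "Zeolite Catalysis Review", "date": "2004-05-01"},
    {"title": "Membrane Transport Models", "date": "2003-11-12"},
]

source_b = [  # sorted alphabetically by title
    {"title": "Advances in Fuel Cells", "date": "1999-02-03"},
    {"title": "Basics of Spectroscopy", "date": "2002-07-19"},
]

source_c = [  # sorted by frequency of the search terms in the document
    {"title": "Hydrogen Storage Survey", "term_count": 41},
    {"title": "Fuel Cell Economics", "term_count": 17},
]

# A naive merge by position produces an ordering with no consistent meaning.
naive_merge = [docs[i] for i in range(2) for docs in (source_a, source_b, source_c)]
```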

This relevance ranking problem, as it’s called, is a messy one but a very important one to solve, because without good relevance ranking the result set fills up with clutter, i.e. with documents you didn’t want. The solution has two parts. First, identify what makes a document relevant to a researcher and, second, analyze all documents against those criteria. The first part is fairly straightforward. Good search utilities, current and envisioned, let the user select from a set of criteria. Examples are the article’s publication date, its length, the frequency of search terms within it, the proximity of search terms to one another, and the presence of search terms early in the document.
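Here is a minimal sketch of the first part, assuming the document’s full text and publication date are already in hand; the criteria, weights, and field names below are illustrative, not the ones used at science.gov:

```python
import re
from datetime import date

def relevance_score(doc_text, pub_date, terms, weights):
    """Score one document against user-selected criteria.

    doc_text : full text of the document
    pub_date : datetime.date when the document was published
    terms    : list of lowercase search terms
    weights  : criterion name -> weight, e.g.
               {"recency": 2.0, "frequency": 1.0, "earliness": 0.5}
    """
    words = re.findall(r"\w+", doc_text.lower())

    # Criterion: publication date (newer documents score higher).
    age_years = (date.today() - pub_date).days / 365.25
    recency = 1.0 / (1.0 + age_years)

    # Criterion: frequency of the search terms within the document.
    frequency = sum(words.count(t) for t in terms) / max(len(words), 1)

    # Criterion: presence of search terms early in the document.
    first_hits = [words.index(t) for t in terms if t in words]
    earliness = 1.0 / (1.0 + min(first_hits)) if first_hits else 0.0

    return (weights.get("recency", 0) * recency
            + weights.get("frequency", 0) * frequency
            + weights.get("earliness", 0) * earliness)
```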

Solving the second part of the problem, the ranking and aggregation part, turns out to be difficult. It requires lots of CPU, storage, and network resources, and it places a burden on the computers hosting the documents. It essentially requires retrieving the full text of every document being compared in the ranking process. Without retrieving and analyzing entire documents we have no way of measuring a document’s worth against the user-specified criteria, since the collection itself may rank documents poorly and most likely not according to our standards. Additionally, because different document sources rank differently, the only way to rank all documents against one set of criteria, regardless of which collections they came from, is to ignore the order in which each source returns (ranks) its documents and apply our own ranking approach.
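Here is a sketch of how that second part might be wired together, using the relevance_score() function sketched above and a hypothetical fetch_full_text() helper that performs the expensive full-text retrieval:

```python
def deep_rank(sources, terms, weights, fetch_full_text):
    """Re-rank documents from many sources against one set of criteria.

    sources         : iterable of per-source result lists; each document
                      carries at least a 'url' and a 'pub_date'
    fetch_full_text : callable returning a document's full text (the
                      costly step that burdens the hosting computers)
    """
    scored = []
    for results in sources:
        # Deliberately ignore the order each source returned its results in.
        for doc in results:
            text = fetch_full_text(doc["url"])
            score = relevance_score(text, doc["pub_date"], terms, weights)
            scored.append((score, doc))

    # One unified ranking across all sources.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored]
```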

This approach, which we call “deep ranking,” is not for the impatient but for those who value a thorough search. Retrieving and analyzing the full text of thousands, perhaps tens of thousands, of documents in response to a single query is best done in a batch-oriented environment, as it doesn’t lend itself to real-time processing. The model is one of submitting a search and receiving an email, once processing is complete, with links to the most relevant documents. There is some instant gratification here, however, as our approach retrieves and ranks a number of documents quickly using our “QuickRank” technology, currently in production at http://science.gov. The second benefit is knowing that a large number of potentially relevant documents have been scoured and that only the best ones have been retained. Additionally, that small set has been optimally ranked and tailored to the needs of the researcher. It’s worth the wait.
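The internals of QuickRank are not described here, so the sketch below only illustrates the general two-stage idea in this paragraph: a quick pass over the metadata and snippets the sources already returned, followed by a batch pass over the full text that notifies the researcher when done. All names (quick_pass, batch_pass, notify) are hypothetical:

```python
def quick_pass(sources, terms, weights, limit=50):
    """Rank a modest number of documents immediately, using only the
    snippets and metadata the sources already returned."""
    candidates = [doc for results in sources for doc in results][:limit]
    candidates.sort(
        key=lambda d: relevance_score(d.get("snippet", ""), d["pub_date"],
                                      terms, weights),
        reverse=True)
    return candidates

def batch_pass(sources, terms, weights, fetch_full_text, notify):
    """Retrieve and score the full text of every document, then notify
    the researcher (e.g. by email) with links to the best matches."""
    ranked = deep_rank(sources, terms, weights, fetch_full_text)
    notify(ranked[:100])  # send only the most relevant links
```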

Deep Web Technologies has researched different ways to perform relevance ranking and has created a novel approach that can effectively mine large numbers of text documents from heterogeneous sources and document types and produce a single set of well-ranked documents. The approach uses different algorithms to process different types of documents at varying degrees of thoroughness. This approach can yield great benefits to pharmaceutical companies, law firms, biotechnology companies, and other enterprises needing to effectively separate the clutter from the content.

[ Abe Lederman is founder and president of Deep Web Technologies, LLC, (DWT), a Los Alamos, New Mexico company that develops custom deep web mining solutions. His brother Sol supports Abe in a variety of ways. Visit http://deepwebtech.com for more information including links to sites successfully deploying DWT's sophisticated search applications. ]