Friday, July 30, 2010
 

20+ Differences Between Internet vs. Enterprise Search - Part 2

Last Updated Mar 2009


20+ Differences Between Internet vs. Enterprise Search - And Why You Should Care (Part 2)


 

Part 2: Technical:
Spidering and indexing

In our last installment we talked about the technical differences between Internet and Enterprise search from the users' standpoint.  This month we'll talk more about the technical differences behind the scenes and the issues your IT or SCOE staff may encounter.  The line between the front-end issues a user sees and the back-end issues behind them is a bit hazy: many of the advanced features visible in the front-end search UI actually rely on data that was stored in the fulltext index at index time, a back-end process.  On a misconfigured system end users won't directly see error messages from a badly broken spider, but they'll certainly notice that they are not finding their data anymore.  Keep in mind that there is a fair amount of overlap between the "front end" we described in our last installment and this month's "back end" article, but we've tried to organize items by where an IT person might initially think to look for them.

Editor's Note:  This newsletter article is a summary of a new White Paper we're working on.  If you'd like a copy of the final version, please email us at info@ideaeng.com

Part 2: Outline / Contents
Intro / Recap of Part 1
Content is not always static or "quiescent"
Examples of rapidly changing content & applications
Sidebar: The Problem with Partial Document Updates
Sidebar: A Call for Conjoined Search Indices
Server Side Dynamic Content
GET vs. POST
Exposed CGI vs. Hidden CGI
Server Side vs. Client Side Dynamic Content
Sidebar: Spider Un-Friendly Content Has NON-Spider Consequences - Worse Than You Might Think!
Sidebar: Summary of Dynamic Content and Potential Issues
Server Side Programs in General
Type: short standard "CGI Links" (HTTP GET / HTML hyperlink)
Type: Longer or Complex CGI Links (also HTTP GET / HTML hyperlink)
Type: CGI GET in Forms (HTTP GET / HTML Forms)
Type: CGI POST in Forms (HTTP POST / HTML Forms)
Type: Hidden CGI / Masked CGI / Mapped CGI
Type: Client Side Dynamic Content
RDBMS: Database records aren't just "web pages without links"
Your Data – Document Level
Compound Documents
Composite Documents and Meta Data
Even More Document Level Processing
Pairing Data with Navigators
Coming Up In Our Next Installment…
Read Part 3

Content is not always static or "quiescent"

Search engines assume that for the most part, once a document or web page is created, it will rarely if ever be modified, but highly specialized applications may change documents or database records frequently. In many search engines, changing even one little attribute of a document causes the entire document to be re-filtered and reprocessed.


Examples of rapidly changing content might include:

  • eDiscovery systems and Legal "War Room" applications
  • E-commerce search engines with deep customer database and inventory integration
  • Busy, process-centric Content Management Systems (CMS)
  • Search-enabled company email servers
  • Community-driven social and tagging sites
  • Smart storage appliances with search and security
  • Real-time content such as news feeds, blogs, and real-time Automated Message Handling Systems (AMHS)

All of these systems break the mold for static content search engines.


Server Side Dynamic Content

Dynamic content presents some specific problems for both Internet and Enterprise spiders.  For example, in our previous installment we talked about detecting duplicate and near duplicate pages, and dynamically generated pages are often associated with such problems.  But a more basic problem is how the URLs for the dynamic pages are exposed and whether the spider will follow them, and whether or not the content requires a POST.  If you have dynamic content, you should not blindly assume that it will be handled.


GET vs. POST

Have you ever noticed the weird URLs at the top of your browser?  When you see a long URL, especially if it contains a question mark (?), ampersands (&), and equal signs (=), you are dealing with pages that are being dynamically generated by a CGI program of some sort, and that program is using a particular style of communication called an "HTTP GET".  This GET style of CGI communication can be used either in traditional underlined hyperlinks or from an on-screen form.  This type of dynamic content is usually ignored by spiders.  A few specialized vendors do handle this type of situation, often referred to as accessing "deep" content, or "deep web" spidering.
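To make the punctuation described above concrete, here is a minimal Python sketch that pulls the GET parameters out of a URL.  The function name and the example.com URLs are our own illustrations, and flagging any URL with parameters as "dynamic" is only a rough heuristic; it does not describe how any particular spider decides what to skip.

```python
from urllib.parse import urlsplit, parse_qs

def describe_url(url):
    """Split a URL into its path and any HTTP GET parameters."""
    parts = urlsplit(url)
    # parse_qs returns an empty dict when there is no "?key=value" part.
    params = parse_qs(parts.query)
    return {
        "path": parts.path,
        "params": params,
        "looks_dynamic": bool(params),  # rough heuristic only
    }

# Hypothetical URLs, for illustration only.
static_page = describe_url("http://www.example.com/products/index.html")
cgi_page = describe_url("http://www.example.com/cgi-bin/catalog?category=12&page=3")
```

A spider administrator could use a check like this to report how much of a site sits behind GET parameters before deciding whether the spider needs special configuration.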


Exposed CGI vs. Hidden CGI

We may know that we are dealing with CGI scripts because we see either special punctuation characters in a long URL or because there is an onscreen form to fill out and submit.  But it's also possible for CGI content to be mapped into normal URL space.  The good news is that spiders will generally follow these links, and this may even be one of the primary motivations for this type of URL.  But there's still one other potential problem.  Imagine that a calendar application always presents links to the next and previous days, and the next and previous months, and the next and previous years, even if there are no events scheduled yet.  I once helped adjust a spider that had indexed every empty month, week, day and hour all the way through the year 2039 before it gave up; in this case it had used up the license page count.  So the problem is that masked CGIs can still create what is called a "spider trap", a set of seemingly valid links that is essentially useless and infinite.  A spider that strays into such a trap will never run out of links to follow.

Smart spiders know how to mostly avoid these traps automatically, and the spider administrator can train other spiders to avoid some specific traps.  If your site uses masked URLs, there may be some trial and error as you adjust the rules, and it may be an iterative process as you rescue the spider from one trap, only to have it fall into another.  It gets a bit more complicated if your boss wants "current" calendar events spidered, but not empty schedules from 2039.
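As an illustration of the kind of rule an administrator might add, here is a small Python sketch that accepts calendar pages only when they fall within a window around today's date.  The /calendar/YYYY/MM/DD URL layout, the function name and the 90 day window are all assumptions made for the example; a real spider will have its own rule syntax.

```python
import re
from datetime import date, timedelta

# Assumed URL layout for a hypothetical calendar application.
CALENDAR_URL = re.compile(r"/calendar/(\d{4})/(\d{1,2})/(\d{1,2})(?:/|$)")

def should_crawl(url, today, window_days=90):
    """Return False for calendar pages dated far from 'today'."""
    match = CALENDAR_URL.search(url)
    if match is None:
        return True  # not a calendar page, so this rule has no opinion
    year, month, day = (int(group) for group in match.groups())
    try:
        page_date = date(year, month, day)
    except ValueError:
        return False  # impossible date such as month 13
    return abs(page_date - today) <= timedelta(days=window_days)

today = date(2010, 7, 30)
current_ok = should_crawl("http://intranet.example.com/calendar/2010/8/15", today)
far_future_ok = should_crawl("http://intranet.example.com/calendar/2039/1/1", today)
```

A date window like this keeps "current" events in the index while leaving the empty schedules from 2039 alone, but it only handles this one trap; each new trap tends to need its own rule.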


Server Side vs. Client Side Dynamic Content

We've been careful to keep mentioning the web server in all of this.  With the advent of Java, JavaScript / AJAX / JSON, ActiveX, Flash and an alphabet soup of other browser technologies, client-side scripting is also very popular these days.  All of these technologies run in the browser on the local computer (or phone!), vs. the traditional CGI that runs back on the web site's server.

The problem is that a majority of spiders do not handle this type of content at all, or offer very limited support for it.  We have seen a few that offer limited JavaScript and Flash support by using specialized parsers or by embedding a virtual machine inside the spider, but even then the penetration through multiple levels of navigation is severely limited.
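The sketch below shows why this happens, using a deliberately simple link extractor built on Python's standard html.parser module.  The page markup and the class name are invented for the example, and real spiders are far more elaborate, but the principle is the same: a link that only exists after a script runs is invisible to a parser that just reads the HTML.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags found in the raw HTML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Hypothetical page: one plain link, one link created by a script.
PAGE = """
<html><body>
<a href="/about.html">About</a>
<div id="menu"></div>
<script>
  var link = document.createElement("a");
  link.href = "/products.html";
  link.textContent = "Products";
  document.getElementById("menu").appendChild(link);
</script>
</body></html>
"""

extractor = LinkExtractor()
extractor.feed(PAGE)
found_links = extractor.links
```

Only /about.html is found here.  The products link never reaches the spider unless the spider actually executes the script, which is what the embedded virtual machine approach attempts.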


RDBMS:  Database records aren't just "web pages without links"

In the typical Internet spider, everything is a web page and has a corresponding URL.  In classic search engine terms, we'd say that the web pages are the "documents" (the unit of retrieval).  When search engines are sent after a database, it's common to map database records from a main table into search engine documents.  So, for example, if a database has a catalog of 100,000 products, the search engine will wind up with 100,000 product documents.

A number of technical details differ enough from a normal web crawl to need attention:

  • Different ways of accessing data (connector, web, API, etc.)
  • Defining Virtual Documents
  • URL access to matching documents – what will users click on to view a match?
  • Latency – how updates are recognized – Push, Pull, or Poll
  • Relevancy – where hyperlinks are likely not available
  • Database joins – typically done at index time vs. search time
  • Transactions and advanced database features, including record-level locking, synchronous updates, commit/rollback, complete recordset iteration, joins, non-text datatypes and schemas, arbitrary sorting and "group by", statistics functions, 8-bit, 16-bit and Unicode character sets, and QOS/failover.
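As a rough sketch of the record-to-document mapping, the index-time join and the "what will users click on" question listed above, the Python example below uses an in-memory SQLite database.  The table names, column names, URL template and the build_documents function are all hypothetical; a real connector would target your own schema and your search engine's own document format.

```python
import sqlite3

# Tiny stand-in for a product catalog; the schema is invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE makers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, maker_id INTEGER);
    INSERT INTO makers VALUES (1, 'Acme');
    INSERT INTO products VALUES (100, 'Anvil', 1);
    INSERT INTO products VALUES (101, 'Rocket skates', 1);
""")

def build_documents(conn, url_template):
    """Map each row of the main table to one search document.

    The join happens here, at index time, so the search engine
    receives a flat document and never has to join at search time.
    """
    rows = conn.execute("""
        SELECT p.id, p.name, m.name
        FROM products AS p
        JOIN makers AS m ON m.id = p.maker_id
        ORDER BY p.id
    """)
    return [
        {
            "id": f"product-{product_id}",
            # A database record has no URL of its own, so we make one up
            # that points at whatever page displays the record.
            "url": url_template.format(id=product_id),
            "title": product_name,
            "maker": maker_name,
        }
        for product_id, product_name, maker_name in rows
    ]

documents = build_documents(
    conn, "http://intranet.example.com/catalog/item?id={id}"
)
```

With a catalog of 100,000 products the same loop would yield 100,000 documents.  Note that this sketch does nothing about latency: it has no way to notice that a record changed after it was indexed.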

Please refer to the full text of the white paper (when available) for a complete discussion, or drop us an email.


Your Data – Document Level

The documents being indexed inside corporate and agency firewalls can also deviate from the run-of-the-mill web pages and PDFs that public spiders typically handle.  One assumption is that one web page or file equals one document, where a "document" in search is the basic unit of retrieval.  While this is often true, particular datasets and business logic can run counter to that assumption and, if not compensated for, can compromise the accuracy and relevance of search results.


Compound Documents

In some cases, one large file may contain a number of smaller documents. A file that contains multiple logical documents but has been incorrectly indexed as a single document will tend to match more searches than it should, so when users open that document, they may be confused as to why it matched. Compound documents may also be ranked incorrectly.

Some examples of Compound Documents include:

  • FAQs with multiple questions
  • ZIP and TAR files, or PST email archives
  • Email messages with attachments
  • Compound CMS resources
  • Giant PDF files
  • PowerPoint slides with logical boundaries (a rare requirement)
  • Documents with important section or page boundaries

The last few items are for rather specialized applications and are often not required.
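Taking the FAQ case as an example, the Python sketch below splits one file into one logical document per question.  It assumes a simple "Q:" / "A:" text layout, and the function name and document ids are made up for illustration; real FAQ pages will need their own boundary rules.

```python
# Hypothetical FAQ file contents in an assumed "Q:" / "A:" layout.
FAQ_TEXT = """\
Q: How do I reset my password?
A: Use the self-service page.

Q: Who do I call about a broken badge?
A: Contact building security.
"""

def split_faq(text, source_name):
    """Turn one FAQ file into one document per question."""
    docs = []
    for line in text.splitlines():
        if line.startswith("Q:"):
            # A new question starts a new logical document.
            docs.append({
                "id": f"{source_name}#{len(docs) + 1}",
                "title": line[2:].strip(),
                "body_lines": [line],
            })
        elif docs and line.strip():
            docs[-1]["body_lines"].append(line)
    for doc in docs:
        doc["body"] = "\n".join(doc.pop("body_lines"))
    return docs

faq_docs = split_faq(FAQ_TEXT, "hr-faq.txt")
```

Indexed this way, a search for "badge" matches only the second entry rather than the whole file, so the user lands on the answer that actually matched.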


Composite Documents and Meta Data

Sometimes the data that constitutes a single logical document is actually stored in several places, and must be joined together before being sent to the search engine for indexing.  Sometimes different parts of the document exist on different systems.  An example of this could include a digital media and assets tracking system, where the main drawing or clip is stored on a media server, but where the description and specifications are stored in a database.

A more common case of Composite Documents is where the main document resides in one place, but meta data for that document must be drawn in from other sources.  Meta data can come from within the document itself or from an external source, or it can be parsed, extracted, or inferred from the document text or another available field.  Meta data from all these sources may require further processing to validate and normalize it into a consistent format or to use a consistent name for each item across all data sources.  Some specialized applications and data sources can have well over a thousand different attributes, all of which need to be properly indexed, which is in stark contrast to generic and Internet search.

An example of this type of system would be a set of product documentation stored in a folder hierarchy on a file server, but where additional information about document versions and releases is stored in Oracle.  At index time, the files from the file share must be combined with the results of an Oracle query before being submitted to the search engine indexer.
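A minimal Python sketch of that combining step follows.  To keep it self-contained, a plain dictionary stands in for the rows an Oracle query would return, and the field aliases, paths and function names are all invented for the example.

```python
# Assumed mapping from source-specific field names to one consistent name.
FIELD_ALIASES = {
    "ver": "version",
    "doc_version": "version",
    "rel": "release",
    "release_name": "release",
}

def normalize_metadata(raw):
    """Lower-case field names, apply aliases, and trim values."""
    normalized = {}
    for key, value in raw.items():
        name = key.strip().lower()
        name = FIELD_ALIASES.get(name, name)
        normalized[name] = str(value).strip()
    return normalized

def build_composite(path, file_text, db_rows):
    """Merge file content with the database meta data for the same path."""
    doc = normalize_metadata(db_rows.get(path, {}))
    # Set these last so meta data can never overwrite the core fields.
    doc["id"] = path
    doc["body"] = file_text
    return doc

# Stand-in for the rows a database query might return, keyed by file path.
db_rows = {
    "/docs/widget/install.txt": {"VER": "2.1 ", "Release_Name": "Spring"},
}
composite = build_composite(
    "/docs/widget/install.txt", "Unpack the archive and run setup.", db_rows
)
```

The composite document is what gets submitted to the indexer.  With a thousand or more attributes the alias table becomes the hard part, and is usually worth maintaining outside the code.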


Even More Document Level Processing

In addition to the work that can be done on documents as they enter the system to gather and normalize meta data and entities, there are some higher level functions that can be performed.

Through an extension of entity extraction, rules can be created that cover the entire scope of a web page or document.  You can make assumptions about data based on its location or form, and extract other data.  High-level rules also help identify which parts of the text are the central content, versus ancillary page navigation links or copyright notices.  This helps fix the problem where certain search terms match every page on a site, because the term is used in a site-wide navigator.  Yes, smart engines automatically devalue a term that appears on all pages in relevancy calculations, but if the user entered that term as a single-word query, the engine has little choice but to return all documents, and it will have only minor variations in the number of times the word appears on each page to do its rankings.  Explicitly trimming off the common, non-central parts of pages before sending them to the indexer is more precise, but of course developing and testing such rules for a broad range of pages can be time-consuming.
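One crude way to approximate this trimming is to drop any line of text that appears on every page before the pages reach the indexer.  The Python sketch below does exactly that; the function name, the threshold parameter and the sample pages are our own inventions, and hand-written rules will be more precise than this frequency trick.

```python
from collections import Counter

def strip_sitewide_lines(pages, threshold=1.0):
    """Remove lines that appear on at least 'threshold' of all pages.

    'pages' maps a URL to the extracted text of that page.
    """
    if len(pages) < 2:
        return dict(pages)  # nothing to compare against
    line_counts = Counter()
    for text in pages.values():
        # Use a set so a line counts once per page, however often it repeats.
        line_counts.update(
            {line.strip() for line in text.splitlines() if line.strip()}
        )
    cutoff = threshold * len(pages)
    common = {line for line, count in line_counts.items() if count >= cutoff}
    return {
        url: "\n".join(
            line for line in text.splitlines()
            if line.strip() and line.strip() not in common
        )
        for url, text in pages.items()
    }

# Hypothetical pages that share a navigator and a copyright notice.
pages = {
    "/anvils": "Home | Products | Support\nAnvil care guide\nCopyright Example Corp",
    "/hours": "Home | Products | Support\nSupport hours and contacts\nCopyright Example Corp",
}
trimmed = strip_sitewide_lines(pages)
```

After trimming, a search for "Support" would match only the page that is actually about support, instead of every page that carries the navigator.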


Pairing Data with Navigators

One other consideration about your data, at both the individual document level and at the overall organization of it (the "corpus" level), is that of pairing your data with the appropriate results list navigator technology.  Vendors offer a staggering selection of clickable results list doohickeys for users to drill down their searches on, technologies with names like "parametric search", "faceted navigation", "tag clouds", "taxonomies", "automatic clustering"… the list goes on and on.  To the casual search user these all look about the same, but the implementation of these various techniques is radically different on the back end.  You should care because different techniques are best suited to certain types of data.  Some techniques work better than others, but may require more structured data up front or document preprocessing at index time.
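To show why some navigators need structured data up front, here is a small Python sketch of the counting behind a faceted or parametric navigator.  The field names, sample results and function name are hypothetical, and real engines compute these counts inside the index rather than over a list of dictionaries.

```python
from collections import Counter

def facet_counts(results, field):
    """Count how many matching documents carry each value of a field."""
    return Counter(doc[field] for doc in results if field in doc)

# Hypothetical result list; the third document has no "maker" field.
results = [
    {"title": "Anvil", "category": "Hardware", "maker": "Acme"},
    {"title": "Rocket skates", "category": "Footwear", "maker": "Acme"},
    {"title": "Hammer", "category": "Hardware"},
]
by_category = facet_counts(results, "category")
by_maker = facet_counts(results, "maker")
```

The hammer never shows up under any maker because that field was not populated at index time.  A navigator built this way is only as good as the structured fields behind it, which is the kind of preprocessing cost mentioned above.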


Coming Up In Our Next Installment…

The last installment of this 3-part series will review a number of additional technological points and will discuss some of the business implications of the differences between Enterprise and Internet search. 

Read Part 3 now.

New Idea Engineering always welcomes your questions and input.  Feel free to contact us at info@ideaeng.com


 

Copyright 1996-2009 by New Idea Engineering, Inc.