Internet vs. Enterprise Search - Part 2

20+ Differences Between Internet vs. Enterprise Search - And Why You Should Care (Part 2)

Part 2: Technical: Spidering and indexing

In our last installment we talked about the technical differences between Internet and Enterprise search from the users' standpoint. This month we'll talk more about the technical differences behind the scenes and the issues your IT or SCOE staff may encounter. The line between the front-end issues a user sees and the back-end issues behind them is a bit hazy – many of the advanced features which are visible in the front-end search UI actually rely on data stored in the fulltext index at index time, a back-end process. On a misconfigured system end users won't directly see error messages from a badly broken spider, but they'll certainly notice that they are not finding their data anymore. Keep in mind there is a fair amount of overlap between the "front end" we described in our last installment and this month's "back end" article, but we've tried to organize items by where an IT person might initially think to look for them.

Editor's Note: This newsletter article is a summary of a new White Paper we're working on. If you'd like a copy of the final version, please email us at info@ideaeng.com

Part 2: Outline / Contents

Intro / Recap of Part 1
Content is not always static or "quiescent": Examples of rapidly changing content & applications
Sidebar: The Problem with Partial Document Updates
Sidebar: A Call for Conjoined Search Indices
Server Side Dynamic Content: GET vs. POST; Exposed CGI vs. Hidden CGI; Server Side vs. Client Side Dynamic Content
Sidebar: Spider Un-Friendly Content Has NON-Spider Consequences - Worse Than You Might Think!
Sidebar: Summary of Dynamic Content and Potential Issues: Server Side Programs in General; Type: short standard "CGI Links" (HTTP GET / HTML hyperlink); Type: Longer or Complex CGI Links (also HTTP GET / HTML hyperlink); Type: CGI GET in Forms (HTTP GET / HTML Forms); Type: CGI POST in Forms (HTTP POST / HTML Forms); Type: Hidden CGI / Masked CGI / Mapped CGI; Type: Client Side Dynamic Content; RDBMS: Database records aren't just "web pages" without Links
Your Data – Document Level: Compound Documents; Composite Documents and Meta Data
Even More Document Level Processing
Pairing Data with Navigators
Coming Up In Our Next Installment…: Read Part 3

Content is not always static or "quiescent"

Search engines assume that for the most part, once a document or web page is created, it will rarely if ever be modified, but highly specialized applications may change documents or database records frequently. In many search engines, changing even one little attribute of a document causes the entire document to be re-filtered and reprocessed.

Examples of rapidly changing content might include:

eDiscovery systems and Legal "War Room" applications
E-commerce search engines with deep customer database and inventory integration
Busy process centric Content Management Systems (CMS)
Search-enabled company email servers
Community-driven social/tagging driven sites
Smart storage appliances with search and security
Real-time content such as news feeds, blogs, and real-time Automated Message Handling Systems (AMHS)

All of these systems break the mold for static content search engines.

Sidebar: The Problem with Partial Document Updates

Note: Partial Updates are also sometimes referred to as "Incremental Indexing", but this can be confusing because at other times Incremental Indexing refers to the overall spidering process.

While some vendors advertise support for partial updates, in some applications this can mean thousands of changes an hour, and traditional search engines do not always handle that efficiently. In the legal war room example, a change to one status flag in a document will require a partial update. That partial update, even of a single field, may still represent over 50% of the work of indexing the document from scratch; some engines actually spend more effort on an update than on a new insert. If thousands of documents are being flagged by paralegals, these updates can swamp the system.
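
To make the cost concrete, here is a minimal sketch of a toy inverted index. It is not any vendor's implementation: the `ToyIndex` class and its methods are hypothetical, and it assumes that a field update is carried out as a delete followed by a full re-insert, which is one possible design among several.

```python
from collections import defaultdict


class ToyIndex:
    """Hypothetical single-index engine: an update is a delete plus a full re-add."""

    def __init__(self):
        self.postings = defaultdict(set)   # term -> set of doc ids
        self.docs = {}                     # doc id -> dict of fields
        self.tokens_processed = 0          # rough measure of indexing work

    def _tokens(self, fields):
        for value in fields.values():
            yield from str(value).lower().split()

    def insert(self, doc_id, fields):
        self.docs[doc_id] = dict(fields)
        for term in self._tokens(fields):
            self.postings[term].add(doc_id)
            self.tokens_processed += 1

    def delete(self, doc_id):
        old = self.docs.pop(doc_id)
        for term in self._tokens(old):
            self.postings[term].discard(doc_id)
            self.tokens_processed += 1

    def update_field(self, doc_id, name, value):
        # Assumed behavior: the whole document is removed and re-indexed,
        # even though only one small field changed.
        fields = dict(self.docs[doc_id])
        fields[name] = value
        self.delete(doc_id)
        self.insert(doc_id, fields)


index = ToyIndex()
body = "deposition transcript " * 500          # 1000 tokens of rarely changing text
index.insert("doc1", {"body": body, "status": "unreviewed"})
insert_cost = index.tokens_processed

index.update_field("doc1", "status", "privileged")
update_cost = index.tokens_processed - insert_cost

print(insert_cost, update_cost)
```

In this toy, flipping a one-word status flag touches more tokens than the original insert did, which is the pattern described above. Real engines differ in the details, so treat the numbers as illustrative only.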

The relative inefficiency of document updates is shocking news to most corporate buyers of search. They are often surprised that changing even one flag on a document can incur such a high processing cost. This situation is made worse by vendors who tout incremental indexing or partial/field level document updates. A related issue is document latency, which we'll get to below.

In defense of the vendors, their software will generally allow you to update individual fields, so they are not lying when they claim to support it. From a functional standpoint, they may even offer different API calls for updating fields in an existing document than for inserting a new one, which frees up the programmers from having to re-supply all of the other document data that has not changed. For most mainstream applications, the engine is so fast at reindexing changed documents that even the relatively high processing costs are still well within customer requirements, and end users need not worry. Problems usually crop up only in very large applications, or applications with very volatile meta data AND very demanding reindex performance requirements.

The reasons for these relative inefficiencies vary from vendor to vendor and have to do with very low level aspects of their software's indexing process. Again, these implementations arose from a design based on indexing and serving the web, where documents are less volatile, but that's the whole point of this series! These aren't "good" or "bad" engines, it's just that they are being asked to perform tasks that are orthogonal to their core design. Try driving your car on railroad tracks a little way and you'll see what I mean - the tracks are fine, and so was your car, but these two forms of transportation are different enough in their implementation that they don't interoperate well.

Another example within the search space is the enthusiasm Google's "Map/Reduce" software architecture is receiving in software engineering circles. Map/Reduce is a means of efficiently dividing up work into smaller units that can then be handed off to other machines, allowing for massively distributed processing. We've seen white papers by Microsoft's engineers critiquing Google's techniques, so you know the technique is getting some real traction. In the Open Source community the Hadoop project attempts to implement the Google-esque "Map/Reduce" methodology. Some of this code was then used in the open source Nutch spider, using the open source Lucene search engine at its core. Great stuff, BUT... a fundamental assumption in the Hadoop engine is that once a document is indexed, it does not change. Hadoop splits up the work units and sends them off to other machines for processing, never to be seen again. If the spider later finds a web page that has changed, it is resubmitted to the same distributed process, and the older version of that document is marked as obsolete in the search index. This becomes a problem if a specialized application needs to do frequent document updates and also maintain low latency.
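
For readers who haven't seen the pattern, the following is a single-process sketch of the map and reduce steps applied to building an inverted index. It does not use Hadoop's API; the function names and sample documents are made up for illustration, and a real cluster would run the map calls on separate machines.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map step: emit one (term, doc_id) pair per distinct term in a document."""
    return [(term, doc_id) for term in set(text.lower().split())]


def reduce_phase(pairs):
    """Reduce step: group the emitted pairs by term into sorted posting lists."""
    grouped = defaultdict(list)
    for term, doc_id in pairs:
        grouped[term].append(doc_id)
    return {term: sorted(ids) for term, ids in grouped.items()}


# Each document is an independent work unit; on a real cluster each call to
# map_phase could run on a different machine.
documents = {
    "page1": "Ferrari road test",
    "page2": "Ferrari model year 2005",
}
emitted = []
for doc_id, text in documents.items():
    emitted.extend(map_phase(doc_id, text))

inverted_index = reduce_phase(emitted)
print(inverted_index["ferrari"])
```

Notice that nothing in this flow has a cheap path for "page2 changed one field": the changed document simply goes through the map and reduce steps again.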

Google's success and amazing scalability on the public Internet is a testament to the amazing power of the Map/Reduce design, and how well it can efficiently manage billions of documents in a fully distributed and fault tolerant way, as long as high rates of near real-time updates and other aspects of transactional processing are not a high priority. This is not "Google Bashing" in any way; it's not about "good" or "bad" designs. It's just a question of software requirements and how priorities vary from application to application. There's more to say about the other transaction-oriented features some search apps require, which we'll get to later.

Sidebar: A Call for Conjoined Search Indices

We have a proposal for a relatively simple architectural change that would accommodate most (though not all) of the non-static document scenarios we've outlined above. We think it would be relatively easy to implement on top of existing search engine cores.

The basic idea is to divide the fulltext index of documents into two separate search indexes that are linked together in a 1:1 relationship. One index would be optimized for the large, rarely changing document text, which I'll call "DocIdx", and the other index would be heavily optimized for the relatively small but quite volatile document Meta Data, which I'll call "MetaIdx". These two indexes would still be exposed to higher level applications as a single fulltext index / collection, which I'll call FullIdx. In the search engine's kernel, however, searches submitted to FullIdx would be run against both DocIdx and MetaIdx, and the results would be seamlessly merged. Yes, this is less efficient than having a single index, but that fundamental inefficiency would still exist if programmers were forced to implement this logic at the application level, and such inefficiencies are more likely to be minimized closer to the core engine.

Warning: Heavy Nerd-Speak in the rest of this sidebar, please email us if you want any clarifications. Mark feels very strongly about this, and we're hoping it will make sense to the right people.

Imagine that at the API level a search is submitted for the fulltext keyword Ferrari in the main document and a Meta Data qualifier of model year=2005. I'll call this entire search query "Q". The core engine parses this query into its two components Q.fulltext:Ferrari and Q.meta:year=2005. It can then easily determine that Q.fulltext need only be sent to DocIdx, and that Q.meta only needs to be run against MetaIdx.
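
The query split described above can be sketched in a few lines. This is a toy model under stated assumptions, not a proposal for real kernel code: the `ConjoinedIndex` class, its methods, and the sample documents are all hypothetical, and relevancy scoring is left out entirely.

```python
class ConjoinedIndex:
    """Hypothetical FullIdx facade over a static DocIdx and a volatile MetaIdx."""

    def __init__(self):
        self.doc_idx = {}    # term -> set of doc ids (large, rarely changes)
        self.meta_idx = {}   # (field, value) -> set of doc ids (small, volatile)
        self.meta = {}       # doc id -> current meta data fields

    def add(self, doc_id, text, meta):
        for term in set(text.lower().split()):
            self.doc_idx.setdefault(term, set()).add(doc_id)
        self.meta[doc_id] = {}
        for field, value in meta.items():
            self.set_meta(doc_id, field, value)

    def set_meta(self, doc_id, field, value):
        # A meta data change touches only MetaIdx; the text is never re-indexed.
        old = self.meta[doc_id].get(field)
        if old is not None:
            self.meta_idx[(field, old)].discard(doc_id)
        self.meta[doc_id][field] = value
        self.meta_idx.setdefault((field, value), set()).add(doc_id)

    def search(self, fulltext, **meta):
        # Q.fulltext goes to DocIdx, Q.meta goes to MetaIdx, results are merged.
        hits = set(self.doc_idx.get(fulltext.lower(), set()))
        for field, value in meta.items():
            hits &= self.meta_idx.get((field, value), set())
        return sorted(hits)


full_idx = ConjoinedIndex()
full_idx.add("car1", "Ferrari F430 road test", {"year": 2005})
full_idx.add("car2", "Ferrari 360 review", {"year": 2004})
print(full_idx.search("Ferrari", year=2005))

full_idx.set_meta("car2", "year", 2005)   # volatile update, DocIdx untouched
print(full_idx.search("Ferrari", year=2005))
```

The point of the sketch is the `set_meta` path: changing a flag never touches the large text index, while callers still see one logical collection.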

This is an intentionally oversimplified example used to make a point. In the real world, the Meta Data index might include Titles and User Tags, which would also need to be searched for the term Ferrari. In that case the engine would need to combine those results, including relevancy calculations. This would be a bit tricky, but given the brilliant minds at work, I'm sure it can be handled. And again, if this type of thing isn't done at the search engine's core, then it would have to be done at the application level, which would be less efficient.

There are many advantages of doing this in the core engine. Queries can be precisely factored to create an optimized execution plan based on the actual field/index allocation. Low level results can be carefully combined, being respectful of low level TF and IDF scores, etc. The engine can eliminate records from the match set before results fields are fully marshaled - a huge savings - and this is something that can only be done efficiently at the kernel level. Low level sorting can be implemented as a simple N-way merge sort, assuming the underlying subsystems are returning sorted results for each physical index.
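
As a small illustration of the merge step only, the sketch below uses Python's standard `heapq.merge` to combine result streams that are each already sorted by descending score. The hit lists and the `merged_top` helper are made up, and the sketch does not attempt to combine scores for a document that appears in more than one stream.

```python
import heapq
from itertools import islice

# Each physical index is assumed to return (score, doc_id) pairs already
# sorted by descending score. The data below is made up for illustration.
doc_idx_hits = [(0.92, "car1"), (0.40, "car7"), (0.15, "car3")]
meta_idx_hits = [(0.88, "car2"), (0.35, "car9")]
other_hits = [(0.95, "car5"), (0.10, "car4")]


def merged_top(k, *sorted_streams):
    """Lazily merge pre-sorted result streams and keep only the first k."""
    merged = heapq.merge(*sorted_streams, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, k))


print(merged_top(3, doc_idx_hits, meta_idx_hits, other_hits))
```

Because the merge is lazy, only the first few entries of each stream are ever examined when the caller wants a short results page.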

Instead of calling the indexes "document" and "meta data", it would be more precise to label them "static" and "volatile" or "quiescent" and "volatile". We don't know for a fact that the document itself isn't what's being frequently changed, and some Meta Data may never change, so using the more generic "volatile" qualifier might be clearer, and will sound familiar to Java programmers at least.

I admit this design will not address all of the dynamic data scenarios we've discussed. In the case of many editors making frequent and substantial changes to a large number of MS Word documents, you really will have to refilter and reindex them over and over if you want to maintain an up-to-the-minute index. The same would be true for blog comments. But those use cases are in the minority - most of the really tough and expensive scenarios involving rapidly changing content involve data that can easily be segmented into one of the two buckets: big-and-quiet or small-and-volatile.

Though I think this is a reasonable and relatively easy-to-implement fix, to date we don't know of any vendors who have implemented this solution. Vendors usually focus on keeping search up and running, with document indexing as a secondary concern; while this is an understandable set of priorities for *most* of their customers, we'd like to see at least one vendor take this on. Our other hope is that Lucene could perhaps be the first to try. If you know of anyone who has actually implemented such a system and can talk about it without getting in trouble, please let us know!

Server Side Dynamic Content

Dynamic content presents some specific problems for both Internet and Enterprise spiders. For example, in our previous installment we talked about detecting duplicate and near duplicate pages, and dynamically generated pages are often associated with such problems. But a more basic problem is how the URLs for the dynamic pages are exposed and whether the spider will follow them, and whether or not the content requires a POST. If you have dynamic content, you should not blindly assume that it will be handled.

GET vs. POST

Have you ever noticed the weird URLs at the top of your browser? When you see a long URL, especially if it contains a question mark (?), ampersands (&), and equal signs (=), you are dealing with pages that are being dynamically generated by a CGI program of some sort, and that program is using a particular style of communications called an "HTTP GET". This GET style of CGI communications can be used either from traditional underlined hyperlinks or from an on-screen form. This type of dynamic content is usually ignored by spiders. A few specialized vendors do handle this type of situation, often referred to as accessing "deep" content, or "deep web" spidering.
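
If you want to check a URL programmatically rather than by eye, Python's standard `urllib.parse` module can pull apart the program path and its GET arguments. The helper name and the example URL below are ours, used only for illustration.

```python
from urllib.parse import urlsplit, parse_qs


def describe_get_url(url):
    """Split a URL into the program path and its GET arguments."""
    parts = urlsplit(url)
    args = parse_qs(parts.query)
    return {
        "program": parts.path,
        "is_cgi_get": bool(parts.query),
        "arguments": args,
    }


# Hypothetical URL used only for illustration.
info = describe_get_url("http://company.com/cgi-bin/calendar.pl?year=2008&month=4&day=15")
print(info["is_cgi_get"], info["arguments"])
```

Everything after the question mark ends up in `arguments`; a URL with no question mark reports `is_cgi_get` as false, although it may still be a masked CGI as discussed in the next section.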

Exposed CGI vs. Hidden CGI

We may know that we are dealing with CGI scripts because we see either special punctuation characters in a long URL or because there is an onscreen form to fill out and submit. But it's also possible for CGI content to be mapped into normal URL space. The good news is that spiders will generally follow these links, and this may even be one of the primary motivations for this type of URL. But there's still one other potential problem. Imagine that a calendar application always presents links to the next and previous days, and the next and previous months, and the next and previous years, even if there are no events scheduled yet. I once helped adjust a spider that had indexed every empty month, week, day and hour all the way through the year 2039 before it gave up – in this case it had used up the license page count. So the problem is that masked CGIs can still create what is called a "spider trap", a set of seemingly valid links that is essentially useless and infinite – a spider that strays into such a trap will never run out of links to follow.

Smart spiders know how to mostly avoid these traps automatically, and the spider administrator can train other spiders to avoid some specific traps. If your site uses masked URLs, there may be some trial and error as you adjust the rules, and it may be an iterative process as you rescue the spider from one trap, only to have it fall into another. It gets a bit more complicated if your boss wants "current" calendar events spidered, but not empty schedules from 2039.
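
A rule of the "current events but not 2039" kind might look something like the sketch below. It is only an illustration: the `/calendar/YYYY/MM` URL layout, the `should_follow` function, and the window sizes are all assumptions, and your spider will have its own syntax for include and exclude rules.

```python
import re
from datetime import date

# Hypothetical URL layout for a masked calendar CGI: /calendar/YYYY/MM
CALENDAR_URL = re.compile(r"/calendar/(\d{4})/(\d{1,2})(?:/|$)")


def should_follow(url, today, months_back=1, months_ahead=6):
    """Return False for calendar pages outside a window around the current month."""
    match = CALENDAR_URL.search(url)
    if match is None:
        return True                      # not a calendar page, no rule applies
    year, month = int(match.group(1)), int(match.group(2))
    if not 1 <= month <= 12:
        return False                     # malformed month, treat as a trap
    offset = (year - today.year) * 12 + (month - today.month)
    return -months_back <= offset <= months_ahead


today = date(2008, 4, 15)
print(should_follow("http://company.com/calendar/2008/05", today))
print(should_follow("http://company.com/calendar/2039/01", today))
print(should_follow("http://company.com/marketing/2008/04/release027789.html", today))
```

The window moves with the current date, so the rule does not need to be edited every month.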

Server Side vs. Client Side Dynamic Content

We've been careful to keep mentioning the web server in all of this. With the advent of Java, Java Script / AJAX / JSON, Active X, Flash and an alphabet soup of other browser technologies, client side scripting is also very popular these days. All of these technologies run on the local computer's (or phone's!) browser, vs. the traditional CGI that runs back on the web site's server.

The problem is that a majority of spiders do not handle this type of content at all, or offer very limited support for it. We have seen a few that offer limited Java Script and Flash support by using specialized parsers or by embedding a virtual machine inside the spider, but even then the penetration through multiple levels of navigation is severely limited.

Sidebar: Spider Un-Friendly Content Has NON-Spider Consequences - Worse Than You Might Think!

But it's not just spiders that suffer from inaccessible content! Like the classic "canary in a coal mine", if the spider is having issues, it may bring to light bigger problems, including legal ones! Here are some of the other problems that tend to coincide with spider-UN-friendly content:

ADA Compliance Issues
CEO / iPhone / "blackberry" Compliance
Search portal rankings, Search Engine Optimization (SEO) / Bad Habits that Cross the Firewall
Regulatory and eDiscovery Compliance
Ease of Debugging / content reporting

Of course if you're creating a gaming or mapping application, then there is a point at which spiders, mobile phone and blind users will probably not be able to use it. But I think this is only justified if that functionality is the core purpose of the application.

Sidebar: Summary of Dynamic Content and Potential Issues

Server Side Programs in General

This is a potential issue for all types of server side content.
Potential problems:
May form an infinite spider trap.
Security, cookies, and cache control overrides are also potential areas of concern.

Type: short standard "CGI Links" (HTTP GET / HTML hyperlink)

Recognizing them:
URLs have question marks and other punctuation, fewer than 200 characters, and no "session id" or other odd fields. They will often be in the form of a standard looking hyperlink. If this type of URL is observed as the result of submitting a form, see CGI GET Forms.
Potential problems:
These links are ignored by default by some spiders, but you can usually change this setting.
Possible remedies:
Check your vendor's documentation for topics like "Enabling CGI Links" and "URL Include and Exclude Patterns".

Type: Longer or Complex CGI Links (also HTTP GET / HTML hyperlink)

Recognizing them:
Similar to the short form described above, but URLs are very long or complex, or have session or cookie related information.
This form of URL is very common in some Content Management Systems (CMS). For example, in Vignette every page on a site points to the same CGI entry point, and the arguments after the question mark identify which page and view is being requested. Other CMS applications will map this complexity into normal looking URLs (see Mapped/Hidden CGI below).
Potential problems: (in addition to above)
URLs may not be valid the next day.
URLs may not be valid for other users.
Variations of the URL may point to identical content, causing duplicate search results.
Possible remedies:
Check your vendor's documentation for topics like "Handling Session IDs in URLs" and "URL Normalization".
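
To show what URL normalization means in practice, here is a minimal sketch using Python's standard `urllib.parse`. The list of session parameter names is an assumption (check what your own site actually uses), and sorting the remaining arguments is only safe when the application does not care about argument order.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed names of session-related parameters; check what your own site uses.
SESSION_PARAMS = {"sessionid", "sid", "jsessionid", "phpsessid"}


def normalize_url(url):
    """Drop session parameters and sort the rest so equivalent URLs compare equal."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name.lower() not in SESSION_PARAMS
    ]
    query = urlencode(sorted(kept))
    # Lower-case the host and drop any #fragment, neither of which changes the page.
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path, query, ""))


a = normalize_url("http://Company.com/site.cgi?view=full&page=42&sessionid=abc123")
b = normalize_url("http://company.com/site.cgi?page=42&view=full&sessionid=zzz999")
print(a)
print(a == b)
```

Two URLs that differ only in session id and argument order collapse to the same string, which addresses both the "not valid the next day" and the duplicate results problems listed above.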

Type: CGI GET in Forms (HTTP GET / HTML Forms)

Recognizing them:
The URLs look like the CGI Links described above, but are the result of submitting a form on a web page, vs. clicking on a hyperlink. An important technical detail is that the server usually can't tell whether a CGI GET was submitted by a form or by clicking on a hyperlink - we'll explain why this is useful in the Remedies section.
This is controlled in the "method" attribute of the HTML form tag, and you can view the HTML source to confirm this.
Potential problems:
Content will probably not be thoroughly indexed. Most spiders will not fill out HTML forms, or they may submit it only once with the default values.
Possible remedies:
In some cases, the page that is returned by submitting the default form will include lots of CGI Links that a spider can be told to follow. Remember we said that when using a GET, the server can't tell (or doesn't care) whether the request came from submitting a form or clicking on a hyperlink. And many CGI applications will give back a full set of hyperlinks that are equivalent to many versions of the filled in form, with different combinations of values filled in. Thinking about our calendar example, the initial page might start with drop down lists for the month, day, and year. But when the form is submitted with any values, the system will return a page that has links to other days and months, and most spiders can follow those links. One other detail is getting the spider to submit the initial form with default values. This is very simple, just look at the URL in the action attribute of the form tag, and use that as a seed.
Another workaround is to have another program generate a page of CGI links that simulate all possible combinations of form values, and have the spider start with that page. In this case we're replacing a CGI GET Form with a set of CGI GET hyperlinks, but the server shouldn't care. In the case of our calendar, we could have a Perl program generate a bunch of hyperlinks pointing to the calendar.pl program, each with a different combination of year, month, and day.
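
The calendar workaround above can be sketched as follows. We use Python here rather than Perl, and the base URL and the `year`, `month`, and `day` argument names are assumptions; substitute whatever your form's action attribute and field names actually are.

```python
import calendar
from html import escape
from urllib.parse import urlencode

# Hypothetical entry point; substitute the URL from your form's action attribute.
BASE_URL = "http://company.com/cgi-bin/calendar.pl"


def calendar_links(years):
    """Yield one GET URL per valid calendar day in the given years."""
    for year in years:
        for month in range(1, 13):
            days_in_month = calendar.monthrange(year, month)[1]
            for day in range(1, days_in_month + 1):
                query = urlencode({"year": year, "month": month, "day": day})
                yield f"{BASE_URL}?{query}"


def seed_page(years):
    """Build a bare HTML page of hyperlinks for the spider to start from."""
    anchors = "\n".join(
        f'<a href="{escape(url)}">{escape(url)}</a><br>' for url in calendar_links(years)
    )
    return f"<html><body>\n{anchors}\n</body></html>"


links = list(calendar_links([2008]))
print(len(links), links[0])
```

Save the output of `seed_page` somewhere the spider can reach and use that page as the starting point. Limiting the list of years also keeps the spider out of the empty-calendar trap described earlier.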

Type: CGI POST in Forms (HTTP POST / HTML Forms)

Recognizing it:
This situation will look similar to the GET Forms we just described, except that the resulting URL you get after submitting the form is very short and does not have a question mark. You can also check the HTML source code and look at the method attribute of the form tag; it will be set to "POST".
Potential problems:
Most spiders will not submit the form at all, or will submit it only once with default values.
Possible remedies:
You might double check the spider's documentation; it may be possible to prime it with values to submit, but this is quite rare.
The good news is that most (though not all) programs that accept a CGI POST will also accept a CGI GET. If this is the case, then you can use the techniques we just talked about in the CGI GET forms section above! There are a few older programs that only accept a POST, but this is a function of the CGI program, not of the spider. If that's the case, you might talk to the programmer to see if it can be modified to also accept GETs.
Editor's note: This next paragraph uses terminology that may not be clear to all readers. Since it is a potentially important workaround, we're including it here for completeness for the programmers reading this article, but feel free to skip it if you're not one of them!
A more complex option is to have somebody write a lightweight shim to act as a GET-to-POST adapter proxy, and then spider using the GET-Form techniques described above, using the proxy as the entry point.
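
For the programmers, a bare-bones version of such an adapter could look like the sketch below, built only on Python's standard library. The target URL, the port, and the function and class names are hypothetical. The sketch has no error handling and does not rewrite links in the returned pages, both of which a real shim would probably need.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlsplit
from urllib.request import Request, urlopen

# Hypothetical POST-only program that the spider cannot reach directly.
TARGET_URL = "http://company.com/cgi-bin/search.pl"


def to_post_request(get_path, target_url=TARGET_URL):
    """Turn the query string of an incoming GET into the body of a POST."""
    query = urlsplit(get_path).query
    return Request(
        target_url,
        data=query.encode("ascii"),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


class GetToPostHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Replay the spider's GET as a POST and relay the response body.
        with urlopen(to_post_request(self.path)) as upstream:
            body = upstream.read()
            content_type = upstream.headers.get("Content-Type", "text/html")
        self.send_response(200)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Run the adapter; point the spider at http://localhost:8080/?field=value."""
    HTTPServer(("localhost", port), GetToPostHandler).serve_forever()


request = to_post_request("/?customer=acme&year=2008")
print(request.get_method(), request.data)
```

Calling `serve()` starts the adapter. The spider is then seeded with GET URLs that point at the adapter instead of at the original form.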
Finally you could look for a vendor or technology that handles this. Our company, New Idea Engineering, has a spider toolkit called XPump, and one of our partners, DeepWeb Technologies, specializes in this sort of thing.

Type: Hidden CGI / Masked CGI / Mapped CGI

Recognizing it:
These URLs look normal, except they are possibly a bit longer. You may see a set of URLs like:
http://company.com/marketing/2005/01/release012345.html
http://company.com/marketing/2005/01/release012346.html
...
http://company.com/marketing/2008/04/release027788.html
http://company.com/marketing/2008/04/release027789.html
You might suspect that this is coming from some type of online publishing system, perhaps a more advanced Content Management System (CMS) that maps CGI arguments into "friendly" URLs, but you wouldn't know for sure. A programmer could also look at the HTTP headers that were sent back from the server and the Meta Data in the head section of the HTML, looking for various technical clues.
The good news is that it usually doesn't matter either way. Generally spiders will index this content.
Potential problems:
This type of CGI, like all other forms of CGI content, is still susceptible to infinite link spider traps. Spiders can usually be configured to address this, though identifying and resolving all occurrences can be time consuming on large installations.
Cache control overrides that are too aggressive can cause excess spider traffic, and other generic CGI issues discussed at the top of the sidebar may also apply.
Remedies:
General "spider trap" avoidance.

Type: Client Side Dynamic Content

Includes Java, Java Script / Ajax / JSON / DHTML, Flash, Active-X, etc.
Recognizing it:
There's no single method for spotting this, but it often coincides with security warnings from your browser about Java, Java Script, download and run warnings, Flash or other downloads, etc. It is also suspicious if the site only runs in Internet Explorer or another specific browser, or only runs on Windows.
For security reasons, most browsers let you turn off all active elements, Java Script, downloads, etc. This is also another handy way of quickly spotting sites that require client side technology which a spider will likely not handle. If you have an older test machine, you might install a browser and turn everything off. It's hard to operate in this mode on your main machine, since so many sites require some of these features, but it's a nice task to delegate to an older machine.
Another test is to bring up a web page on Windows, do an Edit / Select All and then an Edit / Copy, and then paste this into Notepad. This is a decent approximation of what a search engine spider and indexer will see when it looks at that page. If there's no text, you've likely got a problem. This technique also helps spot text that is rendered as graphics with no "alt tag".
Potential remedies:
Use the "text only" version of those pages, if available.
If no alternative is available, talk to the creator. Often, if a site is not spider accessible, it's probably also not accessible to visually impaired users with screen readers, nor your mobile users, nor public search portal spiders. See the sidebar "Spider Un-Friendly Content - Worse than you think!"
Another workaround is to get the data into the search engine by some other means, such as via JDBC (Java database connector).
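
The copy-and-paste-into-Notepad test described above can also be approximated in code. The sketch below uses Python's standard `html.parser` to collect the text a simple indexer would see, ignoring anything inside script and style tags. The class name and the two sample pages are made up, and real spiders will differ in what they keep.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text outside <script> and <style>, roughly what an indexer sees."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


static_page = "<html><body><h1>Press Releases</h1><p>Q1 results posted.</p></body></html>"
script_page = "<html><body><div id='app'></div><script>render('Q1 results');</script></body></html>"
print(repr(visible_text(static_page)))
print(repr(visible_text(script_page)))
```

The second page, which builds its content entirely in the browser, yields an empty string: there is nothing there for a spider to index.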

RDBMS: Database records aren't just "web pages without links"

In the typical Internet spider, everything is a web page and has a corresponding URL. In classic search engine terms, we'd say that the web pages are the "documents" (the unit of retrieval). When search engines are sent after a database, it's common to map database records from a main table into search engine documents. So, for example, if a database has a catalog of 100,000 products, the search engine will wind up with 100,000 product documents.

A number of technical details differ enough from a normal web crawl to deserve attention:

Different ways of accessing data (connector, web, api, etc)
Defining Virtual Documents
URL access to matching documents – what will users click on to view a match?
Latency – how updates are recognized – Push, Pull, or Poll
Relevancy – where hyperlinks are likely not available
Database joins – typically done at index time vs. search time
Transactions and advanced database features, including record level locking, synchronous updates, commit/rollback, complete recordset iteration, joins, Non-text datatypes and schemas, Arbitrary Sorting and "group by", Statistics functions, 8-bit, 16-bit and Unicode character sets, and QOS/failover.
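
As a rough illustration of mapping records to virtual documents with the join done at index time, here is a sketch that uses Python's built-in `sqlite3` module as a stand-in for a production database. The tables, columns, and URL pattern are all invented for the example.

```python
import sqlite3

# In-memory stand-in for a product database; the table and column names are made up.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, description TEXT);
    CREATE TABLE categories (product_id INTEGER, category TEXT);
    INSERT INTO products VALUES (1, 'Trail Tent', 'Two person backpacking tent');
    INSERT INTO products VALUES (2, 'Camp Stove', 'Single burner propane stove');
    INSERT INTO categories VALUES (1, 'Camping'), (1, 'Shelter'), (2, 'Camping');
""")


def virtual_documents(connection):
    """Yield one search document per row of the main table, joined at index time."""
    rows = connection.execute(
        "SELECT id, name, description FROM products ORDER BY id"
    ).fetchall()
    for product_id, name, description in rows:
        categories = [
            row[0]
            for row in connection.execute(
                "SELECT category FROM categories WHERE product_id = ? ORDER BY category",
                (product_id,),
            )
        ]
        yield {
            # Users need something to click on; this URL pattern is hypothetical.
            "url": f"http://company.com/catalog/product?id={product_id}",
            "title": name,
            "body": description,
            "categories": categories,
        }


docs = list(virtual_documents(db))
print(len(docs), docs[0]["categories"])
```

Each row of the main table becomes one document, the related rows are folded in before indexing, and every document carries a URL so that a search hit has somewhere to go.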

Please refer to the full text of the white paper (when available) for a complete discussion, or drop us an email.

Your Data – Document Level

The documents being indexed inside corporations and agency firewalls can also deviate from the run-of-the-mill web pages and PDFs that public spiders typically handle. One assumption is that one web page or file equals one document, where a "document" in search is the basic unit of retrieval. While this is often true, particular datasets and business logic can run counter to that assumption, and if not compensated for, can compromise search results accuracy and relevance.

Compound Documents

In some cases, one large file may contain a number of smaller documents. A file that contains multiple logical documents but has been incorrectly indexed as a single document will tend to match more searches than it should, so when users open that document, they may be confused as to why it matched. Compound documents may also be ranked incorrectly.

Some examples of Compound Documents include:

FAQs with multiple questions
ZIP and TAR files, or PST email archives
Email messages with attachments
Compound CMS resources
Giant PDF files
PowerPoint slides with logical boundaries (a rare requirement)
Documents with important section or page boundaries

The last few items are for rather specialized applications and are often not required.
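
Taking the FAQ case from the list above, splitting one file into several logical documents can be as simple as the sketch below. It assumes a plain "Q:" / "A:" layout and an invented `file#number` id scheme; real FAQ files will need their own parsing rules.

```python
import re

FAQ_TEXT = """\
Q: How do I reset my password?
A: Use the reset link on the login page.

Q: Where are the release notes?
A: They are posted on the support portal.
"""


def split_faq(file_name, text):
    """Split one FAQ file into one logical document per question."""
    entries = re.split(r"\n(?=Q:)", text.strip())
    docs = []
    for number, entry in enumerate(entries, start=1):
        question, _, answer = entry.partition("\nA:")
        docs.append({
            "id": f"{file_name}#{number}",       # sub-document id, made-up scheme
            "title": re.sub(r"^Q:\s*", "", question).strip(),
            "body": answer.strip(),
        })
    return docs


faq_docs = split_faq("faq.txt", FAQ_TEXT)
for doc in faq_docs:
    print(doc["id"], "|", doc["title"])
```

A search for "password" now matches only the first entry rather than the whole file, so users no longer have to guess why a long FAQ page matched their query.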

Composite Documents and Meta Data

Sometimes the data that constitutes a single logical document is actually stored in several places, and must be joined together before being sent to the search engine for indexing. Sometimes different parts of the document exist on different systems. An example of this could include a digital media and assets tracking system, where the main drawing or clip is stored on a media server, but where the description and specifications are stored in a database.

A more common case of Composite Documents is where the main document resides in one place, but meta data for that document must be drawn in from other sources. Meta data can come from within the document itself or from an external source, or it can be parsed, extracted, or inferred from the document text or another available field. Meta data from all these sources may require further processing to validate and normalize it into a consistent format or to use a consistent name for each item across all data sources. Some specialized applications and data sources can have well over a thousand different attributes, all of which need to be properly indexed, which is in stark contrast to generic and Internet search.

An example of this type of system would be a set of product documentation stored on a file server folder hierarchy, but where additional information about document versions and releases is stored in Oracle. At index time, the files from the file share must be combined with the results of an Oracle query before being submitted to the search engine indexer.
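
The combining step might be sketched like this. Plain dictionaries stand in for the file share and the database query results, and the field names and the `FIELD_NAMES` mapping are invented to show the normalization idea; none of this reflects a particular product.

```python
# Made-up data standing in for a file share and a release-tracking database query.
file_share = {
    "manuals/install_guide.txt": "Installation steps for the server product.",
}
release_db_rows = [
    {"PATH": "manuals/install_guide.txt", "Doc Version": "3.2 ", "RELEASE_DT": "2008-04-01"},
]

# Hypothetical mapping from source-specific field names to one consistent name.
FIELD_NAMES = {"Doc Version": "version", "RELEASE_DT": "release_date"}


def composite_documents(files, rows):
    """Join file content with its database meta data before indexing."""
    meta_by_path = {row["PATH"]: row for row in rows}
    for path, text in files.items():
        doc = {"path": path, "body": text}
        for source_name, value in meta_by_path.get(path, {}).items():
            if source_name in FIELD_NAMES:
                doc[FIELD_NAMES[source_name]] = value.strip()
        yield doc


composite_docs = list(composite_documents(file_share, release_db_rows))
print(composite_docs[0])
```

The indexer then receives one document per file, with the meta data already attached under consistent names.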

Even More Document Level Processing

In addition to the work that can be done on documents as they enter the system to gather and normalize meta data and entities, there are some higher level functions that can be performed.

Through an extension of entity extraction, rules can be created that cover the entire scope of a web page or document. You can make assumptions about data based on its location or form, and extract other data. High-level rules also help identify which parts of the text are the central content, versus ancillary page navigation links or copyright notices. This helps fix the problem where certain search terms match every page on a site, because the term is used in a site wide navigator. Yes, smart engines automatically devalue a term that appears on all pages in relevancy calculations, but if the user entered that term as a single word query, the engine has little choice but to return all documents, and it will have only minor variations in the number of times the word appears on each page to do its rankings. Explicitly trimming off the common, non-central parts of pages before sending them to the indexer is more precise, but of course developing and testing such rules for a broad range of pages can be time-consuming.
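
As a crude stand-in for hand-written trimming rules, the sketch below simply drops any line that appears on nearly every page. This is a frequency shortcut we've substituted for illustration, not the rule-based approach described above; the function name, threshold, and sample pages are all assumptions.

```python
from collections import Counter


def strip_common_lines(pages, threshold=0.9):
    """Remove lines that occur on at least `threshold` of all pages."""
    line_counts = Counter()
    for text in pages.values():
        line_counts.update(set(text.splitlines()))
    cutoff = threshold * len(pages)
    common = {line for line, count in line_counts.items() if count >= cutoff}
    return {
        url: "\n".join(line for line in text.splitlines() if line not in common)
        for url, text in pages.items()
    }


# Made-up pages that share a navigation bar and a copyright notice.
pages = {
    "/a": "Home | Products | Support\nFerrari road test\nCopyright Example Corp",
    "/b": "Home | Products | Support\nSupport hours and contacts\nCopyright Example Corp",
    "/c": "Home | Products | Support\nProducts overview\nCopyright Example Corp",
}
trimmed = strip_common_lines(pages)
print(trimmed["/b"])
```

After trimming, a query for "Support" matches only the page that is actually about support, rather than every page carrying the navigation bar.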

Pairing Data with Navigators

One other consideration about your data, at both the individual document level and at the overall organization of it (the "corpus" level), is that of pairing your data with the appropriate results list navigator technology. Vendors offer a staggering selection of clickable, results list doohickeys for users to drill down their searches on, technologies with names like "parametric search", "faceted navigation", "tag clouds", "taxonomies", "automatic clustering"… the list goes on and on. To the casual search user these all look about the same, but the implementation of these various techniques is radically different on the back end. The reason you should care is because different techniques are best suited for certain types of data. Some techniques work better than others, but may require more structured data up front or document preprocessing at index time.
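
To show why some navigators need structured data up front, here is a tiny sketch of the counting behind a parametric or faceted navigator. The result set and field names are made up, and the technique only works because each document already carries clean "make" and "year" fields.

```python
from collections import Counter

# Made-up result set; faceted navigation assumes each document already
# carries structured fields such as "make" and "year".
results = [
    {"title": "F430 road test", "make": "Ferrari", "year": 2005},
    {"title": "360 Modena review", "make": "Ferrari", "year": 2004},
    {"title": "Gallardo first drive", "make": "Lamborghini", "year": 2005},
]


def facet_counts(docs, field):
    """Count how many matching documents fall under each value of a field."""
    return Counter(doc[field] for doc in docs if field in doc)


print(facet_counts(results, "make").most_common())
print(facet_counts(results, "year").most_common())
```

If the "make" field were missing or inconsistently spelled, the navigator would show misleading counts, which is why the data and the navigator technology need to be matched.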

Coming Up In Our Next Installment…

The last installment of this 3-part series will review a number of additional technological points and will discuss some of the business implications of the differences between Enterprise and Internet search.

Read Part 3 now.

New Idea Engineering always welcomes your questions and input. Feel free to contact us at info@ideaeng.com
