20+ Differences Between Internet vs. Enterprise Search - And Why You Should Care (Part 2)

To read Part 1 of this series, look at the February - March issue.
Part 2: Technical:
Content is not always static or "quiescent"

Search engines assume that, for the most part, once a document or web page is created it will rarely if ever be modified. Highly specialized applications, however, may change documents or database records frequently. In many search engines, changing even one small attribute of a document causes the entire document to be re-filtered and reprocessed.

Examples of rapidly changing content might include:
All of these systems break the mold for static content search engines.
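One common way to limit reprocessing of frequently changing records is to fingerprint each record and resubmit only those whose fingerprint has changed. The following is a minimal sketch of that idea, not a description of how any particular engine works; the `inventory` records, field names, and function names are hypothetical.

```python
import hashlib
import json


def fingerprint(record: dict) -> str:
    """Return a stable hash of a record's fields (key order does not matter)."""
    canonical = json.dumps(record, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def records_to_reindex(records: dict, seen: dict) -> list:
    """Return ids whose content changed since the last pass, updating `seen`."""
    changed = []
    for doc_id, record in records.items():
        digest = fingerprint(record)
        if seen.get(doc_id) != digest:
            changed.append(doc_id)
            seen[doc_id] = digest
    return changed


# Hypothetical, rapidly changing records (illustration only).
seen_hashes = {}
inventory = {
    "sku-1": {"name": "Widget", "stock": 12},
    "sku-2": {"name": "Gadget", "stock": 3},
}
first_pass = records_to_reindex(inventory, seen_hashes)   # everything is new
inventory["sku-2"]["stock"] = 2                           # one attribute changes
second_pass = records_to_reindex(inventory, seen_hashes)  # only sku-2 is flagged
```

Note that the flagged record is still reprocessed as a whole; the sketch only avoids touching records that did not change.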
Server Side Dynamic Content

Dynamic content presents some specific problems for both Internet and Enterprise spiders. For example, in our previous installment we talked about detecting duplicate and near-duplicate pages, and dynamically generated pages are often associated with such problems. But a more basic problem is how the URLs for the dynamic pages are exposed, whether the spider will follow them, and whether or not the content requires a POST. If you have dynamic content, you should not blindly assume that it will be handled.

GET vs. POST

Have you ever noticed the weird URLs at the top of your browser? When you see a long URL, especially one that contains a question mark (?), ampersands (&), and equal signs (=), you are dealing with pages that are being dynamically generated by a CGI program of some sort, and that program is using a particular style of communication called an "HTTP GET". This GET style of CGI communication can be used either from traditional underlined hyperlinks or from an on-screen form. This type of dynamic content is usually ignored by spiders. A few specialized vendors do handle this type of situation, often referred to as accessing "deep" content, or "deep web" spidering.

Exposed CGI vs. Hidden CGI

We may know that we are dealing with CGI scripts either because we see special punctuation characters in a long URL or because there is an on-screen form to fill out and submit. But it's also possible for CGI content to be mapped into normal URL space, so that dynamically generated pages have URLs that look like ordinary static paths. The good news is that spiders will generally follow these links, and this may even be one of the primary motivations for this type of URL. But there's still one other potential problem. Imagine that a calendar application always presents links to the next and previous days, the next and previous months, and the next and previous years, even if there are no events scheduled yet.
I once helped adjust a spider that had indexed every empty month, week, day and hour all the way through the year 2039 before it gave up; in this case it had used up its licensed page count. So the problem is that masked CGIs can still create what is called a "spider trap": a set of seemingly valid links that is essentially useless and infinite, so a spider that strays into the trap will never run out of links to follow. Smart spiders know how to avoid most of these traps automatically, and the spider administrator can train other spiders to avoid some specific traps. If your site uses masked URLs, there may be some trial and error as you adjust the rules, and it may be an iterative process as you rescue the spider from one trap, only to have it fall into another. It gets a bit more complicated if your boss wants "current" calendar events spidered, but not empty schedules from 2039.
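The "current events, but not 2039" rule can be sketched as a URL filter. This is only an illustration under stated assumptions: the `/calendar/YYYY/MM/DD` URL layout, the `intranet.example` host, the 90-day window, and the `should_crawl` function are all hypothetical, and real spiders expose such rules through their own configuration rather than through code like this.

```python
import re
from datetime import date
from urllib.parse import urlparse

# Hypothetical masked-CGI calendar URLs of the form /calendar/YYYY/MM/DD
CALENDAR_PATH = re.compile(r"^/calendar/(\d{4})/(\d{1,2})/(\d{1,2})/?$")


def should_crawl(url: str, today: date, window_days: int = 90) -> bool:
    """Allow calendar pages only within `window_days` of `today`."""
    match = CALENDAR_PATH.match(urlparse(url).path)
    if match is None:
        return True  # not a calendar page; leave it to other rules
    year, month, day = (int(part) for part in match.groups())
    try:
        page_date = date(year, month, day)
    except ValueError:
        return False  # impossible date such as month 13
    return abs((page_date - today).days) <= window_days


today = date(2010, 6, 1)
near = should_crawl("http://intranet.example/calendar/2010/06/15", today)
far = should_crawl("http://intranet.example/calendar/2039/01/01", today)
other = should_crawl("http://intranet.example/products/widgets", today)
```

Here `near` and `other` are allowed while `far` is rejected. A date window keeps the crawl bounded no matter how many "next" links the calendar generates, which a simple depth limit would not guarantee for pages reachable by several routes.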
Server Side vs. Client Side Dynamic Content

We've been careful to keep mentioning the web server in all of this. With the advent of Java, JavaScript / AJAX / JSON, ActiveX, Flash and an alphabet soup of other browser technologies, client side scripting is also very popular these days. All of these technologies run in the browser on the local computer (or phone!), vs. the traditional CGI that runs back on the web site's server. The problem is that a majority of spiders do not handle this type of content at all, or offer very limited support for it. We have seen a few that offer limited JavaScript and Flash support by using specialized parsers or by embedding a virtual machine inside the spider, but even then the penetration through multiple levels of navigation is severely limited.
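To see why client side content is invisible to a simple spider, consider a link extractor that only parses the HTML as delivered by the server. The sketch below uses Python's standard `html.parser`; the sample page and the `LinkCollector` class are made up for illustration and do not represent any vendor's spider.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags present in the raw HTML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical page: one static link, plus one link a browser would add
# by running the script. A plain parser never executes the script.
PAGE = """
<html><body>
  <a href="/about.html">About</a>
  <div id="menu"></div>
  <script>
    var link = document.createElement("a");
    link.href = "/catalog.html";
    link.textContent = "Catalog";
    document.getElementById("menu").appendChild(link);
  </script>
</body></html>
"""

collector = LinkCollector()
collector.feed(PAGE)
static_links = collector.links
```

Only `/about.html` ends up in `static_links`. The catalog link exists only after a browser runs the script, so everything behind it is unreachable to a spider that works this way.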
RDBMS: Database records aren't just "web pages without links"

In the typical Internet spider, everything is a web page and has a corresponding URL. In classic search engine terms, we'd say that the web pages are the "documents" (the unit of retrieval). When search engines are sent after a database, it's common to map database records from a main table into search engine documents. So, for example, if a database has a catalog of 100,000 products, the search engine will wind up with 100,000 product documents. A number of technical details are different enough from a normal web crawler to need separate treatment.
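As a rough illustration of the row-to-document mapping described above, the sketch below turns each row of a main table into one search document. It uses an in-memory SQLite database; the `products` table, its columns, and the document field names (`doc_id`, `title`, `body`) are assumptions made up for this example, not any engine's required schema.

```python
import sqlite3

# Hypothetical product catalog held in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # allows access to columns by name
conn.execute(
    "CREATE TABLE products ("
    "id INTEGER PRIMARY KEY, name TEXT, description TEXT, price REAL)"
)
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?, ?)",
    [
        (1, "Widget", "A small general-purpose widget.", 9.99),
        (2, "Gadget", "A gadget for specialized tasks.", 24.50),
    ],
)


def rows_to_documents(connection):
    """Map each row of the main table to one search engine document."""
    query = "SELECT id, name, description, price FROM products ORDER BY id"
    documents = []
    for row in connection.execute(query):
        documents.append(
            {
                "doc_id": f"product-{row['id']}",  # stands in for a URL
                "title": row["name"],
                "body": row["description"],
                "price": row["price"],
            }
        )
    return documents


documents = rows_to_documents(conn)
```

Because a database record has no URL, the sketch synthesizes a `doc_id` from the primary key; how a result is then displayed or opened is one of the details that differs from a web crawl.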
Please refer to the full text of the white paper (when available) for a complete discussion, or drop us an email.

Your Data – Document Level

The documents being indexed inside corporate and agency firewalls can also deviate from the run-of-the-mill web pages and PDFs that public spiders typically handle. One assumption is that one web page or file equals one document, where a "document" in search is the basic unit of retrieval. While this is often true, particular datasets and business logic can run counter to that assumption and, if not compensated for, can compromise the accuracy and relevance of search results.

Compound Documents

In some cases, one large file may contain a number of smaller documents. A file that contains multiple logical documents but has been incorrectly indexed as a single document will tend to match more searches than it should, so when users open that document, they may be confused as to why it matched. Compound documents may also be ranked incorrectly. Some examples of Compound Documents include:
The last few items are for rather specialized applications and are often not required.

Composite Documents and Meta Data

Sometimes the data that constitutes a single logical document is actually stored in several places and must be joined together before being sent to the search engine for indexing. Sometimes different parts of the document exist on different systems. An example of this could be a digital media and asset tracking system, where the main drawing or clip is stored on a media server but the description and specifications are stored in a database.

A more common case of Composite Documents is where the main document resides in one place, but meta data for that document must be drawn in from other sources. Meta data can come from within the document itself, from an external source, or can be parsed, extracted, or inferred from the document text or another available field. Meta data from all these sources may require further processing to validate and normalize it into a consistent format, or to use a consistent name for each item across all data sources. Some specialized applications and data sources can have well over a thousand different attributes, all of which need to be properly indexed, which is in stark contrast to generic and Internet search. An example of this type of system would be a set of product documentation stored in a folder hierarchy on a file server, where additional information about document versions and releases is stored in Oracle. At index time, the files from the file share must be combined with the results of an Oracle query before being submitted to the search engine indexer.

Even More Document Level Processing

In addition to the work that can be done on documents as they enter the system to gather and normalize meta data and entities, there are some higher level functions that can be performed.
Through an extension of entity extraction, rules can be created that cover the entire scope of a web page or document. You can make assumptions about data based on its location or form, and extract other data. High-level rules also help identify which parts of the text are the central content, versus ancillary page navigation links or copyright notices. This helps fix the problem where certain search terms match every page on a site because the term is used in a site-wide navigator. Yes, smart engines automatically devalue a term that appears on all pages in relevancy calculations, but if the user entered that term as a single-word query, the engine has little choice but to return all documents, and it will have only minor variations in the number of times the word appears on each page to do its rankings. Explicitly trimming off the common, non-central parts of pages before sending them to the indexer is more precise, but of course developing and testing such rules for a broad range of pages can be time-consuming.

Pairing Data with Navigators

One other consideration about your data, at both the individual document level and the level of its overall organization (the "corpus" level), is that of pairing your data with the appropriate results list navigator technology. Vendors offer a staggering selection of clickable results list doohickeys that users can drill down into their searches with, technologies with names like "parametric search", "faceted navigation", "tag clouds", "taxonomies", "automatic clustering"… the list goes on and on. To the casual search user these all look about the same, but the implementation of these various techniques is radically different on the back end. The reason you should care is that different techniques are best suited for certain types of data. Some techniques work better than others, but may require more structured data up front or document preprocessing at index time.
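To illustrate why some navigators need structured data up front, here is a minimal sketch of facet counting over a results list. It is a toy, not how any vendor implements faceted navigation; the `results` records and the `brand` and `color` fields are hypothetical.

```python
from collections import Counter

# Hypothetical results list; the third document lacks a "color" field.
results = [
    {"title": "Red widget", "brand": "Acme", "color": "red"},
    {"title": "Blue widget", "brand": "Acme", "color": "blue"},
    {"title": "Mystery gadget", "brand": "Globex"},
]


def facet_counts(docs, field):
    """Count how many documents carry each value of a structured field."""
    return Counter(doc[field] for doc in docs if field in doc)


brand_facet = facet_counts(results, "brand")
color_facet = facet_counts(results, "color")
```

The brand navigator covers all three documents, but the color navigator silently omits the gadget because that field was never populated. A navigator built this way is only as complete as the meta data captured at index time.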
Coming Up In Our Next Installment…

The last installment of this 3-part series will review a number of additional technological points and will discuss some of the business implications of the differences between Enterprise and Internet search. New Idea Engineering always welcomes your questions and input. Feel free to contact us at info@ideaeng.com