20+ Differences Between Internet vs. Enterprise Search - And Why You Should Care (Part 1)
Last Updated Feb 2009
By Mark Bennett, New Idea Engineering, Inc. - Volume 5 Number 2 - February/March 2008
The perennial question of what separates Enterprise Search from the more familiar search engines that power the public Internet recently came up again. Dr. Search was planning to do a blog entry but the list mushroomed, and we now present the first in a three part series on the dozens of things that make Enterprise Search surprisingly difficult, and that sometimes flummox the engines that were created to power the public web.
As we hinted above, the public Internet was the inspiration and proving ground for a majority of the commercial and open source search engines out there. Solving that technical problem, indexing the Internet, has influenced both the architecture and implementation, as engineers have made hundreds of assumptions about data and usage patterns – assumptions that do not always apply behind the firewalls of corporations and agencies.
When vendors talk about their products, features and patents, they are usually talking about technology that was not specifically designed for the enterprise. This isn't just academic theory - as you'll see, these assumptions can actually break enterprise search, if not adjusted properly.
A Few Logistics
We've divided our list into "technical issues: user facing", "technical issues: back end data and indexing", and then "business and strategic" differences; we're doing the "easier" technical stuff in the first two parts, with the strategic and biz stuff as the finale. There's a bit of overlap, as some issues can be viewed from both a business and technical perspective, and data/indexing issues can affect what the user sees. Of course not every item applies to every project and vendor; "your mileage may vary". And heck, you may already know some of these, but we're trying to be quite comprehensive in scope, though perhaps a bit brief on some items. If anything catches your eye that you'd like more details on, please drop us a note. And we've decided to let you do your own "numbering"; this isn't late night TV after all.
Defining "Enterprise" for this article
To be clear, when we say "enterprise" search, we are referring to both the search engines that power private Intranets and Extranets, and to a lesser extent, the engines that companies have purchased to power their commerce and customer facing web sites. Broadly, "enterprise" search could be thought of as "all search engines EXCEPT the public Yahoo, Google and MSN", since you DO own and control the search engine that powers your public web site or online store. And again, your usage patterns and priorities are likely different from those of the Internet portals.
With all that said, let's get started!
High Level Internet / Intranet Mismatches
These are some differences viewed from the broadest 10,000 foot level. We'll revisit some in more detail later.
The Enterprise is not just "a small Internet"
Imagine if you powered the Internet, and had a brand name that rivaled Coca-Cola. And then, imagine if you took all of that wonderful technological goodness with the wonderful brand name, and stuffed it into a brightly colored rack-mounted box. You would assume that, if you could handle the Internet, then of course you could handle a relatively puny private network - it just makes sense! You'd believe it, and so would your customers. To be fair, this was Google a few years back; their v5 appliance has clearly evolved beyond this simple model.
These seem like perfectly sane and compelling arguments, and this model has worked at some companies. If your Intranet has a few dozen (to a few thousand) company portals and departmental web sites, which mostly contain HTML and PDF documents, this would possibly work for you.
Or, suppose you had a portal that powered all of the Internet back in the 1990s. Slap that software on a CD, give it a nice Web based admin GUI, and ship it! This was actually the start of several well known search vendors. These products have also been iterated to add on enterprise functionality.
Ultraseek had been a great choice for more generic enterprise environments, and included some customization. Lately the Google Appliance is filling that segment, and can scale to reasonable sizes.
In contrast, some engines were not created for the Internet, but were always targeted at more specific business applications. As an example, DieselPoint was created to serve complex parts databases from its very beginning. It can also spider and search HTML and other document formats, but that was not its genesis.
Not just WPIWPO and "Search Dial Tone"
We define generic search engines as "Web Pages In – Web Pages Out" (WPIWPO). Basically, a spider crawls and indexes generic web content, and then the users run their searches from generic web browsers (such as Internet Explorer, Firefox, Safari, etc.). In the enterprise, however, content comes from many other sources, such as Content Management Systems, databases and archival storage appliances, etc. And users are not always running searches from a web browser – more on that later.
When you have all the pages indexed and basic search up and running, you have achieved "Search Dial-Tone". Nothing fancy, but basic search functionality is online.
Documents vs. Data
Modern search engines employ "fulltext" search, looking for specific search terms in relatively unstructured text. The unstructured nature of the data is the key; it was assumed that most content would be composed of paragraphs of text, with very little formal structure, versus the more traditional fields in a database, with their more rigid INT, DATE and CHAR datatypes. About the only assumption made about a document's "structure" was that it would probably have a Title of some sort. References to specific numbers, colors or geographic locations may be blindly treated the same as every other word in the document. (See also Entity Extraction.) Hopefully, if a term appears in the title, it will be given a bit more weight, but that's about it in the basic world of documents.
Fulltext searches are also much more free form. Think about how much easier it is to type a search into Google, versus creating an old-school SQL SELECT statement.
But Enterprises DO have data, lots of it, and it is often structured. And yes, they have millions of textual documents too. Enterprise search is often called upon to search across both textual documents and database records, and often as part of the same user's search. Most modern engines support this type of hybrid search, and in many cases fields can be used to filter out extraneous matches and focus in on a particular set of results.
Some companies have content with both textual content and hundreds of "fields". For example, imagine a parts database for the airline industry. There would be descriptions of the parts, plus lots of meta data concerning part size, manufacturer, materials, applicable aircraft model, inventory levels, etc. Note that, in the document-centric world, fields are referred to as "meta data" or "attributes". A mechanic may want to search for "landing gear brackets" for the Boeing 747, made out of titanium, and that are less than 5 years old. This calls for a hybrid search, and possibly faceted navigation. (More on that later.)
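The mechanic's request can be sketched as a fulltext match followed by field filters. This is only an illustration: the part records, the field names and the `hybrid_search` function are all invented here, and a real engine would use its own schema, index and query API.

```python
from datetime import date

# Hypothetical part records: a free-text description plus structured fields.
PARTS = [
    {"description": "Landing gear bracket, forward strut", "material": "titanium",
     "aircraft": "747", "made": date(2006, 3, 1)},
    {"description": "Landing gear bracket, aft", "material": "aluminum",
     "aircraft": "747", "made": date(2007, 5, 9)},
    {"description": "Cabin door hinge", "material": "titanium",
     "aircraft": "747", "made": date(2005, 1, 15)},
    {"description": "Landing gear bracket, forward strut", "material": "titanium",
     "aircraft": "747", "made": date(1998, 7, 4)},
]

def hybrid_search(parts, text, today, max_age_years=None, **fields):
    """Fulltext match on the description, then filter on structured fields."""
    terms = text.lower().split()
    hits = []
    for part in parts:
        words = part["description"].lower().replace(",", " ").split()
        # Naive plural handling so that "brackets" matches "bracket".
        if not all(t in words or t.rstrip("s") in words for t in terms):
            continue
        # Every requested field must match exactly.
        if any(part.get(name) != value for name, value in fields.items()):
            continue
        if max_age_years is not None:
            age_days = (today - part["made"]).days
            if age_days >= max_age_years * 365:
                continue
        hits.append(part)
    return hits

results = hybrid_search(PARTS, "landing gear brackets", today=date(2008, 2, 1),
                        max_age_years=5, material="titanium", aircraft="747")
for part in results:
    print(part["description"], part["made"])
```

The text part narrows the set to landing gear brackets; the fields then discard the aluminum bracket and the one that is too old.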
Technical Differences between Internet and Enterprise Search
Here in Part 1, we'll focus on the "easier" technical differences. We've broken them into "User Experience" at the high and low level, and a section on data and spidering issues.
User Experience: High Level
As we've said, the way users interact with a corporate search engine may be quite different than the casual use of Yahoo and Google on the web.
Single Shot Relevancy – Over-Reliance in Both Markets
Long time readers will recall that we've warned before that an over-reliance on "relevance" can betray a flawed usage assumption: that ANY engine can find the "correct" answer for the typical 1 or 2 word search. On the Internet, when I type "bush", how could the system know with certainty whether I'm referring to President George Bush, junior or senior, or the Australian outback, or one of the many beers or musical groups with "bush" in their name (with or without that exact spelling), or perhaps a shrub to plant in the yard? In the enterprise, terms like "resource" or "schedule" are similarly ambiguous.
Ironically, this first issue is something public and private engines often have in common. Which leads us to…
Enterprises Often Need Faceted Navigation
Data in the enterprise often has more meta data, and can therefore allow users to drill down into search results; some vendors call this parametric search. We've talked about these terms in previous articles.
If the docs lack meta data, but are at least organized in some type of overall logical order, then a taxonomy might work.
If the content is completely unstructured, a sometimes-acceptable alternative is to use unsupervised clustering, but we usually view this as a workaround – if you have good meta data (or database fields), or at least some type of overall rule-based organization, faceted / parametric navigators or taxonomies will usually give more pleasing results.
Google Relevancy Trick "Broken" Behind the Firewall
This isn't to say that Google's code is buggy, not at all!
But Google's main improvement years back, on their public Internet portal, was to consider all the links to each web page in their ranking calculations – pages that were referred to by more sites were presumed to be more authoritative, and were therefore given a higher score.
This is sometimes called "organic linking", in that nobody controls the public Internet, and people frequently link from their web site to other sites based on their personal opinions. If you plotted all of these links on a graph, it might look a bit chaotic, like thousands of little roots or tentacles, but it really did give a decent approximation of human-determined "goodness" for each page. We're seeing a resurgence of this with blogs linking to other blogs.
But within corporations and agencies you don't have millions of users randomly creating links between pages, based solely on their personal preferences. Instead, the links between web pages inside a company are more orderly, and tend to approximate a logical org chart of sorts. It may seem more orderly, but it actually encodes less human-derived wisdom.
Thus, when the Google Appliance indexes data on a private network, it doesn't have the same advantage that it does on the public Internet, and therefore performs more on a par with other commercial search engines. It's not "bad", it's just not amazingly better than other guys.
Of course Google realizes this, and we suspect they have been trying to tweak their algorithms to compensate. And of course there are SOME links on company Intranets, and some employees do cross link pages. But in search engine industry insider speak, they are put back on the same "TF/IDF" playing field as everybody else. (See http://en.wikipedia.org/wiki/TFIDF)
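For readers who haven't met TF/IDF, here is a minimal sketch of one common textbook variant: term frequency within the document, multiplied by the log of the inverse document frequency across the corpus. Commercial engines use their own refinements, and the four tiny "documents" are invented for the example.

```python
import math
from collections import Counter

def tf_idf(term, doc_tokens, corpus):
    """Score one term for one document using a basic TF-IDF variant."""
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    docs_with_term = sum(1 for d in corpus if term in d)
    if docs_with_term == 0:
        return 0.0
    idf = math.log(len(corpus) / docs_with_term)
    return tf * idf

corpus = [
    "project schedule for the release".split(),
    "holiday schedule and office hours".split(),
    "titanium bracket inventory schedule".split(),
    "titanium bracket supplier list".split(),
]

# "schedule" appears in most documents, "inventory" in only one.
common = tf_idf("schedule", corpus[2], corpus)
rare = tf_idf("inventory", corpus[2], corpus)
print(f"schedule: {common:.3f}  inventory: {rare:.3f}")
```

A word that shows up in most documents earns a low score, which is part of why ambiguous enterprise terms like "schedule" are hard to rank on text statistics alone.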
Web 2.0 Techniques Not Directly Applicable
Social networking, "wisdom of the herd", blogs, the "user" as "man of the year". If you read any web or search related publications, you're very familiar with all these terms.
These trends have been used to enhance search on the World Wide Web, social networking sites, and leading ecommerce sites. Google's link ranking could even be considered an early form of this trend.
But these techniques have not worked as well on private networks. There are a bunch of technical reasons which we will detail in a future article. But here are a few examples:
One example is tagging. This is where users add descriptive keywords to documents, photos or video. The search engine looks at these tags for future searches.
Of course, in the enterprise there are typically more documents than photos, and documents already have text, so this doesn't have as big of an impact.
But more importantly, on a public web site with millions of visitors, even if only 1% of them add tags, that's tens of thousands of users adding tags on a regular basis. In a company with 1,000 employees, that same 1% equates to only about 10 employees adding tags. They will certainly tag some important documents, but whether or not they've flagged the best document to match the search you're running at the moment is far less likely.
So directly applying tagging "as is" in the enterprise isn't as likely to make a big impact.
However, this method could be re-designed to be more effective. On public sites like Flickr or YouTube, users make a specific decision to add tags to a photo, we call this Explicit Tagging.
In the enterprise, when users type in search terms and then click on a document in the results list, they are still telling you something about the document they selected. They typed in some terms, and at least the title of the document led them to believe that it was related. There is a mild implication that this document is related to the search terms. We call this type of automatic tagging, based only on results list clicks, Implicit Tagging.
It's not a perfect replacement. When they opened the document, they may have realized that it was NOT what they wanted, and then quickly closed it. They might open up 5 or 10 documents as the result of a search, and all of those documents would be implicitly tagged with the same search terms, so this dilutes the value of the association.
You could imagine systems that try to work around this. Perhaps there's some way to time how long the user has the document open, before they close it, the implication being that the longer they had it open, the more interesting it was. Of course if they left it open because they got a phone call, then that too is a false assumption. Or perhaps you make a stronger association between the search terms and the first document they open, and give less and less weight to subsequent documents that are opened.
None of these implicit tags seem as good as the classic Web 2.0 explicit tags, where the user goes out of his way to tag a document, and gives some thought to which would be the best terms. But it's still likely better than no tags at all, if you're a fan of tagging in general.
We've even proposed to clients that they implement BOTH Explicit and Implicit Tagging, and give much higher weight to the classic explicit tags.
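A combined scheme along these lines might be sketched as follows. The three weights are assumptions picked purely for illustration, as are the function names and document IDs; the only ideas taken from the discussion above are that explicit tags count for much more, and that later clicks from the same search count for less.

```python
from collections import defaultdict

EXPLICIT_WEIGHT = 5.0   # assumed value: deliberate tags count much more
IMPLICIT_WEIGHT = 1.0   # assumed value: first document opened after a search
CLICK_DECAY = 0.5       # assumed value: each later click counts half as much

tag_scores = defaultdict(lambda: defaultdict(float))  # doc_id -> term -> score

def record_explicit_tag(doc_id, term):
    """A user deliberately tagged this document with this term."""
    tag_scores[doc_id][term.lower()] += EXPLICIT_WEIGHT

def record_search_clicks(query, clicked_doc_ids):
    """Implicitly tag each clicked document with the query terms.

    Documents are listed in the order they were opened; later clicks
    receive geometrically less weight.
    """
    terms = query.lower().split()
    for position, doc_id in enumerate(clicked_doc_ids):
        weight = IMPLICIT_WEIGHT * (CLICK_DECAY ** position)
        for term in terms:
            tag_scores[doc_id][term] += weight

record_explicit_tag("doc-17", "Schedule")
record_search_clicks("release schedule", ["doc-42", "doc-17", "doc-99"])
print(dict(tag_scores["doc-17"]))
```

Timing how long a document stays open could be added as another multiplier on the implicit weight, with the same caveat about phone calls.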
But the point here is not about tags, it's that a Web 2.0 technique needed to be rethought and modified for enterprise search.
Another example is the strong weight given to blogs by public search engines. Terms that suddenly appear across many blog entries are likely to be about a new subject or event, and therefore can be given preference.
But what if your enterprise didn't have any internal blogs? Some companies have even banned employee blogs. Instead of thinking about "blogs", perhaps your search engine could pay more attention to the text of newly submitted bugs and technical support requests. The idea is to give preference to the data sources that track what's new and what people are talking about, a class of applications called Lightweight Publishing (LWP).
Again, the point is not whether or not bug reports should be treated just like blog entries, but the fact that this is another search related Web 2.0 feature that may need to be re-thought before applying it to Enterprise Search 2.0.
Speaking of Blogs, an aspect of Web 2.0 that may be superior in the Enterprise 2.0 search space is that of assured identities. On a public web site, you don't always know who might be posting comments or spam, or perhaps accidentally spreading incorrect information. But in the enterprise, most employees have a validated login. An employee's contributions can be tracked, corrected, or perhaps even rewarded. Going further, the search engine can also look at the employee's job title and geographic location to automatically add additional context when ranking results.
It's All About Re-Tooling
Re-tooling Web 2.0 techniques to fit Enterprise Search is so important that we plan to do a dedicated article on the subject in an upcoming issue. Please email us if you have interest in this topic or would like to be interviewed.
Different Specialized Search Clients
We mentioned the abbreviation WPIWPO (Web Pages In – Web Pages Out) in the opening section. On the Web Pages OUT side of things, there are many types of search applications in the enterprise that do not involve a user sitting at a workstation using a generic web browser and search form.
Enterprise users have lots of uses for search results. Here are some examples of how a hypothetical enterprise might employ search in other ways:
The Call Tracking system that Tech Support uses to log incoming calls could call out to a search engine to look for a resolution, and present the results directly back into CallTrack. It could even copy the problem description, by default, into the search box. And if a tech support rep finds an appropriate FAQ or bug fix, they could "link it" to this call with a single mouse click.
A paralegal might use a custom application to search over many different legal databases, some internal, some on the web, and some involving premium subscription-only content. If pertinent items are found they could be linked into the case, or flagged for further review by a senior partner. And whenever a result is cited, a PDF snapshot could be automatically captured and logged, including case number, the paralegal's employee ID, the date and time, the data source of the cited record, and the search terms used. The paralegal could return to this case file at a later time, or collaborate with other employees on larger cases.
A company could leverage the natural language processing and entity extraction features of a search engine to track trends in overall search activity and newly added documents. Instead of performing specific searches, this BI system is used to spot shifts in behavior or interest, and to spot new statistically significant vocabulary terms.
The IT department might decide to use the search engine to run simple reports, instead of using the traditional relational database. This trend is referred to by some vendors as "database offloading".
And even just considering web interfaces…
The researchers in R&D might be very sophisticated search users and they might want a more detailed search form, with lots of fields, date ranges, zone searches, etc. The same corporate search engine can expose different search forms to different users. Casual users get a simpler form that looks more like the public Google home page, whereas more experienced or specialized users get a more complex form with lots of options.
Or suppose HR would like a web search form for their department's homepage, that just searches through HR policies and procedures. In the old days, they might have gone out and bought their own search engine. But now, they can have their own custom search form, living on their department's site, that still uses the central search engine's horsepower. And the form sets a filter, as a hidden field, that restricts the search to just the HR URLs. To a user, it looks like HR has their own search engine which just searches the HR policies, but on the back end it's just another web form using the main search engine.
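The hidden-field trick amounts to the form quietly adding one more parameter to whatever the user typed. A rough sketch, in which the endpoint URL and the `url_prefix` parameter name are made up (each engine has its own name for this kind of filter):

```python
from urllib.parse import urlencode, parse_qs, urlsplit

SEARCH_ENDPOINT = "https://search.example.com/query"  # hypothetical central engine

def build_search_url(user_terms, hidden_filters=None):
    """Combine what the user typed with filters the form supplies silently."""
    params = {"q": user_terms}
    params.update(hidden_filters or {})
    return SEARCH_ENDPOINT + "?" + urlencode(params)

# The HR form always adds a URL restriction that the user never sees.
hr_url = build_search_url("parental leave",
                          {"url_prefix": "https://intranet.example.com/hr/"})

# The general corporate form sends the same terms with no restriction.
general_url = build_search_url("parental leave")

print(parse_qs(urlsplit(hr_url).query))
```

Both forms hit the same engine; only the extra parameter differs.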
We'll try to give more examples in an upcoming article.
If a company has multiple search engines installed, an employee looking for something could visit each search engine separately and run the same search. But this is very inconvenient. It would be much more efficient to give the employee a single search form to enter terms into. Then that search is submitted to all of the engines simultaneously and the various matches are combined back into a single results list. This is what Federated Search attempts to do, and has garnered quite a bit of publicity lately.
However, most of the demos of Federated Search only show search results coming back from multiple public search portals. For example, a search is submitted to Google, MSN and Yahoo, and the results are stitched back together in some way. While this is handy for public search, it falls far short of what true Enterprise Federated Search would need. Some vendors do realize this and are adding additional features.
Examples of features required for Enterprise level Federated Search include:
- Flexible rules for combining results from all of the engines searched
- Maintaining User Security Credentials
- Mapping User Security Credentials to other security domains
- Advanced Duplicate Detection and Removal
- Combining results list Navigators, such as Faceted Search links and Taxonomy Nodes.
- Handling other results list links such as "next page" and sort order.
- Translating user searches into the different search syntaxes used by the disparate engines.
- Extracting hits from HTML results, AKA "scraping", hopefully without the need to custom code.
We've explained these in more detail in the Sidebar "Characteristics of Federated Search in the Enterprise".
Sidebar: Characteristics of Federated Search in the Enterprise
Here we explain, in a bit more detail, what those Federated Search bullet points actually mean. As always, email us if something's not clear.
Flexible rules for combining results from all of the engines searched
At the most basic level, there should be various options for how the results from remote systems are combined. In some cases, users might need a single, simple results list, with matches from all engines mixed in and the different relevancy scores properly calibrated. In other cases, when it's too hard to compare relevancy scores from vendor A and vendor B, it might be better to present the top few results from each under a separate heading or in a separate tab. Other vendors may offer a simple "round robin" approach to combine results.
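Two of these combining rules can be sketched in a few lines. The round robin version ignores scores entirely; the calibrated version rescales each engine's scores by that engine's top score, which is a deliberately crude assumption for this sketch and not how any particular product calibrates. The engines, documents and scores are invented.

```python
from itertools import zip_longest

def round_robin(*result_lists):
    """Interleave results: first hit from each engine, then second, and so on."""
    merged = []
    for tier in zip_longest(*result_lists):
        merged.extend(hit for hit in tier if hit is not None)
    return merged

def merge_by_normalized_score(*scored_lists):
    """Rescale each engine's scores to 0..1 by its top score, then sort."""
    combined = []
    for hits in scored_lists:
        top = max((score for _, score in hits), default=0) or 1
        combined.extend((doc, score / top) for doc, score in hits)
    return sorted(combined, key=lambda pair: pair[1], reverse=True)

engine_a = [("a1", 950), ("a2", 400)]                 # scores on a 0..1000 scale
engine_b = [("b1", 0.9), ("b2", 0.8), ("b3", 0.1)]    # scores on a 0..1 scale

interleaved = round_robin([d for d, _ in engine_a], [d for d, _ in engine_b])
calibrated = merge_by_normalized_score(engine_a, engine_b)
print(interleaved)
print([doc for doc, _ in calibrated])
```

Notice that the two rules disagree about where "a2" belongs, which is exactly why the rule needs to be configurable.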
Maintaining User Security Credentials
The user's security credentials need to be maintained across all of the systems being searched. So if I logged in to my corporate SSO as jsmith, then my jsmith SSO session ID needs to be passed through to all of the search applications that are SSO-aware.
Mapping User Security Credentials to other security domains
The user's security credentials also need to be mapped to systems that have separate logins or security domains. For example, though the SSO system knows me as jsmith, my login for the old Lotus Notes system might be js1234. And my subscription to premium content on LexisNexis might be under email@example.com. The Federated Search system needs to handle this mapping automatically and seamlessly.
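At its simplest, this mapping is a lookup table keyed by the SSO identity. The sketch below uses the jsmith / js1234 pair from the example; the system names and the `login_for` function are hypothetical, and a real federator would also have to store and protect the matching passwords or tokens, which is left out here.

```python
# Hypothetical mapping table: SSO identity -> login for each back end system.
CREDENTIAL_MAP = {
    "jsmith": {"lotus_notes": "js1234"},
}

def login_for(sso_user, system, sso_aware_systems=("cms", "wiki")):
    """Return the identity to present to a back end system, or None."""
    if system in sso_aware_systems:
        return sso_user  # the SSO session passes straight through
    return CREDENTIAL_MAP.get(sso_user, {}).get(system)

notes_login = login_for("jsmith", "lotus_notes")
print(notes_login)
```

A `None` result means no mapping is on file, so the federator should skip that source or ask the user, rather than search it anonymously.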
Advanced Duplicate Detection and Removal
The Federator should be able to spot duplicate results returned by more than one engine. This is relatively easy to do if search results always have a comparable URL. But sometimes the document key from each engine has a radically different format. Imagine, for example, that my CMS returned a match, with a document key of "DOC_OBJECT_45678". My corporate Google Appliance had also spidered the CMS data, and found the same document, but only has a URL key of the form "https://cms.corp.acme.com/ViewItem?view=7272&object=45678&session=a492bc79184ef18".
Clearly both engines have returned the same document, but the federator would need to transform at least one of the document id tags before doing a comparison. And even if all that were handled, there's still the question of which of the duplicates to show and which to get rid of. Suppose one of the matches was a report written by one of my coworkers, from our own CMS. And suppose that same article were also returned by a premium content provider we subscribe to, where if I click on it to read it, my company will be charged for accessing that document. Clearly it would be better to show me the version from our CMS engine for free, rather than the premium content, so business rules need to be applied when deciding which duplicates to keep or toss.
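Using the two key formats from the example, the transform-then-compare step plus a simple business rule might look like this. The source names and their preference order are assumptions made up for the sketch.

```python
import re
from urllib.parse import urlsplit, parse_qs

def canonical_id(doc_key):
    """Reduce either key format to the bare CMS object number, if possible."""
    match = re.fullmatch(r"DOC_OBJECT_(\d+)", doc_key)
    if match:
        return match.group(1)
    object_ids = parse_qs(urlsplit(doc_key).query).get("object")
    return object_ids[0] if object_ids else doc_key

# Lower number = preferred source. The ordering is an assumed business rule:
# show the free in-house copy before anything that costs money.
SOURCE_PREFERENCE = {"cms": 0, "appliance": 1, "premium": 2}

def dedupe(hits):
    """Keep one hit per canonical id, choosing the most preferred source."""
    best = {}
    for hit in hits:
        key = canonical_id(hit["key"])
        current = best.get(key)
        if current is None or (SOURCE_PREFERENCE[hit["source"]]
                               < SOURCE_PREFERENCE[current["source"]]):
            best[key] = hit
    return list(best.values())

hits = [
    {"source": "appliance",
     "key": "https://cms.corp.acme.com/ViewItem?view=7272&object=45678"
            "&session=a492bc79184ef18"},
    {"source": "cms", "key": "DOC_OBJECT_45678"},
]
unique = dedupe(hits)
print(unique)
```

The hard part in practice is not this code but knowing, for every pair of sources, what the key transformation should be.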
Combining results list Navigators, such as Faceted Search links and Taxonomy Nodes.
Yet another challenge for Enterprise class Federated Search is combining results list navigators. We've said before that engines should not just rely on Single Shot Relevancy, that users should be given clickable navigators in the results list so that they can "drill down" into the section of matches that are more interesting. Suppose a Federated Search returns thousands of results to a user query, and that search engine A returned clickable facet navigators of "White Papers", "FAQ's" and "Corporate Library", and that search engine B returned taxonomy branches of "Sales / Northeast", "Sales / Midwest", "Marketing / Press Releases / 2007" and "Engineering / Bug Reports". It would be really difficult to combine these into one coherent list of clickable choices.
But let's consider an even simpler example. Suppose a Federated Search is run against two different "parts" databases in my company, perhaps System A shows matches from our North American electronic components inventory, and System B has aircraft parts stored in our European warehouse. Let's assume I do a search for "bracket", and both systems return thousands of results. And, let's make the navigator facets align more closely, and say that BOTH systems included facets for "Material = Aluminum".
That would certainly be an easier set of navigators to combine. We can take the 350 aluminum brackets from System A and the 75 aluminum brackets from System B and show a unified clickable facet. The facet's header will be "Material", and one of the hyperlinked choices listed underneath will be "Aluminum (425)". Notice we've even combined the counts, 350 from System A and 75 from System B. That all looks fine, but now…
What happens when a user actually clicks on that navigator?
Navigators are typically rendered as hyperlinks, with CGI parameters tacked on. If you moused-over that link, you might see a CGI parameter somewhere in the URL of the form …&material=aluminum&… But that link points back to our Federator, and it has to remember that those 425 were actually 350 matches from System A and 75 matches from System B. The good news is, this is theoretically possible to handle, and instead of "remembering" that this represents results from two systems, the Federator will likely just reformulate two new queries, one for System A and one for System B, both of which have the additional "material=aluminum" filter. But this is still a really complex system. And perhaps System A calls this attribute "material", whereas System B calls it "primary-alloy".
Just for fun, suppose System A uses the common name "aluminum", whereas System B uses the chemical formula from the periodic table, in this case "Al". Obviously not every system is this complex! But the point is that an industrial strength search federation engine would need to have a plugin for rules concerning navigators, and those rules would need to allow for predefined mappings of facet names and values.
To date, no search or federator vendor seems up to this challenge, or they would say "well, you'd write a custom plugin to handle that". A fair answer, but a far cry from the snazzy public demos and easy configuration showing federation of results from Amazon, Flickr and Yahoo. We believe the companies who are really pushing Federation as their primary product need to include more configuration options to handle scenarios like this without the need to break out a Java or C# compiler.
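To make the idea of predefined mappings concrete, here is a sketch using the System A / System B example above, with the mappings held as plain lookup tables. The table layout and function names are our own invention; in a product this would ideally be configuration rather than code.

```python
from collections import Counter

# Predefined mappings from each system's facet names and values to shared ones.
# System B's "primary-alloy" and "Al" come from the example in the text.
FACET_NAME_MAP = {"B": {"primary-alloy": "material"}}
FACET_VALUE_MAP = {"B": {"Al": "aluminum"}}

def merge_facets(per_system_facets):
    """Sum facet counts across systems after normalizing names and values."""
    merged = Counter()
    for system, facets in per_system_facets.items():
        for (name, value), count in facets.items():
            name = FACET_NAME_MAP.get(system, {}).get(name, name)
            value = FACET_VALUE_MAP.get(system, {}).get(value, value)
            merged[(name, value)] += count
    return merged

def native_filter(system, name, value):
    """Translate a shared facet back into one system's own name and value."""
    names = {v: k for k, v in FACET_NAME_MAP.get(system, {}).items()}
    values = {v: k for k, v in FACET_VALUE_MAP.get(system, {}).items()}
    return names.get(name, name), values.get(value, value)

merged = merge_facets({
    "A": {("material", "aluminum"): 350},
    "B": {("primary-alloy", "Al"): 75},
})
print(merged[("material", "aluminum")])          # the combined count to display
print(native_filter("B", "material", "aluminum"))  # what to send when clicked
```

`merge_facets` builds the unified navigator, and `native_filter` handles the click by turning the shared facet back into each system's own filter before the queries are reissued.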
Handling other results list links such as "next page" and sort order.
Another type of results list navigation link which can get complicated in the federated model is the "next page" / "previous page" link. If I'm showing you combined results from four different search engines, what should happen when you ask to see the "next page" of results? There are answers to this question, of course, but they are not always trivial. Allowing users to change the sort order of results has similar technical challenges in a federated environment.
Translating user searches into the different search syntaxes used by the disparate engines
Another challenge for high end federated search is the different search syntaxes used by the various systems that are to be searched. Using the previous example of looking for an aluminum bracket, all the user did was type the word bracket into a search box, and then drill down by clicking on the "aluminum" facet.
On the back end, this needs to be translated into the native syntax for each engine. With 3 back end engines, the same request might need to be expressed three different ways.
Most vendors refer to this as Query Transformation.
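As a sketch of what that translation involves, here is the aluminum bracket request rendered into three invented syntaxes: a web-style one, an SQL-flavored one, and CGI-style parameters. None of these is any specific vendor's language, and the table name and CONTAINS operator are made up.

```python
from urllib.parse import urlencode

def to_plus_syntax(terms, filters):
    """Web-style syntax: +term for required words, field:value for filters."""
    parts = [f"+{t}" for t in terms]
    parts += [f"+{field}:{value}" for field, value in filters.items()]
    return " ".join(parts)

def to_sql_like(terms, filters):
    """An SQL-flavored syntax; the table and CONTAINS operator are invented.

    Real code would have to escape quotes in the user's terms.
    """
    clauses = [f"description CONTAINS '{t}'" for t in terms]
    clauses += [f"{field} = '{value}'" for field, value in filters.items()]
    return "SELECT * FROM parts WHERE " + " AND ".join(clauses)

def to_url_params(terms, filters):
    """CGI-style parameters, like those used in navigator links."""
    return urlencode({"q": " ".join(terms), **filters})

TRANSFORMERS = {"engine1": to_plus_syntax, "engine2": to_sql_like,
                "engine3": to_url_params}

def transform(terms, filters):
    """Render one logical query into every back end's native form."""
    return {name: fn(terms, filters) for name, fn in TRANSFORMERS.items()}

native = transform(["bracket"], {"material": "aluminum"})
for engine, query in native.items():
    print(engine, "->", query)
```

Keeping the user's request in a neutral form (terms plus filters) and rendering it per engine is what makes it possible to add a fourth engine later without touching the other three.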
Extracting hits from HTML results, AKA "scraping"
And a final hurdle, some of the remote engines being searched will just return simple HTML. The Federator needs to parse the results list, to pull out the actual documents; this is often called Screen Scraping. Although most vendors would support writing a custom plugin in Java, C# (or maybe Perl/Python), it would be better if there were some type of configuration language that made it easier to dissect the HTML results, without the need for coding. For example, our own XPump language includes various high level pattern matching extractors.
As you can see, this is an incredibly complex topic. The simplistic demos using only public data sources that most vendors show don't even begin to scratch the surface of real Enterprise Federated Search level capabilities. And when the answer to every tough question is "well, you could write a custom plugin/connector to do that" and other vague hand waving, it means that quite a bit of this complexity will be your responsibility to figure out, or may add lots of dollars to your "Professional Services" bill. We're not saying "don't do it!", if you really need federated search it IS possible; we're just suggesting that you keep your eyes wide open, and ask LOTS of questions after they show you their Flickr/YouTube federated demo.
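For comparison, this is roughly what the hand-coded route looks like in Python, using only the standard library's HTML parser. It assumes the remote engine marks each hit as a link with class="result"; that markup and the sample page are invented, and every real results page will need its own rules.

```python
from html.parser import HTMLParser

class ResultScraper(HTMLParser):
    """Collect (url, title) pairs from links marked with class="result".

    The class name is an assumption about the remote page's markup.
    """
    def __init__(self):
        super().__init__()
        self.hits = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "result" in (attrs.get("class") or "").split():
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        # Only collect text while we are inside a result link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.hits.append((self._href, "".join(self._text).strip()))
            self._href = None

page = """
<ul>
  <li><a class="result" href="/docs/101">Bracket <b>torque</b> specs</a></li>
  <li><a href="/help">Search help</a></li>
  <li><a class="result" href="/docs/205">Landing gear FAQ</a></li>
</ul>
"""
scraper = ResultScraper()
scraper.feed(page)
print(scraper.hits)
```

Even this small example has to cope with highlighting tags inside the title and with unrelated links on the page, which hints at why scrapers tend to break whenever the remote site changes its layout.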
User Experience: Lower Level Differences between Internet and Enterprise Search
This section focuses more on the low level "bits and bytes" and behind the scenes technical issues that ultimately impact what the user sees. If you get these wrong it won't be directly noticeable to your users, but they will tend to NOT find what they are looking for, and continue to claim that "search sucks".
Vocabulary and Thesaurus
Finally some good news! Virtually all enterprise search vendors support some type of custom thesaurus capability, and often this provides the biggest "bang for the buck" in terms of minimal effort and greatly improved results. A related feature would be the search engine's spelling suggestion system.
Within a business or agency, there is typically a much narrower vocabulary in use than that of the public Internet. And there are typically "domain experts" within the organization that know all about the custom terms and abbreviations commonly used. Since companies control the search engine, they also have access to search activity logs, which can be another source for finding common search terms and misspellings.
A frequent problem with search, but one that can be detected and fixed, is Vocabulary Mismatch. This is where the subject experts and authors of the content are using one set of terms, but casual searchers are using a different set of terms. Perhaps users are not aware of important abbreviations, or misspell a word, product name, or model number. One abandoned idea is to educate the users to use the correct terms – this is rarely practical. Another basic fix is to get the content authors to use some of the same terms that users use; this works sometimes, assuming you have control over the content, and the resources to update older content. But things like common abbreviations and acronyms used within an organization are so pervasive, you're never going to catch all of them, and a well-maintained thesaurus can easily fix these problems.
We do remind clients to periodically compare their thesaurus with their search activity logs; terminology certainly changes over time, either due to recent events or new product launches, etc.
You Can Monitor "Performance"
More good news. If you own the search engine, you can also monitor its response time, usage patterns and search activity much more closely. In modern, well functioning search engines, your users should get back results in a second or less the majority of the time. If, on the other hand, you're seeing searches routinely take more than 2 seconds, or heaven forbid, more than 5 seconds, it's time to call IT or the search engine vendor.
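As a rough illustration of that kind of monitoring, the sketch below buckets response times against the thresholds mentioned above. It assumes you have already pulled the elapsed time of each search, in seconds, out of your engine's logs; the `summarize_response_times` name and the sample numbers are ours, not any product's.

```python
def summarize_response_times(seconds):
    """Bucket search response times (in seconds) by the 1s, 2s and 5s marks."""
    total = len(seconds)
    if total == 0:
        return {"total": 0, "within_1s": 0.0, "over_2s": 0.0, "over_5s": 0.0}
    return {
        "total": total,
        # Share of searches answered in a second or less.
        "within_1s": sum(1 for s in seconds if s <= 1.0) / total,
        # Share of searches slow enough to investigate.
        "over_2s": sum(1 for s in seconds if s > 2.0) / total,
        # Share of searches slow enough to call the vendor.
        "over_5s": sum(1 for s in seconds if s > 5.0) / total,
    }


# Made-up sample data standing in for one hour of search log entries.
sample = [0.2, 0.4, 0.9, 1.5, 2.5, 6.0, 0.3, 0.7]
print(summarize_response_times(sample))
```

If the "within_1s" share is not comfortably above half, or the "over_2s" share keeps creeping up week over week, that is the signal to start asking questions.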
And as we have pointed out before, great Search Analytics starts with great search activity data logging. You have a lot more control over the level of detail that gets logged (and possibly greater liability!).
In business, the quicker you spot problems, the quicker you can fix them! Contrast this with the search on the public World Wide Web. Unless you work for one of the big portal companies, there's not much you can directly do about problems that you spot. You can blog about it, or rant about it in your Enterprise Search Newsletter, but that presumes that somebody at the portals is actually listening.
Search Syntax May Be Different
On the public Internet, advanced users who want to tweak their search results can edit their query and resubmit it. Most modern engines on the web understand the plus sign (+) to mean "required", and the minus sign or hyphen (-) to mean NOT or "exclude". Many engines also support the colon (:) to indicate a field or zone search; for example, title:California might be interpreted to mean "search for the word California, but only if it appears in the title". In addition to advanced users, programmers sometimes use this syntax when automatically generating complex queries in custom search applications.
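For readers who think better in code, here is a small Python sketch of how a query using these three conventions might be pulled apart. It is an illustration only; the `parse_query` function is hypothetical, it ignores quoted phrases entirely, and real engines have much more elaborate parsers.

```python
def parse_query(query):
    """Split a web-style query into required, excluded, optional and fielded terms."""
    parsed = {"required": [], "excluded": [], "optional": [], "fields": {}}
    for token in query.split():
        if token.startswith("+") and len(token) > 1:
            parsed["required"].append(token[1:])      # +term means "required"
        elif token.startswith("-") and len(token) > 1:
            parsed["excluded"].append(token[1:])      # -term means "exclude"
        elif ":" in token and not token.startswith(":") and not token.endswith(":"):
            field, value = token.split(":", 1)        # field:term is a zone search
            parsed["fields"].setdefault(field, []).append(value)
        else:
            parsed["optional"].append(token)          # everything else is a plain term
    return parsed


print(parse_query("+vacation -sick title:California policy"))
```

Even this toy version shows where things get ambiguous: a pasted URL contains a colon, and a hyphenated part number starts to look like an exclusion.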
Most enterprise search vendors have adopted similar query parsing rules, although this is not universal, and there are some subtle differences. Most vendors also have a more syntax rich native language, which may look a bit more like SQL. But unlike SQL, there is no standard syntax shared by most vendors; standards have certainly been proposed, but not widely supported.
Vendors vary more widely in terms of how Boolean operators are expressed, or whether sub-queries can be nested inside of parentheses. For example, the Google appliance supports both + and the word AND, the older Verity formal K2 VQL syntax was <AND>, and Ultraseek used | and || (single and double vertical bars). That last one is not a typo! Unlike C-derived syntax, which equates the vertical bar with "OR", Ultraseek used that character to mean "AND", more reminiscent of the Unix pipe command. It's not important that you remember this, unless you're a search consultant. The point is that it can vary widely between vendors. This is an issue that developers working only with one or two public search engines may have never faced.
A more subtle difference has to do with the default Boolean operator. On the public Internet the "implied" Boolean operator is typically an "AND" if you type in multiple search terms. Enterprise search engines, on the other hand, may default to an OR when searching for multiple terms, although this policy can sometimes be overridden.
An asterisk (*) may invoke a wildcard search, but not always; there may be restrictions, or configuration changes may be required.
Case sensitive searching is supported by some vendors, but not by others, or may require additional syntax to specify. For example, my first name is Mark, but the generic word "mark" can also be used as a noun or verb in English. Some engines would allow me to search for just "Mark", and ignore the lowercase version; other engines do not. By default, most engines are case-insensitive (they ignore differences in upper and lower case). A few engines change their default, depending on whether they see both upper and lower case letters in the search.
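The effect of the default operator is easy to see with a toy example. The three "documents" and the `search` function below are invented for illustration, and the matching is deliberately naive (whole words, no ranking).

```python
# Made-up mini corpus: document id -> text.
DOCS = {
    1: "holiday schedule for the sales office",
    2: "sales forecast for next quarter",
    3: "office holiday party photos",
}


def search(query, docs=DOCS, default_operator="AND"):
    """Return ids of documents matching all (AND) or any (OR) of the query terms."""
    if default_operator not in ("AND", "OR"):
        raise ValueError("default_operator must be 'AND' or 'OR'")
    terms = query.lower().split()
    combine = all if default_operator == "AND" else any
    return sorted(
        doc_id
        for doc_id, text in docs.items()
        if combine(term in text.lower().split() for term in terms)
    )


print(search("holiday sales", default_operator="AND"))
print(search("holiday sales", default_operator="OR"))
```

Same query, same content, very different result lists. With AND only the first document qualifies; with OR all three come back, and it is up to relevancy ranking to put the best one on top.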
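That last behavior, switching the default based on the letters typed, can be sketched in a few lines of Python. This is our own simplified reading of the rule, with a hypothetical `matches` function and whitespace-only tokenizing; an actual engine may apply it differently.

```python
def matches(term, text):
    """Match case-sensitively only when the search term mixes upper and lower case."""
    words = text.split()
    # "Mark" mixes cases, so it is treated as case sensitive;
    # "mark" and "MARK" do not, so they match regardless of case.
    case_sensitive = term != term.lower() and term != term.upper()
    if case_sensitive:
        return term in words
    return term.lower() in [word.lower() for word in words]


print(matches("Mark", "please mark the date"))   # mixed-case term, lowercase word
print(matches("Mark", "ask Mark about it"))      # mixed-case term, exact word
print(matches("mark", "ask Mark about it"))      # lowercase term, any case
```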
Punctuation may be an important part of the actual Search Terms
People using public search engines don't usually worry about punctuation. But enterprise users and technically savvy customers may need to look up specific model numbers or error messages that include lots of punctuation. An engineer might think to include such a term inside of double quotes, but even that doesn't work on all search engines. And imagine a company with a part number designation of XY-420; customers and documentation might refer to that as XY420, XY 420, Xy 4.2.0, or perhaps they own an "XY-420a"; correctly matching model numbers and version numbers can be particularly vexing in some applications. Even tougher, clients have told us about customers copying and pasting error logs and stack traces directly into their site search box. Chemical equations, units of measure and mathematical symbols might also be critical to some specialized search applications. Other classic examples are how to properly tokenize and search for things like PS/2, AT&T and X.25.
There is a rather long explanation for the different ways this can be handled, often depending on which vendor/engine is being used, and there may be some overlap with Entity Extraction. It's too much to cover here. But suffice it to say, if you have a search project where punctuation is important, it is not safe to assume your search engine will handle it correctly. This will require some research and thorough testing to ensure proper operation.
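One simple approach to the part number problem, and only one of several, is to normalize part numbers the same way at index time and at query time. The Python sketch below uses a hypothetical `normalize_part_number` helper that uppercases the text and strips everything except letters and digits.

```python
import re


def normalize_part_number(text):
    """Uppercase the text and drop everything except letters and digits."""
    return re.sub(r"[^A-Z0-9]", "", text.upper())


for variant in ["XY-420", "XY420", "XY 420", "Xy 4.2.0", "XY-420a"]:
    print(variant, "->", normalize_part_number(variant))
```

The first four variants all collapse to the same key, while the "a" model stays distinct. Note the trade-off, though: stripping the dots is exactly the wrong thing to do if 4.2.0 is a version number that must not match 420, which is why this needs testing against your own data.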
Control over Duplicate Detection and Near Duplicates
For various technical reasons, duplicates are almost never identical, yet a casual human observer still perceives them as duplicates. Of course there can also be "false positives": documents that appear very similar at first glance, but which are in fact different, and both of importance. So even now, we're left with computers trying to guess which things are really duplicates, versus which things are actually different.
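To give a feel for how such guessing can work, here is a minimal Python sketch of one well-known family of techniques: break each document into overlapping word sequences ("shingles") and measure how much the two sets overlap (Jaccard similarity). We are not claiming any particular engine does it this way; the function names, the shingle size and the sample texts are all our own.

```python
def shingles(text, size=3):
    """Return the set of overlapping word sequences of the given size."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)}
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def similarity(a, b):
    """Jaccard similarity of two texts: shared shingles / all shingles."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)


# Made-up sample texts.
contract_a = "this agreement is made between acme corp and client alpha"
contract_b = "this agreement is made between acme corp and client beta"
memo = "the holiday schedule is posted on the intranet home page"

print(similarity(contract_a, contract_b))  # high score, but not identical
print(similarity(contract_a, memo))        # unrelated documents
```

The engine then applies a threshold to that score, and everything hinges on where the threshold sits: set it too low and two genuinely different contracts get treated as one.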
There is a related problem with detecting duplicates when using Federated Search, see the sidebar.
Most public Internet search engines play very fast and loose with duplicates. More than ten years ago, public World Wide Web search engines started erring on the side of "perception" – if two things would look like a duplicate to a user, it's best to remove one of them, sort of "When in doubt, throw it out!" Users complained loudly about duplicates cluttering up the results list, and in those fiercely competitive days, portals didn't want to lose "eyeballs" so they complied. And the portals figured that, even if a few non-dupes were accidentally tossed out, users would never know!
To be fair, this was market driven, and some public search portals such as Google even give you the option to see all of the results, dupes and all. It's not the default, but if you want to see them, you can.
But this is in drastic contrast to the situation many companies face. Imagine a law office that uses boilerplate contracts with many of its clients. The text will tend to be very similar, but every contract is important, and you certainly don't want some computer randomly guessing which documents to ignore - there could be dire legal consequences for such computer chicanery. Yes, in this specific situation, there are technical solutions to address it, assuming anybody thought to ask.
Or imagine a software company with a mature product line that has gone through 10 major revisions over the years. Each version came with revised instruction manuals in PDF and HTML. To a computer, some of these versions are likely to seem almost identical, especially for chapters covering features that have not changed that much. But for somebody working in Tech Support or QA at this company, you would not want the computer to start randomly guessing which version to show them; they most likely want the most recent version or a list of available versions. But what they probably don't want is for the engine to quietly show them the version from 5 years ago.
Returning to our law firm scenario, imagine that upon discovering that duplicates were mistakenly discarded, the firm decides to just turn off the duplicate detection in their search engine, better safe than sorry! But this could cause other serious problems. For example, a web page on their Intranet might be referenced by more than one URL. The company holiday schedule might be referenced as http://corp/holidays, https://corp/holidays/, http://corp/holidays/index.html, and http://corp.acme.com/holidays/index.html. If many of the pages on their network had 3 or 4 URLs, turning off duplicate detection could dramatically impact search results for the worse, and make all employees substantially less productive (and aggravated). Again, in this scenario, there are probably ways to configure the search engine to properly handle both issues, to remove duplicates that are simply caused by different versions of the same URL, while still ensuring that all copies of customer contracts are properly returned. But somebody will have to understand these issues, talk to the vendor or consultant, and then test the configuration changes. It's not rocket science, but it may not be "trivial" to fix either, and this is something that a casual public Internet search user might never have encountered.
You can see that the problem is not only detecting which things are really duplicates, vs. which things are almost duplicates, but also then deciding which version should be shown by default. And you might want to give the user additional indications that other versions do exist.
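The URL half of that problem can be sketched as a normalization step that runs before any content comparison. The Python below is an assumption-laden illustration: the `HOST_ALIASES` table, the list of default pages, and the decision to treat http and https as the same page are choices somebody at the firm would have to confirm for their own network.

```python
from urllib.parse import urlsplit

# Hypothetical site-specific rules; every network will have its own.
HOST_ALIASES = {"corp": "corp.acme.com"}   # short host name -> full host name
DEFAULT_PAGES = ("index.html",)            # file names that mean "this directory"


def canonical_url(url):
    """Reduce equivalent forms of an intranet URL to a single key."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    host = HOST_ALIASES.get(host, host)
    path = parts.path
    for page in DEFAULT_PAGES:
        if path.endswith("/" + page):
            path = path[: -len(page)]      # drop the default page name
    path = path.rstrip("/") or "/"         # ignore trailing slashes
    query = "?" + parts.query if parts.query else ""
    # Assumption: http and https serve the same content on this network.
    return "http://" + host + path + query


urls = [
    "http://corp/holidays",
    "https://corp/holidays/",
    "http://corp/holidays/index.html",
    "http://corp.acme.com/holidays/index.html",
]
print({canonical_url(u) for u in urls})
```

All four forms of the holiday schedule reduce to one key, so they can be collapsed safely, while two contracts stored at different addresses are left alone no matter how similar their text is.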
Different Security Requirements and Infrastructure
On the public World Wide Web, there's not much of a connection between security and search. You may have logins for certain web sites, and you may use https when buying something with your credit card, but that doesn't have much to do with a general search that you might do on Google or Yahoo. Yes, a few subscription sites limit what non-subscribers can see, or may require an account to read the full text of a document.
But as we've mentioned in previous articles, and in the section on Federated Search above, inside the enterprise security can be heavily tied to search. In larger organizations, your company login may be used to restrict which documents you can see in the search results list.
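In its simplest form, restricting results by login comes down to comparing the groups a user belongs to against the groups allowed to read each document. The Python sketch below shows that filtering step only; the `allowed_groups` field, the group names and the `trim_results` function are hypothetical, and a real deployment would obtain group membership from the company's directory rather than a hard-coded list.

```python
def trim_results(results, user_groups):
    """Keep only the results the user is allowed to see."""
    groups = set(user_groups)
    # A document is visible if the user shares at least one group with its ACL.
    return [doc for doc in results if groups & set(doc["allowed_groups"])]


# Made-up result list with a per-document access control list.
results = [
    {"title": "Holiday schedule", "allowed_groups": ["everyone"]},
    {"title": "Salary bands", "allowed_groups": ["hr"]},
    {"title": "Merger plan", "allowed_groups": ["executives", "legal"]},
]

visible = trim_results(results, ["everyone", "legal"])
print([doc["title"] for doc in visible])
```

Two people typing the identical query can legitimately get different result lists, which is something a public web search user rarely encounters.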
Generally speaking, the security infrastructure and protocols tend to be different between the public web and the enterprise, and when security is needed in search, the search engine must integrate with whatever infrastructure and protocols are available. The one piece of common ground is transport: the Internet and intranets both use SSL and HTTPS.
On the web, newer systems are evolving to allow for electronic payments, so for example you can use PayPal on many different sites. The SAML standard, based on XML, is becoming more widely adopted. And the general concept of "distributed identity assurance" is evolving as well. A participating site could verify that I am a California resident, for example, without requiring me to create a new account on every site that needs to know that.
But on private networks, Single Sign On (SSO), LDAP and Active Directory are still the norm. A search application that cares about security will likely be using those protocols. A few enterprise apps still use the older "application level security"; search engine applications can do likewise. We haven't seen the distributed identity assurance model taking root in corporations yet; they still seem to prefer a tightly controlled central resource. One exception to this apathy is between different government agencies; they are starting to realize that government employees frequently need data from other agencies, and that distributed, cooperative security systems make this much more efficient, so that government employees don't have to keep creating new logins for every agency they visit.
There are additional security related items to consider when dealing with Federated Search, see the Sidebar.
In Our Next Issue…
In our next installment, we'll finish up the technical differences between Internet and Enterprise search, with an emphasis on the source data that is to be searched, and the spiders and connectors that are creating the search index.
We'd love to hear from you. This is a very broad topic, but your projects may be concerned with a few key areas, so drop us a line!