Mapping Security Requirements to Enterprise Search - Part 2: Implementing the system

Editor's Note: Click to read Part 1 or Part 3 in this series

Introduction to Part 2

In the last issue we presented Part 1 of "Mapping Security Requirements to Enterprise Search", in which we covered:

An introduction to adding security to enterprise search engines

Definitions of some of the vocabulary that will be used

An examination of some of the search-related requirements companies have

In this installment, we'll be talking about the implementation of such a system, and how some of the more popular vendors stack up. Since some details are vendor-specific and also potentially confidential, we'll stop short of showing actual code.

Vocabulary

Before we dive in, there are a few terms that we should review. Though these terms can have broad meanings in the computer and software industry, our definitions here are specific to their use in relation to search engines.

Document Level Security (in search engines)

Security that can be controlled on a document-by-document basis. User A does a search and can find matching documents that he has access to. User B does the exact same search, but sees a different set of matching documents: the ones she has permission to view.

ACL / Access Control List

A list of security permissions associated with a particular document or web page; an electronic representation of who is and is not allowed to see the document. These permissions store the unique ID for each group of people who can see the document; it is also possible to store the ID for specific users (vs. entire groups) though this is less common and overuse can lead to inefficiencies. Some ACL systems also allow for a list of group and user IDs that are specifically NOT allowed to see a document; these "deny" lists typically override all "allow" listings.

ACLs are managed and stored in some other system, such as an LDAP or Active Directory server, or in a content management system.
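As a vendor-neutral illustration of the ACL definition above, the following minimal sketch models an ACL as an "allow" list and a "deny" list, with deny entries overriding allow entries. The class name, function name, and the "g:" and "u:" ID prefixes are assumptions made for this example, not part of any particular product.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Acl:
    """Hypothetical ACL record: IDs of principals allowed or denied."""
    allow: frozenset = field(default_factory=frozenset)
    deny: frozenset = field(default_factory=frozenset)


def can_view(acl: Acl, user_id: str, group_ids: set) -> bool:
    """Return True if the user may see the document guarded by `acl`."""
    principals = set(group_ids) | {user_id}
    # A "deny" entry overrides every "allow" entry.
    if principals & acl.deny:
        return False
    return bool(principals & acl.allow)


hr_memo = Acl(allow=frozenset({"g:hr", "u:alice"}),
              deny=frozenset({"g:contractors"}))

print(can_view(hr_memo, "u:bob", {"g:hr"}))                      # True
print(can_view(hr_memo, "u:carol", {"g:hr", "g:contractors"}))   # False
print(can_view(hr_memo, "u:dave", {"g:sales"}))                  # False
```

Note that most entries are group IDs, with a single user ID ("u:alice") as the less common per-user case.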

LDAP and Active Directory

LDAP and Active Directory are both used for storing information about users, groups of users, and other company resources. LDAP stands for Lightweight Directory Access Protocol; it is an open standard supported by many vendors. Active Directory is Microsoft's directory service, which can itself be queried over LDAP. Adapters exist to allow systems built around the two to interact with each other.

CMS / Content Management System

Software that stores and manages large numbers of documents. Examples include Documentum, Microsoft SharePoint, Lotus Notes and Vignette. Content from these systems is often indexed and searched by enterprise search engines.

SSO / Single Sign On

A network service that allows an employee to log in once and then have access to all secured applications without needing to log in again for each app. In order for search engines to implement security, they usually need to interact with one or more of these systems.

Two General Types of Implementation

Document level security, where each group can have access to different documents on a group by group basis, is the fastest growing segment of high end search engine installations. Document level security is used when the simpler application level security workarounds, such as collection level security, start to fail. To have different permissions for each document, you need to have some type of existing ACL (Access Control List) system and/or SSO (Single Sign On) system in place, and integration software from the search engine vendor to connect to it.

The two general ways of implementing document level security are "Early Binding" and "Late Binding" filtering.

Early vs. Late Filtering

Although implementations are vendor specific, there are two primary designs for providing document level security, "early binding" or "late binding" document filtering:

"Early binding" document filtering is set up before the query is sent to the core search engine. Detailed information about the user's permissions is automatically added to the query that the user typed, just before the query is submitted, so that the core engine will only bring back documents that the user can access.

Early Binding Security

Early-binding document security is often more complex to set up, but is strongly preferred since it should provide much better performance and avoids some odd display issues. If the underlying engine understands the user's security limitations, it will only return documents that the user can see; time is not wasted on gathering the titles and summaries of documents that can't be seen. Since the filtering happens at the lowest level of the search engine, it should also happen much more efficiently.
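The early-binding idea can be sketched in a few lines. This is a toy, vendor-neutral illustration rather than any vendor's actual code: the in-memory "index", the query dictionary, and the names build_secured_query and search are all assumptions made for the example. The point it shows is that the user's group IDs travel with the query, and the engine rejects documents on ACL grounds before doing any other work on them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedDoc:
    doc_id: str
    text: str
    allowed_groups: frozenset  # ACL copied into the index at indexing time


def build_secured_query(user_terms: str, user_groups: set) -> dict:
    """Attach the user's group IDs to the query before it reaches the engine."""
    return {"terms": user_terms.lower().split(),
            "acl_filter": frozenset(user_groups)}


def search(index: list, query: dict) -> list:
    """Toy engine: a hit must pass the ACL filter AND contain every term."""
    hits = []
    for doc in index:
        if not (doc.allowed_groups & query["acl_filter"]):
            continue  # rejected inside the engine; no title/summary work done
        words = doc.text.lower().split()
        if all(term in words for term in query["terms"]):
            hits.append(doc.doc_id)
    return hits


index = [
    IndexedDoc("d1", "quarterly salary review", frozenset({"hr"})),
    IndexedDoc("d2", "salary survey for engineers", frozenset({"hr", "eng"})),
    IndexedDoc("d3", "cafeteria menu", frozenset({"all"})),
]

print(search(index, build_secured_query("salary", {"eng", "all"})))  # ['d2']
print(search(index, build_secured_query("salary", {"hr"})))          # ['d1', 'd2']
```

The same typed query ("salary") produces different result lists for the two users, which is exactly the document level security behavior defined in the Vocabulary section.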

"Late binding" document filtering handles document security after the search has been submitted to the core engine, while the results list of matching documents is being displayed to the user. Each document's access level is checked, prior to being displayed, against the user's security credentials. The results list formatter will have to check every single document against an external server to see if the user has access.

Late Binding Security

Late-binding document filtering can potentially be very slow, and can strain corporate security systems. Consider a user with relatively limited access, who belongs to only one low-privileged group. Let's assume that, on average, this user can only see 10% of intranet content. Since most engines show 10 documents on the first page of results, on average 100 documents will need to be considered before 10 are found that are acceptable to show. So for every search, for every user in this group, about 100 documents will need to be checked.
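The arithmetic above (a page of 10 results divided by a 10% visibility rate gives about 100 checks) can be reproduced with a small sketch. It is illustrative only: the function name, the evenly spaced toy data, and the is_allowed callback standing in for a call to the corporate security system are assumptions made for this example.

```python
def late_binding_page(ranked_doc_ids, is_allowed, page_size=10):
    """Walk the ranked results, checking each one, until a page is filled.

    Returns the visible documents and the number of access checks made.
    """
    page, checks = [], 0
    for doc_id in ranked_doc_ids:
        checks += 1                 # one round trip to the security system
        if is_allowed(doc_id):
            page.append(doc_id)
            if len(page) == page_size:
                break
    return page, checks


# Assumed toy data: 1,000 ranked hits, of which every tenth is visible (10%).
ranked = list(range(1, 1001))


def visible(doc_id):
    return doc_id % 10 == 0


page, checks = late_binding_page(ranked, visible)
print(len(page), checks)   # 10 100
```

With real data the visible documents are not evenly spaced, so 100 is an average rather than a guarantee; an unlucky query could require many more checks to fill the first page.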

Vendors have many different names for these two systems, so sadly you may need to do a little digging.

If early-binding security is so much better than late binding, why would anybody even bother with late-binding? The answer is that, from a technical standpoint, late-binding was much simpler to design and implement, and until very recently was much more common.

If you think about it, early-binding security requires much more up-front work. For each document (or URL or database record, etc.), its entire access details must be downloaded and stored in the search index. Getting the detailed ACL info for a document depends on how the document was stored. If a document is stored on a Windows fileserver, then Microsoft-based security information for that file must be gathered; any reference to specific groups or users will be a reference to Microsoft domain groups and users. On the other hand, if the document was stored inside Documentum, then that content management system must be consulted for user and group information; those user and group references will be specific to the Documentum security database, and may have no connection to Microsoft domain groups and users. In a large company, there can easily be a half dozen different document repositories, each with its own idea of "groups" and "users". Gathering ACL information from each of these unique sources, and then mapping each to actual users and groups inside a company, is a complex task.
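The mapping step described above might look something like the following sketch. It is a simplified, hypothetical illustration: the repository labels, the local group names, the "corp:" IDs, and the lookup table itself are invented for the example, and a real deployment would typically build such a table from each repository's own directory rather than hard-coding it.

```python
# Hypothetical mapping from (repository, local principal) to a corporate group ID.
PRINCIPAL_MAP = {
    ("windows", "CORP\\HR-Staff"): "corp:hr",
    ("documentum", "dm_hr_readers"): "corp:hr",
    ("documentum", "dm_legal"): "corp:legal",
}


def normalize_acl(repository: str, local_principals: list) -> tuple:
    """Translate a repository's own group names into corporate group IDs.

    Returns (mapped_ids, unmapped_names). Unmapped names are reported rather
    than guessed, so that a gap in the mapping never widens access.
    """
    mapped, unmapped = set(), []
    for name in local_principals:
        corp_id = PRINCIPAL_MAP.get((repository, name))
        if corp_id is None:
            unmapped.append(name)
        else:
            mapped.add(corp_id)
    return mapped, unmapped


print(normalize_acl("windows", ["CORP\\HR-Staff"]))
print(normalize_acl("documentum", ["dm_hr_readers", "dm_legal", "dm_temp"]))
```

Here two differently named groups, one from a Windows fileserver and one from Documentum, resolve to the same corporate group, while an unknown name is flagged for an administrator to resolve.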

With late-binding security, by contrast, a single question can be asked for any matching document and user: a simple "yes/no" request is made to retrieve the URL of each document. The credentials of the user who issued the search are forwarded to whatever remote system hosts that particular URL. The remote system will either return the document or not, depending on its own opinion of whether that user can see that document. From the search engine's standpoint, it gets either a "yes" or a "no" answer, and displays or discards that document from the results list accordingly.
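A minimal sketch of this "yes/no" probe follows. It assumes, purely for illustration, that the remote systems answer over HTTP and that a success status means "yes" while anything else means "no". The fetch_status callable, the example URLs, and the credentials dictionary are hypothetical; a stand-in is used in place of a real HTTP client so that the sketch runs without a network.

```python
def user_can_see(url: str, credentials: dict, fetch_status) -> bool:
    """Ask the system hosting `url` whether this user may retrieve it.

    `fetch_status(url, credentials)` is a hypothetical callable that requests
    the URL on the user's behalf and returns the HTTP status code.
    """
    status = fetch_status(url, credentials)
    return 200 <= status < 300   # anything else (401, 403, 404, ...) is a "no"


def filter_results(urls, credentials, fetch_status):
    """Keep only the results the remote systems agree to serve to this user."""
    return [u for u in urls if user_can_see(u, credentials, fetch_status)]


# Stand-in for a real HTTP client, so the sketch runs without a network.
def fake_fetch_status(url, credentials):
    readable = {"alice": {"http://intranet.example/hr/policy"}}
    return 200 if url in readable.get(credentials["user"], set()) else 403


results = ["http://intranet.example/hr/policy",
           "http://intranet.example/legal/memo"]
print(filter_results(results, {"user": "alice"}, fake_fetch_status))
# ['http://intranet.example/hr/policy']
```

The search engine never needs to understand why the answer was "no"; it only needs the remote system's verdict, which is what makes this design simpler to build than early binding.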

In Our Next Installment

Next month we will start talking about implementation. For those of you who were looking for the references to SSO and all the other protocols that comprise modern security systems, that's implementation!

Meanwhile, interested readers might start working through their own requirements and nailing things down. It is these types of complex requirements, and the alphabet soup of protocols and vendor products, that make these systems a bit tricky to implement, but also very interesting.

And as always, we'd like to hear your thoughts and experiences on security. Which vendor you used, which add-ons, etc. What worked, and maybe what didn't.
