Security Requirements to Enterprise Search: part 1

« NIE Newsletter

Mapping Security Requirements to Enterprise Search - Part 1: Defining Specific Security Requirements

Editor's Note: Click to read Part 2 or Part 3 in this series

Introduction

Expanding the scope of search within an enterprise, to enable employees and partners to more easily find data, seems almost directly at odds with the security requirements that mandate precisely controlled access to that same data. Amid this turmoil, search vendors have remained uncharacteristically quiet on the subject; while they may offer a few buzzword compliant check boxes on their data sheets, public information about tightly integrating security with search is scarce.

Why the Increasing Interest?

The media has made the general public more aware of security, or more importantly, the high-profile failures of security. Within the enterprise, corporate legal, IT and PR departments have tried to protect their own companies by being more proactive, both in terms of technology and procedures. The government has also stepped in to add compliance regulations and penalties. Judges have even mandated the search enabling of archives as part of the discovery phase of large lawsuits.

But perhaps a more fundamental reason for the increased need for security is the large amount of data that is now being stuffed into corporate data stores, and subsequently being "search enabled". For example, the amount of email stored inside corporations is growing, and there is a trend to "throw the switch" and turn the corporate search engine loose on that data. Storage manufacturers are also accelerating the growth of fully indexed data by turning their products into "smart" devices, where the data stored within can be searched directly, without the need for an external search engine.

As more data is indexed, there are more chances for sensitive data to be easily retrieved. Don't believe us? Try searching for the word "confidential" on your internal portal.

We're search junkies, so generally we love this trend! But turning the genie loose on these vast amounts of data does seem at odds, at least on the surface, with maintaining security.

Why the complexity?

Security is reasonably well understood for things like bank accounts and shared file network storage. Document management companies have also extended security deep within their systems for some time now. So how hard should adding security to searchable text be?

Part of the reason may be that, relatively speaking, search is still the new kid on the block, and companies are still climbing the learning curve; some enterprises are still struggling to get basic search working system-wide, and once they have it, the next priority is inevitably "fixing relevancy".

As companies progress with search, they do eventually start asking about security, as they rightfully should. The next phase usually involves the search vendor answering every question with "oh yeah, sure, no problem", or in the extreme the conversation degrades into an alphabet soup of abbreviations and technical terms that tend to give managers headaches.

Here are some of the factors that can make search engine security a bit complicated:

The lack of structure in many document stores.
The complexity of structure of other document stores.
There are many security standards and products already in place.
Some companies have also mixed in their own homegrown security or have legacy systems to contend with.
The search engine "spider" needs full access to all data, but needs to restrict results.
Search engine and security vendor vocabulary and product names.
Search vendors have been a bit sparse with public information on their websites.
Defining the business requirements for the various classes of users and classes of data, including the occasional requirement for "partial" access – more on that later.

Defining Requirements

Though it may seem obvious, the first step in implementing security is defining the requirements for a particular company. This is not as easy as it may sound. We've found that many clients haven't thoroughly defined their requirements, and as the design progresses, some unusual ones may surface.

Security Granularity

Granularity refers to the precision with which particular pieces of data can be secured. Companies have very particular rules about which documents, or which portions of which documents, various users can see. This is, by far, where the most interesting business requirements come from.

All or Nothing

The idea is that if you are "logged in" or otherwise validated, you can search for information; if you are not logged in, you cannot do any searches. As an example, smaller companies may allow all employees on the internal network to search all of the indexed data sources. If you are logged in, you can search; if you aren't, you can't.

This has been the traditional model of usage, especially in smaller organizations, but it is becoming obsolete. Today most companies have at least some public content, and even non-authenticated users are allowed to see it; at the other extreme, most employees are not allowed to see financial and human resources files. Because of this, most companies have outgrown this security model.

By Collection / Repository

One of the easiest and still reasonably useful techniques is to simply segregate data by security requirements. Public data is grouped into one section, restricted into a second, highly confidential into a third, etc. Most search engines support the concept of "collections", which may also be referred to as "repositories", "sources", "document indexes", "spokes" or "document sets". Search engines typically allow each of these to be turned on and off, in various combinations, for each search. Once the credentials and access level of an individual user are determined, the appropriate collections are enabled for their search.

[Security at the Collection and Document Level: Four repositories are shown: The first repository is entirely green and labelled 'Public', the second is yellow and labelled 'Extranet: Customers & Partners', the third is striped with red, yellow and green and labelled 'Mixed: Record-Level Security', the fourth repository is all red and labelled 'Private']

Fig. 1: Collection and Document Level Security

Implementing this "simple" idea can present some minor challenges, which we will cover later in this series.
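As a rough illustration of the collection approach, the sketch below maps a user's access level to the set of collections enabled for their search. The level names, collection names, the `COLLECTIONS_BY_LEVEL` table and the tiny in-memory `INDEX` are all assumptions made up for this example; a real engine would expose collection selection through its own query API.

```python
# Hypothetical access levels and collection names, for illustration only.
COLLECTIONS_BY_LEVEL = {
    "public": {"public"},
    "partner": {"public", "extranet"},
    "employee": {"public", "extranet", "internal"},
    "executive": {"public", "extranet", "internal", "private"},
}


def collections_for(access_level):
    """Return the collections to enable for a user's search.

    Unknown or missing levels fall back to public content only.
    """
    return COLLECTIONS_BY_LEVEL.get(access_level, COLLECTIONS_BY_LEVEL["public"])


def search(index, query, access_level):
    """Search only the collections that the user's level enables."""
    allowed = collections_for(access_level)
    return [
        doc["title"]
        for doc in index
        if doc["collection"] in allowed and query.lower() in doc["text"].lower()
    ]


# A toy index standing in for the search engine's collections.
INDEX = [
    {"collection": "public", "title": "Press release", "text": "New product launch"},
    {"collection": "private", "title": "Launch budget", "text": "Confidential launch costs"},
]

print(search(INDEX, "launch", "public"))
print(search(INDEX, "launch", "executive"))
```

Falling back to the public collection for an unrecognized level is a deliberate choice in this sketch: a misconfigured account sees less, not more.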

By Document / Record

This method of securing data will feel very familiar to those with a database background. Certain groups or users can see certain documents.

Databases and Content Management Systems have had this technology for a very long time, and enterprise search engines are quickly catching up.

Conceptually, in the realm of full-text search engines, the terms "record", "document" and "web page" mean almost the same thing: a retrievable unit of data. The specific terms used vary based on the background of the people working on the system or the physical source of the data.

[Document Level Security: Normally an entire document has the same security level.]

Fig. 2: Document Level Security

Note: If you are relatively new to search engines and have a database background, you might also want to read Contrasting Relational and Full-Text Engines from our June 2004 issue.

Complexities of this Security Model

One of the complexities with this model is the rendering of the results list. Typically a document or record itself will be well secured, but the search engine has indexed all of the content and is displaying lists of titles and summaries in a results list. It's not enough to secure the actual document; typically the results list is also required to not display even a title or summary from a document that the user cannot see. Often even a title or summary can convey important information; this is the first surprise some companies have when they implement this level of security. As an example, a title of "Indictment of John Smith Expected Tomorrow" tips off John Smith, regardless of whether he can read the entire document or not.

A more subtle detail of the secured results list is displaying the number of matching documents and the links that allow users to page through a long results list. A simple engine might display the total count of matching documents, whereas a highly restricted user may only have access to 10% of those records, so the count is quite misleading. Beyond cosmetics, the engine needs to have an accurate idea about which documents that user can see when it is offering links to pages 2, 3 and 4 of the results list.

An even more subtle detail, but one which can still be a requirement in highly secure systems, is whether the engine even confirms that certain terms appear in the document index at all.

As an example, searches for terms like "layoffs", "indictments" or the names of specific people can partially confirm the presence of information, even if no document titles are shown. A highly secure search will not confirm or deny the presence of terms in its index outside the context of what the user can search on. A more common example may be to not confirm the presence of obscenities or defamatory terms in non-accessible content.
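The three concerns above (titles, counts and paging, and term confirmation) can be sketched together. This is a minimal in-memory illustration, not how any particular engine implements it; the `Doc` class, the group names and the sample documents are hypothetical. The key assumption is that the security check runs before matching, so hidden documents cannot influence anything the user sees.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    title: str
    text: str
    allowed_groups: frozenset


def secure_search(index, query, user_groups, page=1, page_size=10):
    """Return one page of hits plus counts, all limited to visible documents."""
    groups = set(user_groups)
    # Apply the security check first, so hidden documents never influence
    # the titles, the total count, or the number of pages.
    visible = [d for d in index if d.allowed_groups & groups]
    hits = [d for d in visible if query.lower() in d.text.lower()]
    start = (page - 1) * page_size
    return {
        "total": len(hits),
        "pages": -(-len(hits) // page_size),  # ceiling division
        "titles": [d.title for d in hits[start:start + page_size]],
    }


# Hypothetical sample data.
INDEX = [
    Doc("Cafeteria menu", "No layoffs of kitchen staff", frozenset({"employees"})),
    Doc("Layoffs planned for Q3", "Layoffs list by department", frozenset({"hr"})),
]

print(secure_search(INDEX, "layoffs", ["employees"]))
print(secure_search(INDEX, "department", ["employees"]))
```

In this sketch a user in the "employees" group who searches for "department" gets the same empty response as for a term that appears nowhere, so the result neither confirms nor denies that the term exists in restricted content.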

By Field / Subdocument

At this level of detail the design and implementation complexity starts to ramp up. The general idea is that different users can see different portions of the same document.

Some examples:

All managers can see summaries of sales documents, but only VPs, Finance and Sales can see the specific financial terms.
Partners can read the text of bug reports, but can't see the company that logged the issue.
Sales Engineers can view technical design documents, but can't read certain proprietary details.
Medical researchers can read legal cases, but cannot see patient details.

[Sub-Document Security: Based on Well-Defined Fields: Different sections of the document require different levels of security to access.]

Fig. 3: Sub-Document Security: Based on Well-Defined Fields

This still sounds pretty straightforward, but the implementation details can get a bit sticky. In the previous section we mentioned that, conceptually, "documents" and "records" are quite similar in the scope of search engines. However, from an implementation standpoint, subdividing a database record on field boundaries is much easier than subdividing a physical document, so when it comes to implementation, document vs. record DOES matter.

Selecting only certain search fields from a database is easy to control. But automatically detecting and removing certain parts of unstructured documents can prove difficult. If a set of documents were designed from the start for this purpose, tools like XSLT could be used to break them apart; in practice the search engine team inherits somewhat random sets of documents. In some cases formatting can be used to infer security context. Some document formats are harder to subdivide than others.

Typical ease of document subdivision:

Database record: easy (via select statement or view)
XML: easy (via XSLT)
HTML: moderate (HTML not always well-formed)
PDF: moderate to difficult (depends on PDF format)
Proprietary office documents: difficult (often requires document filtering library and custom code, or document conversion)
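For the easy case, a record with well-defined fields, field-level security amounts to projecting only the fields a user's roles allow. The sketch below assumes a hypothetical `FIELD_ROLES` table and a made-up sales record loosely based on the first example above; it is an illustration of the idea, not a vendor feature.

```python
# Hypothetical field-to-role rules for a sales record; "*" means any role.
FIELD_ROLES = {
    "title": {"*"},
    "summary": {"manager", "vp", "finance", "sales"},
    "financial_terms": {"vp", "finance", "sales"},
}


def visible_fields(record, roles):
    """Return a copy of the record holding only the fields these roles may see.

    Fields with no rule are dropped, so a newly added field is hidden by default.
    """
    roles = set(roles)
    result = {}
    for name, value in record.items():
        allowed = FIELD_ROLES.get(name, set())
        if "*" in allowed or allowed & roles:
            result[name] = value
    return result


# Made-up sample record.
RECORD = {
    "title": "Acme renewal",
    "summary": "Three-year renewal signed",
    "financial_terms": "Net 30, 12% discount",
    "internal_notes": "Not yet classified",
}

print(visible_fields(RECORD, ["manager"]))
print(visible_fields(RECORD, ["finance"]))
```

The same projection would presumably need to apply to what is searchable, not just to what is displayed; otherwise a manager could still match on financial terms they cannot read.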

More open document standards are coming into use, and even Microsoft has plans to embrace them. So in the future, subdividing documents should become easier.

A Somewhat Odd Combination: "Title Teasers"

We've seen this implementation enough to call it out separately. Some sites that charge for content allow users to see the title of documents in their results list, and perhaps even a summary, but the user must then pay to see the entire text of the article.

This is a bit atypical because, until the user has paid, they do not have rights to read the document. We said previously that results lists shouldn't even show a title if the user doesn't have rights to see the document; this case is obviously an exception. It could be viewed as a rather extreme form of field level security. The other oddity is that a user's access to particular documents can routinely change, if they decide to pay. On the implementation side, this may require some adjustments to the system.
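One way to picture the "title teaser" case is as a field-level rule keyed on purchases rather than roles. In the sketch below, the `render_hit` function, the `purchased_ids` set and the sample article are hypothetical; the assumption is that purchases are looked up at query time, so that access changes as soon as the user pays.

```python
def render_hit(doc, purchased_ids):
    """Build a results-list entry: always a teaser, full text only if purchased."""
    hit = {"title": doc["title"], "summary": doc["summary"]}
    if doc["id"] in purchased_ids:
        hit["body"] = doc["body"]
    else:
        hit["purchase_required"] = True
    return hit


# Made-up sample article.
ARTICLE = {
    "id": "a-17",
    "title": "Market Outlook",
    "summary": "Key trends for the year.",
    "body": "Full report text.",
}

print(render_hit(ARTICLE, set()))       # before paying: teaser only
print(render_hit(ARTICLE, {"a-17"}))    # after paying: full text included
```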

Sub Field

Vocabulary: Redacting. The act of removing very specific pieces of information from a document, such as specific words and phrases, or perhaps specific names and locations. The removed information may be represented by black boxes, or perhaps removed entirely with no specific visual cue.

[Sub-Field Security: Based on Key Entities: Selective Redaction: Specific terms and references are removed, while still allowing partial disclosure. In this example, a document entitled 'Settlement Details' begins with the header 'Between Acme and ####' (the other party's name is blanked out). Portions of the main text are also blacked out.]

Fig. 4: Sub-Field Security: Based on Key Entities

In some cases it is a requirement to restrict information at the sub-field level. For example, we've all seen news reports that show documents where specific people's names have been blacked out. In this case the removal of information isn't bounded by a neat field or document boundary; it involves removal of more specific words and phrases at a very fine level of granularity and control. In some respects this is an extension of sub-document retrieval; if a document is unstructured, then removing portions of it uses some of the same techniques as sub-field removal.

And yes, search engines can even be coaxed into handling this type of situation. Remember, it's not enough to remove these terms from the actual document when being viewed; most secure environments would also stipulate these words and phrases not show up in the results list titles or summaries. And further, a really secure system shouldn't even confirm that the removed words appear in the index at all.
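A simple form of redaction can be sketched with regular expressions. The example below masks a list of protected terms and applies the same masking to both the title and the summary of a results-list entry, since both are displayed. The function names, the "####" mask and the party name "Globex" are assumptions for illustration; real systems often rely on entity extraction rather than a fixed term list.

```python
import re


def redact(text, protected_terms, mask="####"):
    """Replace each protected term with a mask, ignoring case."""
    if not protected_terms:
        return text
    # Longest terms first, so "John Smith" is masked before "Smith".
    ordered = sorted(protected_terms, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in ordered), re.IGNORECASE)
    return pattern.sub(mask, text)


def redact_hit(hit, protected_terms):
    """Apply the same redaction to every displayed part of a search hit."""
    return {name: redact(value, protected_terms) for name, value in hit.items()}


# Made-up sample hit; "Globex" is a hypothetical party name.
HIT = {
    "title": "Settlement Details: Between Acme and Globex",
    "summary": "Globex agrees to pay Acme an undisclosed amount.",
}

print(redact_hit(HIT, ["Globex"]))
```

This sketch only covers display. It does not address the last point above: a query for a protected term would still need to be handled so that the engine does not confirm the term exists in the index.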

Hybrid: Record AND Field

We realize that some of these scenarios sound like "overkill", but we have personally seen these requirements and worked to implement them at specific clients.

Moreover, some businesses require a combination of two or more of the techniques mentioned above. Some data is all public, whereas other repositories have a document by document access model. And then some documents have further restrictions within the document or fields. Due to NDAs we cannot go into details, but the organizations that spend money to implement these highly customized systems do so out of necessity; they need to share data in a very controlled way, but with the convenience and efficiency of a search engine.

The point being that, no, we are not academics trying to dream up weird theoretical edge cases. Big organizations have lots of data and lots of folks who need to access it. In the past few years, as they have embraced search technology, their requirements have come along for the ride.

Levels of Users

This is the other side of the security equation. Generally this area is much more widely understood, and is about the same for search as it is for other systems. The details of implementation may present the only challenge.

Generally users can be classified as:

Global Status

All users who can access the system, or are otherwise "verified", share the same security credentials. As with data, this "one size fits all" model is often inadequate.

The one exception, where this model may make sense, is for completely public services, where every piece of data is intended to be public, and the search engine is not used for any internal data.

By General Status

In this model, access is assigned by title or rank within the organization. Levels of access might include Partner / VP, Management, Employee, Customer, Public. Or a similar model could be adopted based on military rank, etc.
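Because these levels are ordered, a rank-based check reduces to a comparison. The sketch below uses the levels named above; the `Level` enum, its numeric values and the `can_view` helper are assumptions for illustration.

```python
from enum import IntEnum


class Level(IntEnum):
    """Access levels from least to most privileged (numeric values are arbitrary)."""
    PUBLIC = 0
    CUSTOMER = 1
    EMPLOYEE = 2
    MANAGEMENT = 3
    PARTNER_VP = 4


def can_view(user_level, required_level):
    """A user may view content at or below their own level."""
    return user_level >= required_level


print(can_view(Level.EMPLOYEE, Level.CUSTOMER))    # employee reading customer-level content
print(can_view(Level.CUSTOMER, Level.MANAGEMENT))  # customer reading management-level content
```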

By Group / Role

In this model, arbitrary groups of users can be defined. Some of these groups may still be based on management level or rank. But also roles such as "Human Resources" and "Finance" can be defined to allow some subordinates in specific roles to have access to additional appropriate data. Another example would be letting customer service personnel access customer data. Another group sometimes defined is the user's current workgroup, so that they can easily share information with immediate coworkers.

Specific User

This model may be combined with the group model mentioned above. Security can be doled out on a user by user basis.

In terms of search engines, this model may be difficult to implement, depending on the implementation method and the specific search vendor. As we will discuss in the next installment in this series, the preferred "early binding" security filter method may be overwhelmed by the potentially enormous security filter this model may require.

A special class of "user" is often "self" or "owner". Almost all systems allow users access to their own documents and content, unless their job is simple data entry. This could be considered a special "role" or group.

These last two security models are commonly associated with Access Control Lists (ACLs), a long standing security model. Data about specific users and groups may be implemented with Lightweight Directory Access Protocol (LDAP).
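The group, specific-user and owner models can be combined in a single ACL check, sketched below. The `User` and `Acl` classes, the names and the group memberships are hypothetical; in practice the group data would typically come from a directory service such as LDAP rather than being hard-coded.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    name: str
    groups: set = field(default_factory=set)


@dataclass
class Acl:
    owner: str
    users: set = field(default_factory=set)
    groups: set = field(default_factory=set)


def can_read(user, acl):
    """Grant access to the owner, to listed users, or to members of listed groups."""
    return (
        user.name == acl.owner
        or user.name in acl.users
        or bool(user.groups & acl.groups)
    )


# Made-up users and a made-up ACL for a budget document.
alice = User("alice", {"finance"})
bob = User("bob", {"engineering"})
carol = User("carol")
dave = User("dave", {"sales"})
budget_acl = Acl(owner="carol", users={"bob"}, groups={"finance"})

for person in (alice, bob, carol, dave):
    print(person.name, can_read(person, budget_acl))
```

Here alice qualifies through her group, bob as a specifically listed user, and carol as the owner, while dave matches none of the three rules.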

In Our Next Installment

Next time we will start talking about implementation. For those of you who were looking for the references to SSO and all the other protocols that comprise modern security systems, that's implementation!

Meanwhile, interested readers might start working through their own requirements, nailing things down. It is these types of complex requirements, and the alphabet soup of protocols and vendor products, that make these systems a bit tricky to implement, but also very interesting.

And as always, we'd like to hear your thoughts and experiences on security. Which vendor you used, which add-ons, etc. What worked, and maybe what didn't.
