new idea ENGINEERING         Home  | Products  | Services  | Newsletter  | Resources  | About Us | Contact Info | Privacy Policy        

  Specializing in Enterprise Search since 1996 - including FAST, Autonomy, Google, Endeca, Dieselpoint and Lucene



Ask Doctor Search: Using Enterprise Search on Network Storage Devices

Volume 3 Number 3 - Spring 2006

Dear Dr. Search:

Can I put my search engine indexes on a network drive?  For example, can I build K2 collections over the network to our NAS box, point the K2 server at that same box for searching?

Dr. Search answers:

Great question! The short answer is "it depends..."

The allure of networked drives is very strong: just build and search all of your indices in one location, and have all indexing and search servers access that one place.

First off, yes, you're right to be concerned.  Generally speaking, ALL search engines will run slower on a generic networked drive than they would on local storage.  The only difference is whether each vendor specifically talks about this issue or not.  But this is not a blanket "no, don't do it", so keep reading!

Why Storage Matters:

Under the covers, at a very low level, all search engines have a similar architecture.  They all "digest" documents ahead of time, and store information about every word in every document in some proprietary set of binary files.  Library card catalogs and the Index section of books are good analogies.  When a search is run, they consult their binary indices to look up the answers quickly, vs. having to check the original documents.  Vendors have different names for these binary files: "collections", "indexes", "segments", "document catalogs", etc.  Of course engines differ in features and performance; we're not saying that all engines are "the same" in terms of functionality, just that they all use indexes underneath.

Since these indices form the core of every search that's performed, search engines make heavy use of them; in fact search engines spend most of their time jumping around inside these binary files.  So any action that slows down this access will slow down the search engine.

Factors of Disk Access:

Because disk access is so critical to search engine performance, it's worth drilling down a bit further.  There are three general areas of disk access that are affected by local vs. networked storage:

  1. Throughput: how many bytes per second
  2. Latency: how long each new request takes to start returning data
  3. OS factors: how the operating system handles caching and file handles, etc.

Throughput and Latency:

The first factor most people talk about is throughput: how many Megabytes (or Gigabytes) per second can be read into the computer's memory.  While this is certainly a factor, we think it gets too much attention.  Raw throughput is important when reading or writing single large blobs of data, such as database dumps or video files, but this isn't the type of usage profile a search engine has.

The second factor, latency, needs to be given more consideration.  Search engines do a lot of "round trips" back and forth to the indices stored on disk.  Let's look at an extremely simplified example:

  • User looks for "budget"
  • The engine opens the "word index" to look for the base location for all "budget" references
  • The engine then must seek to that separate data segment, where the "budget" info actually is.
  • The engine must then read through all instances of "budget" to find matching documents; typically it will find internal document IDs that reference an internal documents table.  If this were a multiple word query, it would need to do the same for all other words.
  • For each ID that "budget" matched, the engine must then access the list of documents.  First it accesses the base of the list, then looks up the matching IDs it has.
  • For each internal document ID, it must then look up each field (title, date, URL, etc.)
  • For each field, it then has another set of accesses that it does to locate the field's contents.

Notice that, for most of these steps, another complete round trip to the disk is needed; and the next step can't proceed until the results of the previous step are returned.  Even each access of a keyword, document ID or a field value often requires multiple trips to the disk; once to find the "base" and do the lookup, then another to access the "instance".  Of course vendors can score points by heavily optimizing all these steps, and caching can minimize physical access.
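The steps above can be sketched as a toy example.  This is only an illustration, not how any particular vendor lays out its index: the `CountingStore` class, the key names and the sample data are all made up for this sketch, and each `get()` call simply stands in for one dependent trip to the disk.

```python
class CountingStore:
    """Pretend disk: every get() counts as one round trip (hypothetical helper)."""

    def __init__(self, data):
        self._data = data
        self.round_trips = 0

    def get(self, key):
        self.round_trips += 1
        return self._data[key]


def search(store, term, fields=("title", "url")):
    # Step 1: consult the word index to find where the term's postings live.
    postings_key = store.get(("word_index", term))
    # Step 2: read the postings, i.e. the internal document IDs for the term.
    doc_ids = store.get(postings_key)
    results = []
    for doc_id in doc_ids:
        # Step 3: look up the document's entry in the documents table.
        record_key = store.get(("doc_table", doc_id))
        # Step 4: fetch each requested field, one access per field.
        row = {name: store.get((record_key, name)) for name in fields}
        results.append(row)
    return results


# Made-up sample data: two documents that contain "budget".
store = CountingStore({
    ("word_index", "budget"): ("postings", 7),
    ("postings", 7): [101, 102],
    ("doc_table", 101): ("record", 0),
    ("doc_table", 102): ("record", 1),
    (("record", 0), "title"): "FY2006 budget",
    (("record", 0), "url"): "/docs/a",
    (("record", 1), "title"): "Budget review",
    (("record", 1), "url"): "/docs/b",
})
hits = search(store, "budget")
print(len(hits), "hits,", store.round_trips, "round trips")
```

Even in this tiny case, two matching documents with two fields each cost eight accesses, and each access has to wait for the one before it.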

Multiply all this by millions of documents, multiple search terms, and the vendor's document ranking logic, and you can see there are MANY round trips back and forth to the disk.  It's this sequential nature of disk access which greatly multiplies any problems with latency; any slowdown for an average disk access is multiplied many fold by all these round trips.
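A rough back-of-the-envelope model shows why latency tends to dominate.  The function below is a simplification that assumes every round trip is strictly sequential and ignores caching; the figures passed to it (2,000 round trips, 0.1 ms vs. 1 ms per access, 2 MB read, 100 MB/s) are invented purely for illustration, not measurements of any real system.

```python
def query_time_ms(round_trips, latency_ms, bytes_read=0, throughput_mb_s=100.0):
    """Estimate query time as (sequential round trips x latency) + transfer time."""
    transfer_ms = bytes_read / (throughput_mb_s * 1_000_000) * 1000
    return round_trips * latency_ms + transfer_ms


# Illustrative numbers only: same query, same throughput, different latency.
local = query_time_ms(round_trips=2000, latency_ms=0.1, bytes_read=2_000_000)
networked = query_time_ms(round_trips=2000, latency_ms=1.0, bytes_read=2_000_000)

print(f"local:     {local:.0f} ms")
print(f"networked: {networked:.0f} ms")
```

With these assumed numbers the transfer time is identical in both cases; the entire difference comes from the per-access latency being multiplied by the number of round trips.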

This round trip latency issue is probably the biggest Achilles' heel for average networked storage.  If you understand this ahead of time, you can partially compensate for it.  Network physical and logical topology can be altered to minimize latency.  For example, can the disk storage, indexing and searching machines all be put on the same network segment?  Can ancillary network traffic be moved elsewhere?  These types of optimizations can dramatically reduce latency, and make network storage more likely to perform acceptably.
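One way to get a feel for latency before and after such changes is to time many small reads at random offsets in a large file, once on local disk and once on the network mount.  The sketch below is a minimal, assumed approach rather than a proper benchmark: the function name and defaults are our own, the demo only exercises a temporary local file, and operating system caching will make repeated runs look faster than a cold run.

```python
import os
import random
import tempfile
import time


def average_read_latency_ms(path, reads=200, block_size=4096, seed=0):
    """Time small reads at random offsets; return the mean in milliseconds."""
    size = os.path.getsize(path)
    if size < block_size:
        raise ValueError("file is smaller than one block")
    rng = random.Random(seed)
    total = 0.0
    # buffering=0 avoids Python's own read buffer; the OS cache still applies.
    with open(path, "rb", buffering=0) as f:
        for _ in range(reads):
            offset = rng.randrange(0, size - block_size + 1)
            start = time.perf_counter()
            f.seek(offset)
            f.read(block_size)
            total += time.perf_counter() - start
    return total / reads * 1000


# Demo on a throwaway 1 MB file; point the function at a file on your
# local disk and on your network mount to compare the two.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(1024 * 1024))
    sample_path = tmp.name
try:
    latency_ms = average_read_latency_ms(sample_path)
    print(f"average read latency: {latency_ms:.4f} ms")
finally:
    os.remove(sample_path)
```

For a meaningful comparison, use a file much larger than available memory, or the cache will hide most of the difference.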

 
 

Sorting out NAS and SAN

Dr. Search always smiles when he sees computer acronyms like this; it's like they're trying to make this confusing. Yes, NAS and SAN are both related to disk storage, and No they don't mean the same thing. But because they sound so similar, many casual observers are easily confused.  And in recent years, the distinction between these two technologies has blurred, making it more confusing.
For the record:
NAS = Network Attached Storage
SAN = Storage Area Network

Generally speaking, NAS is more common, usually less expensive, and often slower.  NAS includes those "network appliances" popular in many companies.  In recent years there have even been "networkable" disk drives aimed at consumers and small businesses; some even support Wi-Fi as well as Ethernet.  A traditional File Server would also be considered NAS.  NAS is typically easier to integrate because virtually all computers are already networked, and therefore able to access NAS; in a few cases additional drivers may be needed.

There are high end NAS solutions.  Network Appliance has some higher end NAS products that approach the performance of SAN solutions.

SAN solutions are more expensive, sometimes much more expensive, and usually offer industrial strength performance.  They attach to computers at a much lower level, and the computer usually sees them as "local" storage.  An early example of this type of system was the shared disks in the old VAX Clusters (though we don't recall it being called "SAN" back then).  More modern implementations often include Fibre Channel, a type of high speed connection usually based on fiber optics.  Some SAN systems can now use Ethernet, which is what NAS typically uses, thus making the distinction a bit more confusing.

Comparing NAS to SAN in "Nerd Speak", NAS devices are "file based", whereas SAN devices are "block based".

If you have large data volumes you should consider higher end NAS products, or SAN solutions.

 

Additional Operating System Factors:

The third factor is a bit more of a wildcard.  The underlying operating systems on both ends of the connection, and their underlying network protocols, can have a more unpredictable impact on networked storage performance.  Given all the variables the best we can do is warn you about the general areas we've seen.

File handles are used by programs to access the files on a hard drive, either local or networked, and there is typically a limit to how many file handles can be used at once; this limits the number of files a process can have open.  Some search engines have hundreds, or even thousands, of separate files in their search indices.  If there are too few file handles, the search engine will need to close some files before opening others; this takes time.  If a busy search engine routinely needs to access more files than the number of file handles will allow simultaneously, it will waste a lot of time opening and closing the same files over and over again.  Most engines WILL NOT COMPLAIN about this situation; they will assume this is "normal" and simply run slower.

Most modern operating systems will allow thousands of files to be open at once; this was often not the case back in the 1990s.  On Unix, however, this number can still sometimes be artificially capped; if you run on Solaris, Linux, etc., you should double check this with the "limit" or "ulimit" command.
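If you would rather check the limit from a script than from the shell, something like the following should work on Unix-like systems.  It uses Python's standard `resource` module (not available on Windows); the `enough_handles` helper, its `headroom` default and the file count of 1500 are our own illustrative assumptions, since the real number of index files depends on your engine.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) per-process limits on open file descriptors."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def enough_handles(index_file_count, headroom=64):
    """Rough check: can this process hold every index file open at once?

    headroom is an assumed allowance for sockets, logs and other descriptors.
    """
    soft, _hard = open_file_limits()
    if soft == resource.RLIM_INFINITY:
        return True
    return soft >= index_file_count + headroom


soft, hard = open_file_limits()
print(f"soft limit: {soft}, hard limit: {hard}")
print("room for 1500 index files:", enough_handles(1500))
```

Note that this reports the limit for the process running the script; the search engine itself may be started with different limits, so run the check from the same account and startup environment.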

More importantly, NETWORK file handles may be treated differently than local files.  A system might allow for hundreds of local files to be open, but for only a dozen or so network files.  Dr. Search hasn't seen direct evidence of this in many years, so this too may be moot with modern systems.  Checking this might not be straightforward either.  You could hold off checking this unless you really think you're having a problem and have exhausted other avenues.

File caching behavior can also be different between local and network files.  File caching is often most evident when you compare the first search runtime to subsequent searches' times; the first search will be very slow, but most searches after that will be faster.

As an example of local vs. remote file caching, Dr. Search has noticed that Windows will cache local files rather consistently, even between two different process runs at different times.  However, the caching of this same data, when stored on a network drive, is much more unpredictable.  This type of variance also makes "side by side" testing difficult: test results comparing the two can vary depending on the order in which the tests were run, which can give confusing results if the tester is not aware of this variable.  To be really safe, a thorough tester could consider rebooting all machines between each test, though of course this would be time consuming.  And all tests must report "initial search" and "subsequent search" times separately.  Also, searching for the same terms a second time can give different results than searching for different terms.
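A small timing harness can keep the "initial" and "subsequent" numbers apart.  The sketch below is one possible shape for such a harness, not a standard tool: `time_searches` is a name we made up, and `fake_search` is a stand-in that merely simulates a slow first access so the example runs on its own.  In practice you would pass in a function that calls your real search engine.

```python
import statistics
import time


def time_searches(run_search, queries, repeats=3):
    """Time each query once 'cold', then average `repeats` further runs."""
    first, later = {}, {}
    for query in queries:
        timings = []
        for _ in range(repeats + 1):
            start = time.perf_counter()
            run_search(query)
            timings.append(time.perf_counter() - start)
        first[query] = timings[0]
        later[query] = statistics.mean(timings[1:])
    return first, later


# Stand-in for a real engine: the first search for a term is slow (simulated
# disk access), repeats are fast (simulated cache hit).
_seen = set()


def fake_search(query):
    if query not in _seen:
        time.sleep(0.02)
        _seen.add(query)
    return []


first, later = time_searches(fake_search, ["budget", "forecast"])
for query in first:
    print(f"{query}: first {first[query] * 1000:.1f} ms, "
          f"later {later[query] * 1000:.1f} ms")
```

Using several different query terms, as here, also helps separate "same term again" effects from general file caching.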

Considering what is "fast enough":

With all this said, network storage might work OK for your specific application.  Many search engines are really fast these days, so for many applications they have lots of horsepower to spare.  If you only have 10,000 documents and only moderate search activity, even a "bad" setup might be fast enough.  If you add in more documents or more search load, you might still get away with it, although you may have to make some adjustments (network topology, etc.).

When you consider the operating system, remember to consider the OS in use at both ends of the connection: on the computer that is running the software, and on the computer or appliance that is hosting the disk.  A heavily loaded generic file server is going to be much less responsive than a dedicated server or appliance.  Advanced groups might also experiment with different network and disk mounting protocols; for example you could compare TCP/IP to NetBEUI, or NFS to SMB.  And, broadly speaking, SAN will be faster than NAS, though the lines between the two are blurring and there are certainly exceptions to this rule.  See the sidebar to this article for definitions and further discussion.

Vendor Statements:

Some vendors talk about disk drives and hardware earlier in the sales process than other vendors.  These careful vendors are trying to make sure that their product delivers what is promised, so they quote specific performance numbers, implemented on specific system configurations.  In order to show the best performance, these spec'd systems will often indicate local disk storage, or perhaps even high end RAIDed SCSI disk systems; this doesn't mean their software won't work on other configurations, they just can't promise what the performance trade off would be.  Other vendors are more casual; they assume that everybody knows that cheap disks are slower than high end SCSI, and that network storage is generally slower still.  And they assume that customers understand that slower disks mean slower search times, and that at the end of the day the customer will test their own specific configurations and make adjustments as needed.  When you are selling expensive solutions to sophisticated IT groups, it's certainly reasonable to assume that, no matter what you tell them, they will ultimately trust their own analysis and testing anyway.

Dr. Search can understand both approaches and is not taking sides; each vendor and IT department is different.

But please understand that ALL VENDORS are affected by these system configurations, and the only difference is how proactive each vendor is in quoting specific configurations.  So if you intend to use network storage, and Vendor A spec's high end local SCSI, and Vendor B says nothing, that doesn't mean you should throw out Vendor A!  Instead, you should come up with consistent questions to ask both Vendor A and Vendor B.  Also, if performance is still a concern and you still want to use network storage, ask the vendor about an on-site evaluation.  Or make room in the implementation schedule for additional performance testing.  Another approach is to have some discussions with the storage system vendor or in house experts.  There are many choices in shared storage; if one solution doesn't fit, another one might.

In Summary:

Can you put your search indices on network storage?  Maybe!

Key points:

 


Copyright New Idea Engineering, Inc 1996 - 2008