Ask Doctor Search: Indexing files, displaying web pages
Volume 3 Number 5 - Summer 2006
Our web content includes links to other parts of our company as well as to external partner web sites. To get full control over what gets indexed for search on our site, we have decided to crawl our file system rather than try to set spider rules and depths that vary by which site we index. How can we assign a display URL in K2 so that clicking on a link jumps to the right page, even though the K2 vdkvgwkey is a fully qualified file name?
When you crawl a file system, the K2 vdkvgwkey field typically contains the fully qualified file name. For example, on our web site, the top level page is actually /usr/local/data/index.html. To reach that file from the web, you type http://www.ideaeng.com/index.html. (Note: because of web server tricks, you can often omit the actual file name for index pages; we're just showing the fully qualified URL here for clarity.) With the file name in the vdkvgwkey field, and also in the doc_fn field, you need to do something to make sure the search result page link doesn't try to open a file on your user's computer.
Typically, what you want to do is replace the root file path with the top level URL; that is, for every record/document in K2, replace /usr/local/data/ with http://www.mydomain.com/. You can do this under the control of your web application, but a better solution is probably to solve the problem at index time so K2 always knows the right URL. This also ensures that K2 command line tools like rcvdk and rck2 know how to view the document as well.
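The substitution itself is just a prefix swap. As a minimal sketch of the rule, assuming the root path and top level URL used in the paragraph above (the variable names ROOT and BASE_URL and the function key_to_url are hypothetical, not part of K2):

```shell
#!/bin/sh
# Hypothetical settings: the file system root that was crawled and
# the URL prefix that should replace it.
ROOT="/usr/local/data/"
BASE_URL="http://www.mydomain.com/"

# key_to_url: print the display URL for one fully qualified file name.
# Keys that do not start with ROOT are printed unchanged.
key_to_url() {
    case "$1" in
        "$ROOT"*) printf '%s%s\n' "$BASE_URL" "${1#"$ROOT"}" ;;
        *)        printf '%s\n' "$1" ;;
    esac
}
```

Calling key_to_url /usr/local/data/index.html would print http://www.mydomain.com/index.html. The same logic could live in the web application that renders the result list, which is the display-time alternative to fixing the field at index time.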
Unix Command:
    find /usr/local/data -name "*" -print > filelist
Windows CMD Command:
    dir c:\inetpub\wwwroot\* /s/b > file_lists
Figure 1: Obtaining a List of All Files Under a Starting Directory
Once you have the file list, it's a fairly simple matter of writing a shell or Perl script to convert your file list into a bulk insert file like the one in Figure 2.
Sample Bulk Insert File
    vdkvgwkey: /usr/local/data/index.html
    URL: http://www.mysite.com/index.html
    <<EOD>>
    vdkvgwkey: /usr/local/data/about.html
    URL: http://www.mysite.com/about.html
    <<EOD>>
    vdkvgwkey: /usr/local/data/pubs/figures.pdf
    URL: http://www.mysite.com/pubs/figures.pdf
    <<EOD>>
Figure 2: Sample Bulk Insert File
A clever shell script person with some regex experience can probably do it with a simple shell command or two!
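One way that conversion might look is sketched below. It assumes a Unix file list with one path per line and the record layout of Figure 2. The names LIST_ROOT, SITE_URL and make_bulk_insert are hypothetical, and the list is assumed to contain files only (for instance, produced by adding -type f to the find command), since the script cannot tell directories from files.

```shell
#!/bin/sh
# Hypothetical settings: the crawled root and the site URL prefix.
LIST_ROOT="/usr/local/data/"
SITE_URL="http://www.mysite.com/"

# make_bulk_insert: read a file list ($1, one path per line) and write
# one bulk insert record per path to standard output.
# Paths that do not begin with LIST_ROOT are skipped.
make_bulk_insert() {
    awk -v root="$LIST_ROOT" -v base="$SITE_URL" '
        index($0, root) == 1 {
            rel = substr($0, length(root) + 1)   # path relative to the root
            if (rel == "") next                  # skip the root entry itself
            printf "vdkvgwkey: %s\nURL: %s%s\n<<EOD>>\n", $0, base, rel
        }
    ' "$1"
}
```

Running make_bulk_insert filelist > bulkfile would then produce records in the shape shown in Figure 2. Windows paths would need extra care, because awk interprets backslashes in values passed with -v.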
The map file lets you specify an automatic substitution and copy from one style file field to another. For example, consider the map file shown in Figure 3:
Sample map file
    map URLS.map:
    URL http://www.mysite.com/
    vdkvgwkey /usr/local/data/
Figure 3: Sample Map File for vspider
This will cause vspider to copy the contents of the VdkVgwKey field into the URL field during indexing, replacing the string /usr/local/data/ with http://www.mysite.com/ anywhere it occurs in the original field.
Now, when you are ready to index a file, use the additional vspider options prefixmapfile, to specify the file containing the map, and abspath, to direct vspider to use fully qualified file paths, as follows:
vspider Command Line
    vspider -collection website -start c:\inetpub\wwwroot\ -abspath -prefixmap /usr/local/config/mapfile.txt
Mapfile for vspider
    vdkvgwkey C:\inetpub\wwwroot\ url http://localhost/
Note that you must use abspath; and while some documentation says you must use double backslashes ('escaped' in Unix terms), we found that not to be the case. Also note that you will want to specify the trailing backslash (or forward slash in Unix) at the end of both patterns.
When you have completed the index, you can view the contents of your fields using rcvdk, a technique we describe elsewhere in this issue of Enterprise Search.
rcvdk
    x
    fields url 35 vdkvgwkey 35
    s
    r
As you can see, the prefixmapfile is a powerful tool, but be sure you understand that indexing a file system for the web comes with some risks. In particular, you may find that your web file directories have multiple versions of documents or, even worse, pre-release versions which you are not quite ready to release generally. The vspider tool traverses an entire directory unless you specify an exclude pattern, so use care.
In Summary:
K2 has a number of options that provide powerful capabilities when you need them. You can use any of several tools to populate fields, whether you do it manually or with the vspider prefixmapfile option.
We hope this has been of some use to you; feel free to contact me directly if you have any follow-up or additional questions. Remember to send your enterprise search questions to Dr. Search. Every entry (with name and address) gets a free cup and a pen, and the eternal thanks of Dr. Search and his readers.
Copyright New Idea Engineering, Inc 1996 - 2008