new idea ENGINEERING         Home  | Products  | Services  | Newsletter  | Resources  | About Us | Contact Info | Privacy Policy        

  Specializing in Enterprise Search since 1996 - including FAST, Autonomy, Google, Endeca, Dieselpoint and Lucene

Not a subscriber? Sign up at http://www.ideaeng.com/subscribe.html

Autonomy K2 v6: Indexing an ODBC Database

By Clinton Allen, New Idea Engineering Inc. - Volume 4 Number 2 - April 2007

This month Clinton gives us a Quick Start for building a K2 collection with the ODBC driver and Vspider. K2 is currently sold by Autonomy, though some people still refer to it as Verity K2.

Step 1: Introduction / Understanding the Process

Indexing a database using ODBC and vspider requires more configuration than most tasks using vspider, and the documentation can be confusing. Here we provide an example to help get something running and to serve as base for further development. To keep it simple we focus on indexing records from the sample database table EMP provided by Oracle. We will only need to make minimum changes to the sample K2 ODBC style files, using default values where ever possible. In this article we're only going show how to index one text column in the table; of course we will also have to index one other column to serve as the primary key. At the end of the article we've included a link to download the full set of edited style files configured to index all of the columns of EMP.

Although the steps listed below may look complex, they actually break down into a few logical tasks:

Quick Links / Summary of Steps: Two other points to keep in mind: [ back to top ]

[ back to top ]

Step 2: Information You Will Need

There are several pieces of information needed to configure K2 and ODBC indexing. [ back to top ]

[ back to top ]

Step 3: Tools

There are several useful tools for obtaining this information and trouble shooting.

Also the trace log should be activated in the ODSA, under the Tracing tab, or in the odbc.ini file. This log is useful for showing the complete SQL query and troubleshooting problems with vspider opening the database.
[ back to top ]

[ back to top ]

Step 4: Defining ODBC Data Sources

Step 4a: On Windows

As we mentioned, on Windows there is a tool called ODSA, the ODBC Data Source Administrator. This can be found under:
Start --> Control Panel --> Administrative Tools --> Data Sources (ODBC)
Fill in the blanks from the information you gathered in Step 2.

The ODSA has a "test a connection" button to check the ODBC credentials and drivers, and it allows you to change the SID, port numbers and other settings. On other operating systems such as Solaris this information is set in the odbc.ini file, which is discussed in the next section.

Step 4b: On Unix

On Unix you will need to define your data source in the odbc.ini file, and then setup environment variables to point to it.

Sample Oracle ODBC INI file:

[EMP_TABLE]
Driver=/verity/install/path/k2/k2611-01-prd/k2/_ssol26/odbc/lib/VRora20.so
Description=DataDirect 5.0 Oracle Wire Protocol
ApplicationUsingThreads=1
ArraySize=60000
CachedCursorLimit=32
CachedDescLimit=0
CatalogIncludesSynonyms=1
CatalogOptions=0
DefaultLongDataBuffLen=1024
DescribeAtPrepare=0
EnableDescribeParam=0
EnableNcharSupport=0
EnableScrollableCursors=1
EnableStaticCursorsForLongData=0
EnableTimestampWithTimeZone=0
HostName=oracle-machine-name-or-IP-address
LocalTimeZoneOffset=
LockTimeOut=-1
LogonID=SCOTT
password=TIGER
PortNumber=1521
ProcedureRetResults=0
SID=EMP
UseCurrentSchema=1
In the sample above, the bold text indicates the important lines that would need to be changed (or at least verified). The underlined text is what is likely to be site specific.

Although we've shown the default port for Oracle, this does seem to change for a fair percentage of sites.

The _ssol26 is for Solaris; if you are on Linux or some other flavor of Unix, that string would be different. That is generically referred to as _platform, or "underscore platform", where platform refers to the operating system.

Unix Environment Variables

For Verity to index data properly, there are some environment variables that need to be set correctly: The details of setting all these variables are beyond the scope of this article.

One final warning about Unix Environment Variables

When you login to an interactive shell to setup or debug problems, you may be getting a different set of environment variables than what automatic indexing scripts or web server processes see, because you may be running a different shell. This can happen when scripts are automatically run out of crontab, or started when the system boots or by inetd.

For example, you may use C-Shell with .cshrc to do your editing, but the nightly index scripts might be running in the Bourne shell with .login. We've also see a combination of some scripts running in Bash but the system picking up defaults from an unexpected .cshrc.
[ back to top ]

Step 5: Copy Template Style Files

Make a copy of the shipped ODBC style files to customize. In a standard installation these are in k2\common\styles\vgwodbc. For our example we've copied the vgwodbc directory to a directory emp_style in our sandbox and we'll modify those style files.
[ back to top ]

Step 6: Add Fields to STYLE.SFL and STYLE.UFL

First we add new collection fields to hold the database information. The rules for adding these fields are the same as for any other K2 project. The VdkVgwKey is already in the style.ddd, in step 8 we'll associate it with EMP_NO. So we need to add one field to hold the contents of the DB column JOB. We can just make it an alias of the existing varwidth "Title" field, already defined in the shipped style.sfl. Furthermore it's convenient to give each field the same name as the column. So in the style.sfl we need only to add a new alias directive to the Title definition resulting in:
varwidth: Title _sv
          /alias = FTS_Title
          /alias = dc:Title
          /alias = vdk:Title
          /alias = JOB     # added for ODBC
The other fields required for indexing all of the EMP columns are in the style.ufl of the linked set.
[ back to top ]

Step 7: Add the ODBC DSN into ODBCMON.CFG

The DSN and some parameters that control its use are set in the DsnLimits section of odbcmon.cfg. In our example the DsnName is EMP_TABLE and we want the SecurityField to be public so we've changed the sample section of odbcmon.cfg to:
## Sample DsnLimits Section
DsnLimits:
{
  DsnName: EMP_TABLE
  # MaxConnects: 9
  SecurityField: 0
  #    QuotedId: 0
}
Note "DsnLimits:" and the curly brackets are now uncommented; the other parameters we've left at their default values.
[ back to top ]

Step 8: Edit VGWODBC.CFG

The vgwodbc.cfg is where the indexed table and columns are specified by defining elements "gwview" and "gwfield"

Gwview has two parts:

  1. The DSN
  2. an alternate name used by K2.
We've set it to "EMP_TABLE EMPVIEW" The first gwfield is the primary key EMPNO with a modifier "gwpk" meaning the VdkVgwKey will be populated with it's value. Other needed database fields should be listed also--for now we add only JOB. So the edited lines in vgwodbc.cfg are:
gwview: EMP_TABLE EMPVIEW
{
  ## At least one field definition is required for each section.
  ## Each 'gwfield' becomes available for indexing and/or
  ## mapping to a VDK field.
  ## Setting up a primary key column
  gwfield:  EMPNO
    /gwpk
  gwfield:  JOB   #collection field
    /table=EMP
    /dbcolumn = JOB  #oracle db name
}
Notice we've uncommented the opening curly bracket after the gwview: line. At the bottom of the vgwodbc.cfg there are a few lines concerning the file "freesql_1.txt" Uncomment the ones starting with "freesql:" and the next-to-last closing brace, so the last few lines appear as
     ## In ODBC gateway 5.0 and beyond an external file holds the
     ## parts of a query required to make a meaningful SQL statement.
     ## This file must be present and configured for each table. Note,
     ## this file is always assumed to be in the style directory.
      freesql: freesql_1.txt
    }
   }
   $$
In step 11 we'll create freesql_1.txt.
[ back to top ]

Step 9: Populating Fields with VGWODBC.VGW

The fields such as JOB were created to store the contents of the database columns and they need to be assigned values. This is done by copy directives to the gateway for each field/column combination added to the vgwodbc.vgw. A copy for the VdkVgwKey isn't needed, and the sample vgwodbc.vgw already contains a line for the _SECURITY collection field which needs to be left in. For JOB we add the one line "copy: JOB JOB." So the changed section of the vgwodbc.vgw looks like this:
        copy: _VgwAccessType  _SECURITY
        # oracle column --> collection field
        copy: JOB JOB
A copy statement is required for each field that is populated from the database. In the linked set we've added copy commands for all of EMP.
[ back to top ]

Step 10: Define the Virtual Document in STYLE.DFT

The style.dft is used to compose the virtual document that K2 uses for full-text search and viewing of each record. For a database it typically consists of fields printed in the virtual document.

The style.dft in the shipped vgwodbc style files has all the lines commented out except for matched curly braces, which should be left so. First we uncomment the VdkVgwKey line and using directives "field" and "constant" add JOB and some formatting. Thus the edited part of the style.dft appears as:

     ## constant keyword to add static strings and
     ## view controls to the virtual document.
     constant: " "

     ## VDK internal field.
     ## This field will be indexed as
     ## part of the document as well.
     field: VdkVgwKey
     constant: "\n"
     ## Database field
     constant: "JOB: "
     field: JOB
     constant: "\n"
The linked set style.dft has all the EMP fields.
[ back to top ]

Step 11: Define SQL Statement in FREESQL_1.TXT

When indexing, vspider submits a SQL statement to the server. The statement is contained in a text file with the default name of 'freesql_1.txt' residing in the source style directory ("emp_style" in our case). When the collection is created, the file is copied to the collection's style directory along with the other style files.

For our example, indexing all the records in EMP, the freesql_1.txt contains only the single line:

FROM EMP
Vspider prepends a SELECT to this. The complete SQL statement can be seen in the trace log. The name of this file is specified in the last line of the vgwodbc.cfg, and the line needs to be uncommented--done in step 9. The SQL statement must be on the first line of the file, a blank line at the beginning of this file results in a rather misleading error message E0-1514 " Cannot initialize table 'EMP_TABLE/EMPVIEW'... "
[ back to top ]

Step 12: Setup Database Authentication in CRED.DAT

Vspider needs the database username and password which it can read from an encrypted file whose default name is "cred.dat" created by the utility "savecred" located in the K2 bin directory.

An example command-line for our case is:

savecred -f cred.dat -t 0x4 -c user:scott  -c password:tiger -c repository:DBMS:EMP_TABLE:0x0dbc0000:0x0dbc:0
(the previous command should be entered all on one line.)

The user name is "scott" and the password is "tiger" The repository name is the same as the DNS name "EMP_TABLE" The hexadecimal characters should be entered exactly as shown. It's convenient to put this command into a shell script or batch command. Cred.dat should then be put into the source style directory used by vspider (emp_style for us), so at collection creation time the cred.dat will be copied into the collection's style directory. To use this file vspider needs the -auth2 option shown in step 13.

Note that savecred will append onto an existing cred.dat so it might save some grief in the development stage to delete an existing cred.dat before running it. Using "savecred -help" gives a usage dump. For changing the location and name of this file see the line 'CredFile: "local cred.dat" ' near the top of the odbcmon.cfg.
[ back to top ]

Step 13: Start the Indexing with the VSPIDER Command Line Tool

The minimum vspider command-line plus verbose logging is:
vspider -verbose -odbc -auth2 "emp_style\cred.dat" -collection empcoll -style emp_style
(the previous command should be entered all on one line.)

This creates the collection "empcoll" in the current working directory using the "emp_style" style files and indexes the records returned by the SQL query. The " start" option isn't needed. Note we are pointing to the auth file in the source style files. This file must be in the collection's style directory for use by the low-level ODBC gateway code, but the vspider command line can point to a copy, say, in the CWD.

You should get a fair amount of output to the console showing each record indexed due to the -verbose option.
[ back to top ]

Step 14: Check the Results with the RCVDK Command Line Tool

When vspider's finished it should give the number of records indexed. To check this and view the virtual documents you can use the utility "rcvdk" in the K2 bin directory.

For detailed information on rcvdk please see the article "Command Line K2: rcvdk" from Vol. 3 No. 4, Summer 2006.

First, at the command-line (for our collection empcoll in the CWD) enter:

rcvdk empcoll
This gives a message about opening the collection and then a prompt. Then enter "s" (without the quotes) to do a null search returning all the records in the collection; for a new collection this should be the same number as the number indexed by vspider.
RC> s (enter, by itself runs a null search, matching all docs)
(returns a document count)
RC> r (enter, to see the first screen of results)
You can also search for a specific word:
RC> s clerk (enter)
(returns a document count)
RC> r (enter, to see the first screen of results)
Then you can look at the matching documents (which may scroll by very fast).
RC> v 1 (enter)
1:              odbc://EMP_TABLE/EMPVIEW/7934
                [ odbc://EMP_TABLE/EMPVIEW/7934 ] JOB: CLERK
The document view should reflect the style.dft and the data in each record as shown by SQLPLUS.

To quit rcvdk just use the q command, followed by enter.

RC> q (enter)
[ back to top ]

Step 15: Review and Links to Style Files

In summary, the discussion above shows that while a vspider-ODBC task takes more than a few steps. But each step is rather simple if you have the right information, and it is straightforward if you take them one at a time.

There are 4 or 5 style files that need to be edited, and 2 additional files (3 on Unix) that need to be created.

File(s) Description Step
style.ufl Defines the collection's custom fields, AKA "User defined fields", "JOB" in our example. Step 6
style.sfl Defines the collection's default fields, AKA "Standard defined fields". Usually not edited, but in some cases a field you want may already be defined here, so you would not need to add it to style.ufl.
style.dft Defines the "virtual document", which is what the indexer sees, and the default view of the document. For a database the style.dft is usually just a list of fields whose values are to be shown. Step 10
vgwodbc.vgw Populates the custom fields from the database columns using a "copy" directive. Each collection field that is to receive a value from the database must be listed, except for the VdkVgwKey. Step 9
odbcmon.cfg and vgwodbc.cfg Points K2 at the desired ODBC DSN, primary key and other columns. The ODBC handle for the database to open is the DSN. The desired DSN is specified in the odbcmon.cfg file whose contents, by a "include" directive, becomes a part of the vgwodbc.cfg file. The vgwodbc.cfg then associates the DSN, primary key and columns the K2 collection fields. Steps 7 and 8
freesql_1.txt When vspider indexes it submits a SQL query which includes the text in freesql_1.txt. Step 11
cred.dat A security "auth" file created by running the K2 tool "savecred" and contains the credentials to access the database. Step 12
odbc.ini (Unix only) On Unix, this is where ODBC data sources need to be defined. On Windows this is handled in the ODSA in the Control Panel instead. Step 4b

For completeness here are the 14 style files that do not need to be modified:

  
  • style.ddd
  • style.did
  • style.ngm
  • style.pdd
  • style.prm
  •   
  • style.sid
  • style.tkm
  • style.uni
  • style.ve
  • style.wld
  •   
  • style.xfl
  • style.xml
  • style.zon
  • vgwodbc.prm
  • Below we've included links to a complete set of style files. These are a bit different than those used in the article, in that they show how to index all of the fields from Oracle's EMP sample table.

    Windows: emp_style_allfields.zip
    Zip format, CR-LF, 19.1 KB (19,595 bytes)
    Unix: emp_style_allfields.tar
    Tar format, LF, 66.0 KB (67,584 bytes)
    [ back to top ]


    Home  | Products  | Services  | Newsletter  | Resources  | About Us  | Contact Info  | Privacy Policy
    Copyright New Idea Engineering, Inc 1996 - 2008