Autonomy K2 v6: Indexing an ODBC Database
By Clinton Allen, New Idea Engineering Inc.
-
Volume 4 Number 2 - April 2007
This month Clinton gives us a Quick Start for building a K2 collection
with the ODBC driver and Vspider.
K2 is currently sold by Autonomy, though some people still refer to it
as Verity K2.
Step 1: Introduction / Understanding the Process
Indexing a database using ODBC and vspider requires more
configuration than most tasks using vspider, and the documentation
can be confusing. Here we provide an example to help get something
running and to serve as base for further development. To keep it
simple we focus on indexing records from the sample database
table EMP provided by Oracle.
We will only need to make minimum changes to
the sample K2 ODBC style files, using default values where ever possible.
In this article we're only going show how to index
one text column in the table; of course we will also have to index
one other column to serve as the primary key.
At the end of the article we've included a link to download the full set of
edited style files configured to index all of the columns of EMP.
Although the steps listed below may look complex, they actually
break down into a few logical tasks:
- Configure ODBC to access your database
- Tell K2 which database to connect to, and which fields to index
- Have K2 vspider fetch the records and index them
- Use K2 rcvdk to do a test search, to verify that the data was indexed
Quick Links / Summary of Steps:
Two other points to keep in mind:
- There is no single place where you enter an entire SQL
statement. K2/vspider builds up a SQL statement automatically,
from within its own logic, and using some snippets of SQL
that you can provide in various files.
- When indexing with K2 and vspider, we've seen it output a lot of errors
in log files that look scary, but which ultimately can be ignored.
Our point is, check the final results of the index with the K2
search tools. If they look correct, don't worry as much about
the errors in the log file. In particular, we believe some of these
errors are caused by gaps in non-sequential primary keys.
Learn to interpret the log file correctly, and separate out the
"common" one from the truly unusual ones.
[ back to top ]
[ back to top ]
Step 2: Information You Will Need
There are several pieces of information needed to configure K2
and ODBC indexing.
- You'll need the DSN (Database Source Name).
In this article the DSN is "EMP_TABLE"
- If a DSN has not been configured, then you'd need to
define it.
To define a DSN you would need to
know the Oracle server's TCP/IP name or address,
and the Oracle SID for the demo database.
- The name of the table that you want to index. In our case EMP.
- The names and types of the columns in the table,
including the table's primary key which will
map to K2's primary key (VdkVgwKey).
For the EMP table the primary key EMP_NO.
We'll index one text field, JOB, from the EMP table.
- Accessing a database requires a username and password.
For a default Oracle installation the demo username is
SCOTT, with password TIGER.
These will need to go into an encrypted security file created
in step 12.
[ back to top ]
[ back to top ]
Step 3: Tools
There are several useful tools for obtaining this information and
trouble shooting.
- You'll need a plain text editor for editing the style files.
For example, you can use Notepad on Windows.
On Unix/Linux you can use vi or Emacs.
- The Oracle command-line client SQLPLUS is handy for
verifying the connection to the database.
If it isn't installed on the machine running K2, then you
may need to run it on the server with the Oracle database
or from a PC with the Oracle client tools installed.
Once connected you can also examine the tables and columns
targeted for indexing.
You can also compare the results of a SQL query to the one
submitted by vspider.
- On Windows you will use the ODBC Data Source Administrator (ODSA)
to configure and test the ODBC connection.
The ODSA has a "test a connection" button to check
that the ODBC credentials and drivers
On other operating systems such as Solaris
you will need to edit the odbc.ini file. (discussed elsewhere)
This is how ODBC DSNs are mapped into database specific settings,
such as the Oracle SID, port numbers and other
client-server interaction and drivers.
- The Oracle Enterprise Manager (OEM) is used to examine and change
Oracle system settings.
This is a graphical tool and must be run on a windowing
environment such as Windows, X-Windows or on the Mac.
If OEM is not installed on the machine running K2, or if
the K2 machine is only accessed in a character mode such as
via telnet, it can be run from another machine where Oracle is
installed.
The OEM allows examination of the table and columns schema and
contents. If for instance the spelling of a column name is wrong than
the database can't be opened, and vspider gives a rather generic error
which can also be caused by the wrong username or DNS.
Also the trace log should be activated in the ODSA, under the Tracing
tab, or in the odbc.ini file. This log is useful for showing the complete SQL
query and troubleshooting problems with vspider opening the database.
[ back to top ]
[ back to top ]
Step 4: Defining ODBC Data Sources
Step 4a: On Windows
As we mentioned, on Windows there is a tool called ODSA, the ODBC Data
Source Administrator.
This can be found under:
- Start
--> Control Panel
--> Administrative Tools
--> Data Sources (ODBC)
Fill in the blanks from the information you gathered in Step 2.
The ODSA has a "test a connection" button to check the ODBC
credentials and drivers,
and it allows you to change the SID, port numbers and other settings.
On other operating systems such as Solaris
this information is set in the odbc.ini file, which is discussed
in the next section.
Step 4b: On Unix
On Unix you will need to define your data source in the odbc.ini
file, and then setup environment variables to point to it.
Sample Oracle ODBC INI file:
[EMP_TABLE]
Driver=/verity/install/path/k2/k2611-01-prd/k2/_ssol26/odbc/lib/VRora20.so
Description=DataDirect 5.0 Oracle Wire Protocol
ApplicationUsingThreads=1
ArraySize=60000
CachedCursorLimit=32
CachedDescLimit=0
CatalogIncludesSynonyms=1
CatalogOptions=0
DefaultLongDataBuffLen=1024
DescribeAtPrepare=0
EnableDescribeParam=0
EnableNcharSupport=0
EnableScrollableCursors=1
EnableStaticCursorsForLongData=0
EnableTimestampWithTimeZone=0
HostName=oracle-machine-name-or-IP-address
LocalTimeZoneOffset=
LockTimeOut=-1
LogonID=SCOTT
password=TIGER
PortNumber=1521
ProcedureRetResults=0
SID=EMP
UseCurrentSchema=1
In the sample above, the bold text indicates the important lines
that would need to be changed (or at least verified).
The underlined text is what is likely to be site specific.
Although we've shown the default port for Oracle, this does seem to change
for a fair percentage of sites.
The _ssol26 is for Solaris; if you are on Linux or some other flavor
of Unix, that string would be different. That is generically referred
to as _platform, or "underscore platform", where platform
refers to the operating system.
Unix Environment Variables
For Verity to index data properly, there are some environment variables
that need to be set correctly:
- PATH
- LD_LIBRARY_PATH
- ODBCINI
- - - Some other variables that you'll often need - - -
- VDKHOME
- ORACLE_HOME
- JAVA_HOME
- CLASSPATH
The details of setting all these variables are beyond the scope of this
article.
One final warning about Unix Environment Variables
When you login to an interactive shell to setup or debug problems,
you may be getting a different set of environment variables than what
automatic indexing scripts or web server processes see, because you
may be running a different shell.
This can happen when scripts are automatically run out of crontab,
or started when the system boots or by inetd.
For example, you may use C-Shell with .cshrc to do your editing,
but the nightly index scripts might be running
in the Bourne shell with .login. We've also see a combination of
some scripts running in Bash but the system picking up defaults from
an unexpected .cshrc.
[ back to top ]
Step 5: Copy Template Style Files
Make a copy of the shipped ODBC style files to customize. In a
standard installation these are in k2\common\styles\vgwodbc. For
our example we've copied the vgwodbc directory to a directory
emp_style in our sandbox and we'll modify those style files.
[ back to top ]
Step 6: Add Fields to STYLE.SFL and STYLE.UFL
First we add new collection fields to hold the database information.
The rules for adding these fields are the same as for any other K2
project. The VdkVgwKey is already
in the style.ddd, in step 8 we'll associate it with EMP_NO. So we need
to add one field to hold the contents of the DB column JOB. We can
just make it an alias of the existing varwidth "Title" field, already
defined in the shipped style.sfl. Furthermore it's convenient to give
each field the same name as the column. So in the style.sfl we need
only to add a new alias directive to the Title definition resulting
in:
varwidth: Title _sv
/alias = FTS_Title
/alias = dc:Title
/alias = vdk:Title
/alias = JOB # added for ODBC
The other fields required for indexing all of the EMP columns are in
the style.ufl of the linked set.
[ back to top ]
Step 7: Add the ODBC DSN into ODBCMON.CFG
The DSN and some parameters that control its use are set
in the DsnLimits section of odbcmon.cfg. In our example the DsnName is
EMP_TABLE and we want the SecurityField to be public so we've changed
the sample section of odbcmon.cfg to:
## Sample DsnLimits Section
DsnLimits:
{
DsnName: EMP_TABLE
# MaxConnects: 9
SecurityField: 0
# QuotedId: 0
}
Note "DsnLimits:" and the curly brackets are now uncommented; the
other parameters we've left at their default values.
[ back to top ]
Step 8: Edit VGWODBC.CFG
The vgwodbc.cfg is where the indexed table and columns are specified
by defining elements "gwview" and "gwfield"
Gwview has two parts:
- The DSN
- an alternate name used by K2.
We've set it to "EMP_TABLE EMPVIEW" The
first gwfield is the primary key EMPNO with a modifier "gwpk"
meaning the VdkVgwKey will be populated with it's value. Other needed
database fields should be listed also--for now we add only JOB. So
the edited lines in vgwodbc.cfg are:
gwview: EMP_TABLE EMPVIEW
{
## At least one field definition is required for each section.
## Each 'gwfield' becomes available for indexing and/or
## mapping to a VDK field.
## Setting up a primary key column
gwfield: EMPNO
/gwpk
gwfield: JOB #collection field
/table=EMP
/dbcolumn = JOB #oracle db name
}
Notice we've uncommented the opening curly bracket after the gwview:
line. At the bottom of the vgwodbc.cfg there are a few lines
concerning the file "freesql_1.txt" Uncomment the ones starting with
"freesql:" and the next-to-last closing brace, so the last few lines
appear as
## In ODBC gateway 5.0 and beyond an external file holds the
## parts of a query required to make a meaningful SQL statement.
## This file must be present and configured for each table. Note,
## this file is always assumed to be in the style directory.
freesql: freesql_1.txt
}
}
$$
In step 11 we'll create freesql_1.txt.
[ back to top ]
Step 9: Populating Fields with VGWODBC.VGW
The fields such as JOB were created to store the contents of the
database columns and they need to be assigned values. This is done
by copy directives to the gateway for each field/column combination
added to the vgwodbc.vgw. A copy for the VdkVgwKey isn't needed, and
the sample vgwodbc.vgw already contains a line for the _SECURITY
collection field which needs to be left in. For JOB we add the one
line "copy: JOB JOB." So the changed section of the vgwodbc.vgw
looks like this:
copy: _VgwAccessType _SECURITY
# oracle column --> collection field
copy: JOB JOB
A copy statement is required for each field that is populated from the
database. In the linked set we've added copy commands for all of EMP.
[ back to top ]
Step 10: Define the Virtual Document in STYLE.DFT
The style.dft is used to compose the virtual document
that K2 uses for full-text search and viewing of each record. For a
database it typically consists of fields printed in the virtual
document.
The style.dft in the shipped vgwodbc style files has all the lines
commented out except for matched curly braces, which should be left
so. First we uncomment the VdkVgwKey line and using directives "field"
and "constant" add JOB and some formatting. Thus the edited part of
the style.dft appears as:
## constant keyword to add static strings and
## view controls to the virtual document.
constant: " "
## VDK internal field.
## This field will be indexed as
## part of the document as well.
field: VdkVgwKey
constant: "\n"
## Database field
constant: "JOB: "
field: JOB
constant: "\n"
The linked set style.dft has all the EMP fields.
[ back to top ]
Step 11: Define SQL Statement in FREESQL_1.TXT
When indexing, vspider submits a SQL statement to the server. The
statement is contained in a text file with the default name of
'freesql_1.txt' residing in the source style directory ("emp_style" in
our case). When the collection is created, the file is copied to the
collection's style directory along with the other style files.
For our example, indexing all the records in EMP, the freesql_1.txt
contains only the single line:
FROM EMP
Vspider prepends a SELECT to this. The complete SQL statement
can be seen in the trace log. The name of this file is specified in
the last line of the vgwodbc.cfg, and the line needs to be
uncommented--done in step 9. The SQL statement must be on the first
line of the file, a blank line at the beginning of this
file results in a rather misleading error message E0-1514
" Cannot initialize table 'EMP_TABLE/EMPVIEW'... "
[ back to top ]
Step 12: Setup Database Authentication in CRED.DAT
Vspider needs the database username and password which it can read
from an encrypted file whose default name is "cred.dat" created by the
utility "savecred" located in the K2 bin directory.
An example command-line for our case is:
savecred -f cred.dat -t 0x4 -c user:scott -c password:tiger -c repository:DBMS:EMP_TABLE:0x0dbc0000:0x0dbc:0
(the previous command should be entered all on one line.)
The user name is "scott" and the password is
"tiger" The repository name is the same as the DNS name
"EMP_TABLE"
The hexadecimal characters should be entered exactly as shown. It's
convenient to put this command into a shell script or batch command.
Cred.dat should then be put into the source style directory used by
vspider (emp_style for us), so at collection creation time the
cred.dat will be copied into the collection's style directory. To use
this file vspider needs the -auth2 option shown in step 13.
Note that savecred will append onto an existing cred.dat so it might
save some grief in the development stage to delete an
existing cred.dat before running it. Using "savecred -help"
gives a usage dump. For changing the location and name of this file
see the line 'CredFile: "local cred.dat" ' near the top of the
odbcmon.cfg.
[ back to top ]
Step 13: Start the Indexing with the VSPIDER Command Line Tool
The minimum vspider command-line plus verbose logging is:
vspider -verbose -odbc -auth2 "emp_style\cred.dat" -collection empcoll -style emp_style
(the previous command should be entered all on one line.)
This creates the collection "empcoll" in the current working
directory using the "emp_style" style files and indexes the records
returned by the SQL query. The " start" option isn't needed. Note
we are pointing to the auth file in the source style files. This file
must be in the collection's style directory for use by the low-level
ODBC gateway code, but the vspider command line can point to a copy,
say, in the CWD.
You should get a fair amount of output to the console showing each
record indexed due to the -verbose option.
[ back to top ]
Step 14: Check the Results with the RCVDK Command Line Tool
When vspider's finished it should give the number of records indexed.
To check this and view the virtual documents you can use the
utility "rcvdk" in the K2 bin directory.
For detailed information on rcvdk please see the article
"Command Line K2: rcvdk" from Vol. 3 No. 4, Summer 2006.
First, at the command-line (for our collection empcoll in the CWD)
enter:
rcvdk empcoll
This gives a message about opening the collection and then a
prompt. Then enter "s" (without the quotes) to do a null
search returning all the records in the collection; for a new
collection this should be the same number as the number indexed by vspider.
RC> s (enter, by itself runs a null search, matching all docs)
(returns a document count)
RC> r (enter, to see the first screen of results)
You can also search for a specific word:
RC> s clerk (enter)
(returns a document count)
RC> r (enter, to see the first screen of results)
Then you can look at the matching documents (which may scroll by
very fast).
RC> v 1 (enter)
1: odbc://EMP_TABLE/EMPVIEW/7934
[ odbc://EMP_TABLE/EMPVIEW/7934 ] JOB: CLERK
The document view should reflect the style.dft and the data in each
record as shown by SQLPLUS.
To quit rcvdk just use the q command, followed by enter.
RC> q (enter)
[ back to top ]
Step 15: Review and Links to Style Files
In summary, the discussion above shows that while a vspider-ODBC
task takes more than a few steps.
But each step is rather simple if you have the right information,
and it is straightforward if you take them one at a time.
There are 4 or 5 style files that need to
be edited,
and 2 additional files (3 on Unix) that need to be created.
| File(s) |
Description |
Step |
|
style.ufl
|
Defines the collection's custom fields, AKA "User defined fields",
"JOB" in our example.
|
Step 6
|
|
style.sfl
|
Defines the collection's default fields, AKA
"Standard defined fields".
Usually not edited, but
in some cases a field you want may already be defined here,
so you would not need to add it to style.ufl.
|
|
style.dft
|
Defines the "virtual document", which is what the indexer sees,
and the default view of the document.
For a database the style.dft is usually just a list of fields whose
values are to be shown.
|
Step 10
|
|
vgwodbc.vgw
|
Populates the custom fields from the database columns using a
"copy" directive.
Each collection field that is to
receive a value from the database must be listed, except for the
VdkVgwKey.
|
Step 9
|
|
odbcmon.cfg
and
vgwodbc.cfg
|
Points K2 at the desired ODBC DSN, primary key and other columns.
The ODBC handle for the database to open is the DSN. The
desired DSN is specified in the odbcmon.cfg file whose contents, by
a "include" directive, becomes a part of the vgwodbc.cfg file.
The vgwodbc.cfg then associates the DSN, primary key and columns the
K2 collection fields.
|
Steps
7
and
8
|
|
freesql_1.txt
|
When vspider indexes it submits a SQL query which includes the
text in freesql_1.txt.
|
Step 11
|
|
cred.dat
|
A security "auth" file created by running the
K2 tool "savecred" and
contains the credentials to access the database.
|
Step 12
|
|
odbc.ini
(Unix only)
|
On Unix, this is where ODBC data sources need to be defined.
On Windows this is handled in the ODSA in the Control Panel instead.
|
Step 4b
|
For completeness
here are the 14 style files
that do not need to be modified:
| |
style.ddd
style.did
style.ngm
style.pdd
style.prm
|
|
style.sid
style.tkm
style.uni
style.ve
style.wld
|
|
style.xfl
style.xml
style.zon
vgwodbc.prm
|
Below we've included links to a complete set of style files.
These are a bit different than those used in the article, in that they
show how to index all of the fields from Oracle's EMP sample
table.
- Windows: emp_style_allfields.zip
- Zip format, CR-LF, 19.1 KB (19,595 bytes)
- Unix: emp_style_allfields.tar
- Tar format, LF, 66.0 KB (67,584 bytes)
[ back to top ]