Review of SMILA - New Idea Engineering

Review of SMILA: A Hybrid Open-Source / Commercial Search Framework from Europe

Open source search technologies like Lucene, Solr and Xapian appeal to anyone who wants to reduce the total cost of ownership of enterprise and customer-facing search, but current packages have received mixed acceptance in corporations where a complete solution and a useful framework are expected of any production application. SMILA, the outcome of German governmental and corporate funding, combines open source and proprietary components to become one of the first full frameworks for semantic applications, including search. This month, Eric Moore gives us an overview of the project and tells us about its strengths, weaknesses, and ambitious objectives.

SMILA is an Eclipse project that provides an extensible framework for building search applications to access unstructured information in the enterprise. It provides an integrated package based on Lucene that includes crawlers, connectors and the interfaces needed to manage it using existing infrastructure.

The main goal of SMILA is to reduce the risk of investment and IT costs by providing a standardized codebase in a common development framework that can be used to build semantic applications. Consider a company like Volkswagen: they may have 500 distinct applications running in Germany, so just maintaining connectors for those applications represents a significant cost. Building their own semantic infrastructure would certainly be a long term investment, and SMILA attempts to provide economies of scale while providing the option to use highly specialized solutions or plug-ins as needed. It also provides the opportunity for a company to reuse interfaces from internal projects that use Lucene.

In the SMILA project, not all search algorithms, plug-ins, and connectors will be open source. Its core is implemented as an open source project to build momentum and to become a standard. The German Eccenca Foundation will provide a commercial distribution of SMILA, similar to what Red Hat does for Fedora. The Eclipse IDE is a good example of how viable the combination of an open source platform, a good plug-in architecture and commercial software can be.

SMILA is an attempt to:

jointly implement and maintain a connector pool
leverage Eclipse plug-in knowledge to create a plug-in market
make semantically optimized vertical collections available

In order to do this they are providing:

an extensible service-oriented architecture (SOA) framework to access and integrate unstructured information
ready to use components (crawlers, connectors and services) to demonstrate and leverage its components
interfaces to manage and monitor the framework and its components

The SMILA project will not:

create a generic data rendering engine or business intelligence/data mining tools
create search algorithms or linguistic technologies like classification algorithms and entity extraction tools
create generic management/monitoring solutions - SMILA is meant to integrate into the existing infrastructure
create document indexing and retrieval components - SMILA integrates them

Sidebar: Background information about SMILA

SMILA is a new open source Eclipse project created by Empolis GmbH and Brox IT Solutions GmbH. The initial code came from them and DFKI (German Research Center for Artificial Intelligence), which joined the Eclipse Foundation in order to support SMILA. SMILA uses crawlers from the Aperture project, which is run by Aduna and DFKI. Aperture recently changed to a BSD license in order to support SMILA. Empolis and Brox have also created the Eccenca Foundation, which will offer a commercial distribution of SMILA with indemnification, warranties, maintenance, support etc. similar to what Red Hat does.

Funding for SMILA comes from the German government, affiliations with other European government projects, and European companies. The number of full time developers has ranged from 10 to 20, most of them provided by Brox. SMILA is also being used as a base technology for a German government project called Theseus, whose members include SAP AG, Siemens, DFKI, Fraunhofer, Empolis, and BDMI. Some of the technologies Theseus is working on are the automatic creation of metadata for audio, video, 2D and 3D picture files and the semantic processing of multimedia documents. SAP AG, for example, is using SMILA for the TEXO use case in Theseus, to research automating the full life cycle of business services using web services. So far all of the add-on providers and partners appear to be European companies such as Aduna, Arexera GmbH, Be Informed, Brox, CAS Software AG, Connexor Oy, Empolis, INTRAFIND Software AB, Intelligent Views GmbH, Inter:gator, Moresophy GmbH and Ontoprise GmbH. One test of how successful they will be is when they start getting non-European partners. However, it's still very early; they haven't had an official release yet.

Packaging

SMILA is available for both Windows and Linux, and requires Java 1.5 or later. There is no setup program; you just unpack the download and run smila.exe. The current version of SMILA includes web, file system, and JDBC crawlers, and it is easy to get working, though you cannot configure everything from a GUI. For example, you need to edit an XML file to specify the seed.

SMILA is managed using a JMX client such as JConsole (part of the JDK). You could use something like JManage to manage it in a cluster. The MBean provides a basic management interface with a standard cross-platform Java look and feel. It's functional but not polished.

How to install and start using SMILA:

Download and unpack the SMILA application from http://www.brox.de/en/products/smila.jsp.
Configure crawler rules:
1. Configure the file system crawler by setting BaseDir in configuration/org.eclipse.smila.connectivity.framework/file to a sample directory.
2. Configure the web crawler by setting Seeds in configuration/org.eclipse.smila.connectivity.framework/web to the URLs it should start with.
Run smila.exe. It will initially display "osgi>"; then after a brief delay, it should display "INFO: Starting Coyote HTTP/1.1 on http-8080".
Use JConsole to connect to the MBean user interface. The JConsole window will list two local processes. One of them will be identified as JConsole. Click on the other one and press the Connect button.
Select the MBeans tab in the console. Expand the listing for SMILA in the left pane.
Expand org.eclipse.smila.connectivity.framework.CrawlerController. Click on Operations.
Start the file system crawler by entering file in the text field next to the startCrawl button.
Press the startCrawl button.
Start the web crawler by entering web in that text field.
Press the startCrawl button.
Check the SMILA.log log file or look at the crawler's status in the console (it's under SMILA Crawlers) to verify things are working.
Point a browser to http://localhost:8080/AnyFinder/SearchForm . Select test_index and enter text in the search form.
Press the Submit button.

Click here to see a screenshot.

All of the data from the crawlers is normally merged into one index, though you can edit the XML files to use a separate index for each crawler. The 5 Minutes to Success documentation has screenshots and detailed instructions on how to install SMILA and get it working. Note that there is a bug in the current Linux implementation that prevents JConsole from connecting. You can work around that by making a remote connection to service:jmx:rmi:///jndi/rmi://localhost:1099/jmxrmi instead.
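The startCrawl operation that JConsole exposes can also be invoked from code through the standard JMX API. The following is a minimal sketch rather than SMILA's documented interface: the object name SMILA:type=CrawlerController, the CrawlerControlMBean interface and its return value are assumptions made for illustration, and the demo registers a local stand-in MBean so that it runs without a SMILA instance. Only the service URL is taken from the text above.

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.StandardMBean;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

class StartCrawlSketch {
    // Assumed object name, for illustration only; look up the real one in JConsole.
    static final String CONTROLLER_NAME = "SMILA:type=CrawlerController";

    // Hypothetical stand-in for the crawler controller's management interface.
    public interface CrawlerControlMBean {
        String startCrawl(String dataSourceId);
    }

    // Stand-in implementation that only echoes the requested data source id.
    public static class CrawlerControl implements CrawlerControlMBean {
        @Override
        public String startCrawl(String dataSourceId) {
            return "started:" + dataSourceId;
        }
    }

    // Invokes startCrawl by name, the programmatic equivalent of the JConsole button.
    static Object startCrawl(MBeanServerConnection conn, String dataSourceId) throws Exception {
        return conn.invoke(new ObjectName(CONTROLLER_NAME), "startCrawl",
                new Object[] { dataSourceId },
                new String[] { String.class.getName() });
    }

    // Opens a remote connection using the service URL quoted in the article.
    static JMXConnector connectRemote() throws Exception {
        return JMXConnectorFactory.connect(new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:1099/jmxrmi"));
    }

    // Self-contained demo: registers the stand-in locally so no SMILA instance is needed.
    static Object demo(String dataSourceId) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        ObjectName name = new ObjectName(CONTROLLER_NAME);
        if (!server.isRegistered(name)) {
            server.registerMBean(
                    new StandardMBean(new CrawlerControl(), CrawlerControlMBean.class), name);
        }
        return startCrawl(server, dataSourceId);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo("file"));
        System.out.println(demo("web"));
    }
}
```

Against a running SMILA instance you would pass connectRemote().getMBeanServerConnection() to startCrawl instead of the local platform server, after replacing the assumed object name with the one JConsole shows.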

The architecture is designed to support monitoring and managing components via JMX and SNMP. The JMX console is sufficient to get things working, but when SMILA is released you will also need an SNMP console to manage it. Many of the technologies use different logging services such as the OSGi Logging Service, the Java logging API, Apache Commons-Logging, or Log4j. In order to have a centralized log they have standardized on Log4j, since it's easy to redirect other loggers to it.
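The redirect idea can be illustrated with the Java logging API alone: a custom Handler receives every record from a logger and forwards it to one central sink. This is a sketch of the general mechanism, not SMILA's configuration. The in-memory list stands in for Log4j, and the class and logger names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

class CentralLogSketch {
    // Forwards each java.util.logging record to a single shared sink.
    // A real bridge would hand the record to Log4j here instead of a list.
    static class ForwardingHandler extends Handler {
        private final List<String> sink;

        ForwardingHandler(List<String> sink) {
            this.sink = sink;
        }

        @Override
        public void publish(LogRecord record) {
            if (!isLoggable(record)) {
                return;
            }
            sink.add(record.getLevel() + " " + record.getLoggerName() + ": " + record.getMessage());
        }

        @Override
        public void flush() {
            // Nothing is buffered in this sketch.
        }

        @Override
        public void close() {
            // No resources to release in this sketch.
        }
    }

    // Logs one message through a hypothetical component logger and returns the central sink.
    static List<String> demo() {
        List<String> central = new ArrayList<>();
        Logger logger = Logger.getLogger("sketch.crawler");
        logger.setUseParentHandlers(false); // keep the record out of the default console handler
        Handler handler = new ForwardingHandler(central);
        logger.addHandler(handler);
        logger.info("crawl started");
        logger.removeHandler(handler);
        return central;
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```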

Searches are done via a browser. XSL template examples are provided for customizing the search form and search results. The search API has not been finalized yet, but they intend to provide both a low level (returns a description of the actual objects that were given to the Business Process Execution Language (BPEL) workflow processor) and a high level API that can be accessed by Web, Java and .NET clients.

There will be separate SMILA distributions for running on one or more PCs, a cluster, or a grid. Since they use an OSGi-based build process, it would also be possible to build one with a small enough footprint to run on a handheld device. The current installation only takes 56MB.

Sidebar: Side effects of using an open source SNMP implementation

What the SNMP console actually calls can get complicated, because it depends on which open source SNMP implementations are available for each language. This is an issue if you develop a component or try to integrate one not designed for SMILA. The current plan appears to be that every component that you can manage or monitor has an agent that supports JMX. For Java components the agent would be part of the component and use an open source SNMP implementation for Java called SNMP4J to call the MBean using the JMX server. SNMP4J doesn't have a wizard to generate the MIBs, so they would have to be created manually. Non-Java components would use a separate Java agent (which also uses SNMP4J and JMX) and have the MBean call the component using a protocol such as CORBA or JNI. However, if it's a C++ component they could use the open source AGENT++ library instead. That simplifies things since the agent would be part of the C++ component, and the SNMP console could call it directly.

They are considering including AdventNet SNMP Adaptor for JMX in Eccenca. It's a commercial SNMP to JMX adaptor that includes a configuration wizard that automatically generates MIBs.

Sidebar: Open source search engines

Open source search engines based on Lucene such as SOLR and Nutch are becoming more widely used by enterprises. There is a lot of interest in SOLR but it does not provide a crawler. Nutch includes a crawler, but it can be difficult to install and get working because it is script based, the documentation on which parameters need to be set is poor, and you sometimes need to experiment with different builds to find one that works with your configuration. FLAX is another possibility (though it's based on Xapian), but it doesn't provide a crawler. SMILA is based on Lucene and provides a crawler. However, it's not yet another "do it yourself" search project. It attempts to provide the missing functionality needed to make an open source project a true competitor to commercial products by providing an integrated package with everything needed to work with unstructured information.

Semantic applications

A semantic application is frequently defined in terms of implementing the Semantic Web, or smart agents that can analyze all of the data on the web. While this kind of definition involves much hand waving and smoke and mirrors, SMILA is clearly intended to deal with unstructured data in the enterprise - but it’s unclear how much semantic support it will provide. Right now the "SeMantic" part of its name is not justified. It looks like version 1.0 would provide the base layers of a Semantic Web Stack plus support for "SPARQL Protocol and RDF Query Language" (SPARQL) and ontologies. It’s not clear whether that means they will support the Web Ontology Language (OWL) API, or whether that just means ontologies will be based on existing standards such as Simple Knowledge Organisation Systems (SKOS), Extensible Metadata Platform (XMP), Dublin Core Metadata Initiative etc. Either way, that should make it useful for developing large scale isolated applications that do text mining and information analysis, plus research on how to implement higher layers of the Semantic Web Stack.

The first commercial SMILA compatible Resource Description Framework (RDF) add-on should be available in Q1 2009.

Status

It makes no sense to talk about security, relevance ranking, details of the search look and feel etc. at this time, since SMILA is currently too rudimentary to evaluate as a production-quality search engine. You can search web sites under both Windows and Linux, but it doesn't even turn the URL into a clickable link in the search results yet. The project started in June 2008 and was originally scheduled to have a version 1.0 release candidate by December 31, 2008. The developers were ready to release the first milestone last fall but there have been major delays due to the Eclipse IP process. That’s one of the reasons why downloads are currently available at a web site provided by Brox rather than Eclipse.

Clusters aren't supported until 1.0, but the mailing lists have mentioned performance results for a two node queue based cluster. It's not possible to use all of the crawlers, since schemas are currently only provided for the file system, JDBC and web crawlers.

Sidebar: Current schedule

March 2009 version 0.5 M0

Basic architecture implemented
Simple search application available

June 2009 version 0.5 M1

More data sources supported
General configuration and management completed

September 2009 version 1.0

Cluster support
Search APIs
Security
Ontology service (used to reason about the properties of a domain and may be used to define it)
Advanced incremental update

Click here to view a diagram of the SMILA architecture.

Basic Technologies

OSGi/SCA - provides a component based architecture and transparent communication to other hosts
JMS - provides scalability
BPEL - annotates information and calls services that refer to a record
JMX and SNMP - used to monitor/manage the application
XML - is used for data storage

Dependencies on other Eclipse projects

BPEL - can specify business processes as a set of interactions between web services
Birt - a reporting system for web applications
Equinox and other Eclipse run-time projects
g-Eclipse - a unified way to access grid and cloud resources
Higgins - provides two identity provider web services
STP - a SOA tools platform
TPTP - a test and performance tools platform

Crawlers

Web sites
File system
JDBC
WebDAV
Outlook
Mbox (provided by Aperture but not mentioned as being in SMILA yet)
IMAP
iCal
Thunderbird address books
Apple address books

Aperture appears to be working on web site specific crawlers that would use a Bibsonomy, del.icio.us, Flickr and iPhoto API.

Free connectors

Enhanced web crawler that can extract encryption and authentication information, supports Flash and JavaScript (soon)
File system, including NTFS user rights (soon)
Database
Query XML storage via XQJ
Index CSV files

Commercial connectors

CSV - available
EMC Documentum - available
INTERSHOP Enfinity - available
Lotus Domino - available
Microsoft Office SharePoint Server - available
XML files - available
Microsoft Exchange
LDAP
Software AG Tamino XML Server

The Outlook crawler requires Outlook and crawls Outlook files using a Java to COM bridge called Jacob. It looks like you could use it with the file system connector to index the contents of a .PST file (appointments, calendars, contacts, mail, notes etc.) on a PC or file share, but would need to use the commercial Microsoft Exchange connector if the data was stored on a Microsoft Exchange server.

File formats

Plain text
HTML, XHTML, XML
PDF (Portable Document Format)
Microsoft Office and Microsoft Works: Word, Excel, PowerPoint, Visio, Publisher
RTF (Rich Text Format)
MIME (message/rfc822 and message/news)
.eml files (email messages)
Corel WordPerfect, Quattro, Presentations
OpenOffice 1.x: Writer, Calc, Impress, Draw
OpenDocument: OpenOffice 2.x, StarOffice 8.x

Sidebar: Terminology confusion

A crawler can be configured to revisit a web page, but that has some serious limitations. A connector works around the limitations of the crawler, and is used to keep the search engine index updated all of the time. However, SMILA and the Aperture project use different terminology. SMILA defines both crawlers and connectors as connectors, of different types. It calls a "connector" an agent. The Aperture project provides crawlers for SMILA. They are called using a CrawlerHandlerBase class. Depending upon how it's implemented, that class could be a simple management shell that also calls an extractor to get the full text and/or metadata for a file format, or it could be a connector. This matters because when SMILA states that it provides a connector, it's not always clear whether that's a crawler or a connector.

Search criteria

SMILA uses Lucene but doesn't appear to support the full Lucene search criteria. If you enter keywords it will implicitly AND them. Wildcard, fuzzy and proximity searches also work, but Boolean operators don't. This is probably just a configuration-specific bug, since there is no obvious reason for them to restrict the functionality.
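To make the implicit AND behavior concrete, here is a small illustrative sketch over an in-memory list of documents. It does not use Lucene or SMILA, and the class and method names are hypothetical. It only contrasts requiring every keyword (what the search form appears to do) with accepting any keyword (what an explicit OR would do if Boolean operators worked).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

class ImplicitAndSketch {
    // Splits text into a set of lower-cased word tokens.
    static Set<String> terms(String text) {
        return Arrays.stream(text.toLowerCase(Locale.ROOT).split("\\W+"))
                .filter(t -> !t.isEmpty())
                .collect(Collectors.toSet());
    }

    // Implicit AND: a document matches only if it contains every query keyword.
    static List<String> searchAll(List<String> docs, String query) {
        Set<String> wanted = terms(query);
        return docs.stream()
                .filter(d -> terms(d).containsAll(wanted))
                .collect(Collectors.toList());
    }

    // OR semantics for comparison: a document matches if it contains any query keyword.
    static List<String> searchAny(List<String> docs, String query) {
        Set<String> wanted = terms(query);
        return docs.stream()
                .filter(d -> terms(d).stream().anyMatch(wanted::contains))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "open source search", "commercial search engine", "open standards");
        System.out.println(searchAll(docs, "open search"));
        System.out.println(searchAny(docs, "open search"));
    }
}
```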

Click here to view a screenshot.

Customizing

SMILA's architecture is very componentized. For example you can:

Add a BPEL service (add a web service)
Add data source connectors (crawlers/agents)
Replace core components
Create your own distribution or embed SMILA in another application

The developers claim a simple crawler can be added in about one hour.

The simplest way to add new functionality is to call a web service. However, the DOM object that the BPEL workflow would use defaults to only having record ids, which prevents it from accessing all of the data. That can be worked around by adding some filter rules, but the recommended way is to create a pipelet (a reusable POJO component) that runs in the same OSGi runtime as the BPEL workflow engine. See How to integrate a component in SMILA for more information.
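The appeal of a pipelet is that it is a plain Java object with direct access to the record, rather than a remote service that only sees record ids. The sketch below illustrates that idea under loud assumptions: the Pipelet interface, the map-based record and the attribute names are all hypothetical and are not SMILA's actual pipelet API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class PipeletSketch {
    // Hypothetical contract for illustration; not SMILA's actual pipelet interface.
    interface Pipelet {
        void process(Map<String, String> record);
    }

    // A plain Java object that reads the record's content and annotates it in place.
    static class WordCountPipelet implements Pipelet {
        @Override
        public void process(Map<String, String> record) {
            String text = record.getOrDefault("content", "").trim();
            int words = text.isEmpty() ? 0 : text.split("\\s+").length;
            record.put("wordCount", Integer.toString(words));
        }
    }

    // Builds a sample record (assumed attribute names) and runs the pipelet on it.
    static Map<String, String> demo() {
        Map<String, String> record = new LinkedHashMap<>();
        record.put("id", "file:/tmp/a.txt");
        record.put("content", "SMILA integrates crawlers and connectors");
        new WordCountPipelet().process(record);
        return record;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Because the component runs in the same process as the workflow in this sketch, it can read the full content attribute directly, which is the access a web service limited to record ids would lack.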

Summary

In some ways SOLR and SMILA are opposites, despite both being open source enterprise search projects based on Lucene. SOLR has a robust community and is mature, but it has no integrated crawler and is inherently tied to Lucene. SMILA has no community, provides a packaged solution that includes crawlers and connectors, provides a vendor neutral way to search unstructured data, and uses an extensible framework that is so componentized that it might be possible to replace Lucene.

It’s too early to tell how useful or successful SMILA will be. If it becomes the standard that its backers hope, it would create significant competition for companies selling solutions based on expensive high-margin search technologies, and create a new market. If it doesn't do much more than survive in its current niche, that would still provide an open source enterprise search engine with integrated crawlers, good packaging and commercial backing for its continued development and maintenance.

The ability to buy a customized, installed and configured search application from an open source repackager has been creating pressure on the prices of traditional vendors. SMILA should create additional pressure, though its impact will depend upon how popular it becomes.

Resources

There are some additional resources that are useful in learning more about SMILA. A good starting point would be to browse the SMILA wiki and then the Eccenca Foundation blog. Many of the SMILA specifications are technical proposals and/or discussions rather than actual specifications, but they're starting to get more frequently updated.