XPump™ Content Extraction Toolkit

There are some data sources that traditional spiders, entity extractors and ETL tools just can't reach. When no generic tool will do, you need a complete content extraction and validation language. XPump is not for the average corporate customer: if you only need a web spider, basic entity extraction, or full-text search added to a database, there are plenty of other solutions we can offer you!

Here are some project requirements that might make sense for a full XPump implementation:

Content must be aggregated from multiple places, such as web content and a database, or from a file share and an Excel spreadsheet.
You need to maintain complex state between pages; the route taken to reach a document contributes to its context, interpretation or metadata.
The structure within a document is critical, such as HTML code blocks, data within PDF tables, or PowerPoint slide boundaries.
You have complex data validation and normalization requirements.
Data extraction can happen via multiple paths, with fallback and failover logic.
The data itself affects how subsequent data is collected.
You need high performance extraction that fully leverages multiple CPUs and distributed servers.
You need a flexible framework that you can add your own Java code to.
You need a technology partner to solve really complex data problems, not just a "vendor".
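
To make the "multiple paths, with fallback" and validation requirements concrete, here is a minimal sketch in plain Java using only the standard library. It is not XPump's XML language or API, which this page does not document; the class name `FallbackExtractor`, the `extractPrice` helper, the sample `<price>` markup and the validation pattern are all hypothetical. The sketch tries an XPath lookup first, falls back to a regex when the markup cannot be parsed, and accepts a value only if it passes an assertion-style check.

```java
import java.io.StringReader;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical illustration only; not part of XPump.
class FallbackExtractor {

    // Path 1: XPath over well-formed markup. Returns empty if parsing fails or nothing matches.
    static Optional<String> byXPath(String markup, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
            String value = XPathFactory.newInstance().newXPath().evaluate(expression, doc).trim();
            return value.isEmpty() ? Optional.empty() : Optional.of(value);
        } catch (Exception e) {
            return Optional.empty(); // malformed input: let the next path try
        }
    }

    // Path 2: regex over the raw text. The pattern must define capture group 1.
    static Optional<String> byRegex(String text, Pattern pattern) {
        Matcher m = pattern.matcher(text);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    // Try each path in order and return the first candidate that passes validation.
    static String extract(String input,
                          List<Function<String, Optional<String>>> paths,
                          Pattern valid) {
        for (Function<String, Optional<String>> path : paths) {
            Optional<String> candidate = path.apply(input);
            if (candidate.isPresent() && valid.matcher(candidate.get()).matches()) {
                return candidate.get();
            }
        }
        throw new IllegalStateException("No extraction path produced a valid value");
    }

    // Convenience wrapper for the assumed sample markup: a <price> element under <doc>.
    static String extractPrice(String input) {
        Pattern rawPrice = Pattern.compile("<price>\\s*([^<]+)");
        Pattern validPrice = Pattern.compile("\\d+\\.\\d{2}");
        List<Function<String, Optional<String>>> paths = List.of(
                s -> byXPath(s, "/doc/price"),
                s -> byRegex(s, rawPrice));
        return extract(input, paths, validPrice);
    }

    public static void main(String[] args) {
        System.out.println(extractPrice("<doc><price>19.99</price></doc>")); // XPath path
        System.out.println(extractPrice("<doc><price> 7.50"));               // regex fallback
    }
}
```

Ordering the paths from most to least structured means the stricter parser is used whenever it can be, and the validation step keeps a weaker fallback from silently returning bad data.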

Key architectural components:

Clear XML-based language
Pure Java
Fully threaded with auto load balancing
Full contingency and assertion data flow options
Supports pattern, regex and XPath matching
Leverages JDOM, XSLT and JDBC
Includes built-in PDF-to-DOM converter
Ask about XPump 2.0 features!

XPump is typically licensed in conjunction with NIE professional services. Due to the complexity and proprietary nature of the toolkit, we usually don't provide evaluation copies, documentation or samples.