Enterprise Search Blog

XPump™ Content Extraction Toolkit

Some data sources are beyond the reach of traditional spiders, entity extractors and ETL tools.  For those times when no generic tool will do, you need a complete content extraction and validation language.  This is not for the average corporate customer: if you just need a web spider, basic entity extraction, or fulltext search added to a database, there are plenty of other solutions we can offer you!

Here are some project requirements that might make sense for a full XPump implementation:

  • Content must be aggregated from multiple places, such as web content and a database, or from a file share and an Excel spreadsheet.
  • You need to maintain complex state between pages; the route taken to access a document contributes to its context, interpretation or metadata.
  • The structure within a document is critical, such as HTML code blocks, data within PDF tables, or PowerPoint slide boundaries.
  • You have complex data validation and normalization requirements.
  • Data extraction can happen via multiple paths, with fallback and failover logic.
  • The data itself affects how subsequent data is collected.
  • You need high performance extraction that fully leverages multiple CPUs and distributed servers.
  • You need a flexible framework that you can extend with your own Java code.
  • You need a technology partner to solve really complex data problems, not just a "vendor".

Key architectural components:

  • Clear XML-based language
  • Pure Java
  • Fully threaded with auto load balancing
  • Full Contingency and Assertion Data Flow Options
  • Supports pattern, regex and XPath matching
  • Leverages JDOM, XSLT and JDBC
  • Includes built-in PDF-to-DOM converter
  • Ask about XPump 2.0 features!
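
To give a rough feel for what "multiple paths with fallback" combined with XPath and regex matching can look like, here is a minimal sketch in plain Java. It is not XPump syntax or XPump's API, which is proprietary; it uses only standard JDK classes, and the class name FallbackExtractor, its methods, the /item/price expression and the price pattern are all hypothetical names chosen for illustration.

```java
import java.io.StringReader;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Illustrative only: a hypothetical helper, not part of XPump.
// Tries each extraction strategy in order and returns the first hit.
class FallbackExtractor {

    // Hypothetical pattern: a "price:" or "price=" label followed by a number.
    private static final Pattern PRICE =
            Pattern.compile("price[:=]\\s*([0-9]+(?:\\.[0-9]+)?)");

    // Strategy 1: evaluate an XPath expression against the document.
    static Optional<String> byXPath(String xml, String expression) {
        try {
            String value = XPathFactory.newInstance().newXPath()
                    .evaluate(expression, new InputSource(new StringReader(xml)));
            String trimmed = value.trim();
            return trimmed.isEmpty() ? Optional.empty() : Optional.of(trimmed);
        } catch (XPathExpressionException e) {
            // Unparseable input or a bad expression counts as "no result",
            // so the next strategy gets a chance to run.
            return Optional.empty();
        }
    }

    // Strategy 2: search the raw text with a regex and return capture group 1.
    static Optional<String> byRegex(String text, Pattern pattern) {
        Matcher m = pattern.matcher(text);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    // Runs the strategies in order; the first non-empty result wins.
    static Optional<String> firstMatch(
            String input, List<Function<String, Optional<String>>> strategies) {
        for (Function<String, Optional<String>> strategy : strategies) {
            Optional<String> result = strategy.apply(input);
            if (result.isPresent()) {
                return result;
            }
        }
        return Optional.empty();
    }

    // Example pipeline: structured lookup first, free-text pattern as fallback.
    static Optional<String> extractPrice(String doc) {
        List<Function<String, Optional<String>>> strategies = List.of(
                d -> byXPath(d, "/item/price"),
                d -> byRegex(d, PRICE));
        return firstMatch(doc, strategies);
    }
}
```

With this sketch, a document that carries a dedicated price element is resolved by the XPath step, while one that only mentions the price inside free text falls through to the regex step. A real project would typically add validation and normalization after each step.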

XPump is typically licensed in conjunction with NIE professional services.  Due to the complexity and proprietary nature of the toolkit, we usually don't provide eval copies, docs or samples.