Help with Nutch IOException error - Ask Dr. Search

Ask Doctor Search: Mysterious Nutch Error, IOException - Job failed!

A reader asks:

Dear Dr. Search,

I'm getting a really weird error from Nutch (the open source web spider based on Lucene and Hadoop) and I'm really stumped. I've seen some references to this error online, but nothing definitive.

At the end of the main log file it's going along fine and then I see:

LinkDb: done
Indexer: starting
Indexer: linkdb: nutch_index/linkdb
Indexer: adding segment: nutch_index/segments/20081118190742
Optimizing index.
Indexer: done
Dedup: starting
Dedup: adding indexes in: nutch_index/indexes
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
        at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)

I dug a little deeper and found this in the Hadoop log file:

end of nutch/logs/hadoop.log
2008-11-18 19:07:51,371 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2008-11-18 19:07:51,464 INFO  indexer.Indexer - Optimizing index.
2008-11-18 19:07:52,057 INFO  indexer.Indexer - Indexer: done
2008-11-18 19:07:52,057 INFO  indexer.DeleteDuplicates - Dedup: starting
2008-11-18 19:07:52,073 INFO  indexer.DeleteDuplicates - Dedup: adding indexes in: nutch_index/indexes
2008-11-18 19:07:52,198 WARN  mapred.LocalJobRunner - job_6q9lkk
java.lang.ArrayIndexOutOfBoundsException: -1
        at org.apache.lucene.index.MultiReader.isDeleted(MultiReader.java:113)
        at org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:176)
        at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)

What the heck is going on?

Yours truly,
Nutch Dawg :-)

Dr. Search Responds

Dear Nutch Dawg,

I'm not going to be able to give you a complete answer here, but I can at least point you in the right direction.

In this case Nutch's "IOException: Job failed!" and the "ArrayIndexOutOfBoundsException" in the Hadoop log are just their way of saying "no documents were indexed". And failure to index documents could be caused by a failure to start up correctly at all, so in a broader sense, this really just means "there was a problem" - and heck, you already knew that!

To be fair, this is a known bug and has been fixed, but not in a precompiled version you can just download. The last precompiled packaged version (as of this writing) is Nutch 0.9, which has this bug; you'd need to download more recent source code and compile it yourself to fix the error message.

But to be clear, fixing the error message would NOT fix the underlying cause. In other words, if you downloaded the most recent code and compiled it, all you'd achieve is a clearer error message, something like "no web pages indexed" - hardly worth the trouble. You'd still need to figure out why no documents are being indexed or why Nutch isn't initializing correctly, so let's just move on and figure out why!

I don't have time this week to go through every step of debugging, but I can give you a checklist of things to try. For the details, you'll have to look online like we do.

Nutch 0.7 vs. 0.9 - Yes, it Does Matter!

And an important note about searching for Nutch information online: *** Always include the version number 0.9 in your search. ***

Nutch 0.7 was hugely popular, at least in the Google search engine's opinion, and it's the version of Nutch information you tend to get by default. But there was a major change between Nutch 0.7 and 0.8 / 0.9, and most Nutch 0.7 scripts and code will not work in versions 0.8 and 0.9!

Go ahead and use version 0.9, but just remember to include nutch 0.9 in your searches.

Nutch Troubleshooting Checklist

Here's my top checklist for debugging this, and a few notes on some of them. If you have other troubleshooting ideas to add, please send them along to us!

Nutch Initialization Problems Checklist:

i-0: general config / init info (below)
i-1: Not finding YOUR config files, and using its own instead.
i-2: You forgot to identify yourself (your instance of the Nutch spider) in your config files.
i-3: If using custom plugins, they're not working correctly.
i-4: Some other config error...

Nutch Spidering Problems Checklist:

s-0: general spidering tips (below)
s-1: There's something wrong with your seed page(s)
s-2: There's something wrong with your include or exclude patterns
s-3: Your seeds or intermediate pages have redirects, possibly subtle, that are counting as "levels".
s-4: Java runtime error, such as memory.
s-5: Mime-type issues
s-6: Security issues
s-7: robots.txt issues
s-8: Other spidering problems...

A Few More Details

Here are a few more notes on some of these items.

i-0: General config / init info

First off, Nutch expects you to run the spider from its main directory, and you need to have a conf subdirectory underneath with your changes. BUT Nutch comes with a default configuration pre-compiled into the code, and another default configuration in the conf directory you unpack - so there are TWO other places for Nutch to get a wrong config file from, and not warn you! And those config files will not have some of the required variables - this is by design - they want you to think about what you're doing!

Trust us, if you suddenly start seeing XML parsing errors, you've probably made some PROGRESS. Nutch may be reading your intended XML config files for the very first time, and you've just got some syntax items to clean up. In fact, if you're not sure Nutch is seeing your config, try adding a simple XML typo to your nutch-site.xml - like just add a <foobar> tag, with no closing </foobar> tag - that should really annoy the XML parser - but if it doesn't complain, then it's not seeing it!

Important files in nutch-0.9/conf include:

nutch-site.xml: This is your main config file. (Note: Do not change nutch-default.xml.)
crawl-urlfilter.txt and regex-urlfilter.txt: These control what gets spidered and indexed and what doesn't. What's the difference between them? Long story; we suggest you check both of them.

i-1: Not finding YOUR config files, and using its own instead.

Make sure Nutch is seeing your config, and not its default. We already suggested making an intentional typo in conf/nutch-site.xml and verifying whether you get an XML parse error or not.

If you don't, this is likely a class path issue, possibly related to what directory you're in when you run your nutch script.

You can set your classpath on the java command line with -cp or -classpath, or you can set it in the CLASSPATH environment variable (one word, no underscore).

Make sure the actual conf directory is in your Java class path, and make sure that it is FIRST, before the Nutch jar files. This is to avoid picking the precompiled default config file stored away in those jar files.
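
As a rough sanity check of that ordering, here is a minimal Python sketch. The function name and the sample paths in the usage note are made up for illustration; it only inspects a classpath string you hand it and knows nothing about how your nutch script actually builds its classpath.

```python
import os

def conf_is_first(classpath, conf_dir, sep=os.pathsep):
    """Return True if conf_dir appears in classpath before any .jar entry."""
    conf = os.path.normpath(conf_dir)
    for entry in classpath.split(sep):
        if not entry:
            continue                 # skip empty entries such as "a::b"
        if os.path.normpath(entry) == conf:
            return True              # reached the conf directory before any jar
        if entry.lower().endswith(".jar"):
            return False             # a jar is listed ahead of the conf directory
    return False                     # conf directory is not on the classpath at all
```

For example, conf_is_first("conf:nutch-0.9.jar", "conf", sep=":") returns True, while conf_is_first("nutch-0.9.jar:conf", "conf", sep=":") returns False. Pass the same spelling of the conf path that appears in your classpath, since the comparison is on the path text only.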

i-2: You forgot to identify yourself (your instance of the Nutch spider) in your config files.

The authors of Nutch do not want you tormenting some poor web master's site with their tool. One way to create a feedback loop, so that web masters have a way to contact you or adjust their robots.txt file, is to have your spider identify itself and put your email address in their site's log file.

In your nutch-site.xml file, make sure you create properties for at least:

http.agent.url
http.agent.email
http.agent.name (this one is not optional: the Nutch 0.9 fetcher will not start if it is empty)

The URL should point to a valid page on your site, preferably one that explains why you are spidering web sites, and the email should be a valid address where you can be reached.
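
If you want to double-check a config file for these entries, something like the following Python sketch may help. It assumes the usual configuration / property / name / value layout of nutch-site.xml. The function name and the sample values (ExampleSpider, example.com) are invented for illustration.

```python
import xml.etree.ElementTree as ET

AGENT_PROPS = ("http.agent.name", "http.agent.url", "http.agent.email")

def missing_agent_props(xml_text):
    """Return the http.agent.* property names that are absent or empty."""
    root = ET.fromstring(xml_text)
    values = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        values[name] = value
    return [p for p in AGENT_PROPS if not values.get(p)]

# Made-up sample config; the email property is deliberately left out.
SAMPLE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>ExampleSpider</value>
  </property>
  <property>
    <name>http.agent.url</name>
    <value>http://www.example.com/spider.html</value>
  </property>
</configuration>"""

print(missing_agent_props(SAMPLE))  # prints ['http.agent.email']
```

To run it against your own file, read the text with open("conf/nutch-site.xml").read() and pass that in. An empty list means all three properties have a value. It says nothing about whether Nutch is actually loading that file.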

i-3: If using custom plugins, they're not working correctly.

If you're using custom plugins, you might try turning them off for a quick test. This is set with a pattern in the config files (the plugin.includes property), but if you're using custom plugins, we figure you know how to change this.

i-4: Some other config error...

For debugging "other" issues, we advise the "divide and conquer" approach. Try a much simpler configuration and see if that works. If it doesn't, make it even simpler. If it does work, start adding back in your other config items.

s-0: General spidering tips

There are a few things to keep in mind when spidering.

s-1: There's something wrong with your seed page(s)

Are you sure the URL exists? Maybe check for typos?

Also, make sure to include the http:// prefix.

And if it points to the top level of a server, or to a folder, try including the trailing slash. Use http://foo.com/ instead of http://foo.com, and use http://foo.com/support/ instead of http://foo.com/support.
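
These seed checks are easy to script. Below is a minimal Python sketch. The function name is ours, and the "no dot in the last path segment" test is only a guess at what might be a folder, so treat its notes as hints rather than verdicts.

```python
from urllib.parse import urlsplit, urlunsplit

def check_seed(url):
    """Return (suggested_url, notes) for one seed URL."""
    notes = []
    url = url.strip()
    if "://" not in url:
        url = "http://" + url
        notes.append("added http:// prefix")
    parts = urlsplit(url)
    path = parts.path
    if path == "":
        # Top level of a server: use http://foo.com/ rather than http://foo.com
        path = "/"
        notes.append("added trailing slash after host")
    elif not path.endswith("/") and "." not in path.rsplit("/", 1)[-1]:
        # No file extension in the last segment, so it may be a folder.
        notes.append("last path segment may be a folder; try a trailing slash")
    fixed = urlunsplit((parts.scheme, parts.netloc, path,
                        parts.query, parts.fragment))
    return fixed, notes
```

For instance, check_seed("http://foo.com") suggests "http://foo.com/", and check_seed("http://foo.com/support") leaves the URL alone but flags that a trailing slash might be needed. The sketch never adds a slash to a path on its own, because it cannot tell a folder from a file without asking the server.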

s-2: There's something wrong with your include or exclude patterns

Carefully check both crawl-urlfilter.txt and regex-urlfilter.txt in your conf directory.

s-3: Your seeds or intermediate pages have redirects, possibly subtle, that are counting as "levels".

It's our understanding that redirects do count as a "level", even for seed pages.

For example, let's say one of your seeds was http://www.somesite.com and you were just spidering one level deep, with the command line option -depth 1 (because maybe you just want to index the home page for a long list of sites).

This is likely to NOT work.

To understand why, try this with your web browser: paste in http://www.somesite.com and then carefully check the URL address bar. It's probably now got http://www.somesite.com/

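Notice the extra trailing slash. On a bare host name the browser usually adds it by itself; on a folder such as /support it is typically a real redirect sent by the server. Either way, it's so common that most people don't even think about it. But in that case you'd need to either add the / to your seed, or increase -depth to 2.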

This redirect issue also impacts the depth count for subsequent pages, so if you have http://www.somesite.com with a link to http://www.somesite.com/support, you'd actually need -depth 4 to allow spidering the full sequence of:

http://www.somesite.com
http://www.somesite.com/
http://www.somesite.com/support
http://www.somesite.com/support/

There's probably some config option to change this, but it's something to be aware of.
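
To see which hops a server really sends for a given URL, you can request it without following redirects and list each step. This is a minimal Python sketch, not anything Nutch provides. The function names are ours, and it only reports redirects the server actually sends (3xx responses with a Location header). For an empty path it requests "/", as HTTP requires.

```python
import http.client
from urllib.parse import urlsplit, urljoin

def fetch_once(url):
    """Issue one HEAD request without following redirects.

    Returns (status, location); location is None unless the server sent one.
    """
    parts = urlsplit(url)
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    conn = conn_cls(parts.netloc, timeout=10)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()

def redirect_chain(url, fetch=fetch_once, limit=10):
    """Return the list of URLs visited, starting with url itself."""
    chain = [url]
    for _ in range(limit):
        status, location = fetch(chain[-1])
        if status not in (301, 302, 303, 307, 308) or not location:
            break
        # Location may be relative, so resolve it against the current URL.
        chain.append(urljoin(chain[-1], location))
    return chain
```

Calling redirect_chain("http://www.somesite.com/support") against a live server returns every URL on the way to the final page, so a result with two entries means one redirect. The fetch parameter exists only so the logic can be tried with a fake lookup instead of real network calls.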

s-4: Java runtime error, such as memory.

Try increasing the amount of memory available to Java by adding the command line argument:

-Xmx100m (to allow 100 megs)

-Xmx500m (to allow 1/2 gig)

-Xmx1g (for 1 Gig)

The maximum on 32-bit Java is usually around 1700m (1.7 Gigs).

Heck, on a 64-bit operating system with a 64-bit JVM, you could even try -Xmx6g (to allow 6 Gigs, if you've got the chips!)

s-5: Mime-type issues

Maybe you're getting a type of document that Nutch wasn't expecting. And note, checking the HTTP headers from the server is not sufficient - Nutch makes its own determination of the MIME type, and we've seen it override the server.

To force odd MIME types into using the text filter you can edit the file conf/parse-plugins.xml

As an example, to force text based email message files into text add an entry:

<mimeType name="message/rfc822">
	<plugin id="parse-text" />
</mimeType>

The conf/mime-types.xml file is more complex; it allows you to tell Nutch how to recognize new types of files, sort of like the Unix "file" command. The two files work in tandem: mime-types.xml identifies a file type, and then parse-plugins.xml tells Nutch what to do with it.

s-6: Security issues

Did you check whether the page requires some type of login or authentication to access? Or maybe your seed page is really https: instead of http:. Whatever the problem is, Nutch can probably handle it, though it may take a bit of research on the web to find out how.

s-7: robots.txt issues

It's possible a site is blocking your spider, or all spiders. Maybe they don't want to be indexed at all. More often, though, sites with a robots.txt file are just trying to help you avoid indexing "junk" like log files. So just because there's a robots.txt file on their site, it doesn't necessarily mean "go away".

There are many articles online about robots.txt, if you think that's causing a problem.
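
For a quick offline check of whether a given robots.txt would block a URL, Python's standard urllib.robotparser can evaluate the rules for you. This is a minimal sketch: the agent name ExampleSpider and the sample rules are made up, and Nutch uses its own robots.txt parser, so results could differ in edge cases.

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt, agent, url):
    """Return True if the robots.txt text lets the given agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

# Made-up rules that only keep spiders out of a log directory.
ROBOTS_SAMPLE = """\
User-agent: *
Disallow: /logs/
"""

print(allowed(ROBOTS_SAMPLE, "ExampleSpider",
              "http://www.example.com/index.html"))       # prints True
print(allowed(ROBOTS_SAMPLE, "ExampleSpider",
              "http://www.example.com/logs/access.log"))  # prints False
```

Paste in the real robots.txt text from the site you are spidering, and use the same agent name you configured for your spider.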

s-8: Other spidering problems...

We again repeat the divide and conquer mantra. Try a MUCH simpler spider task and get that working first.

Closing

In closing, I would advise some patience and faith.

We've seen Nutch and Lucene do some amazing things, and the stuff does work once you get it configured and running. Fast, too!