Enterprise Search Blog

Ask Doctor Search: GSA Crawler is Missing Sites

A customer writes:

I have had a Google appliance for about a month now, and it works well for most of the internal sites we need to crawl. What is driving me crazy is that two sites we need in our index just don’t seem to get indexed no matter what I try. Our older spider from another company never had any problem with these two sites: is there some hidden collection limit or content size issue that might be causing the problem?

Dr Search replies:

The bad news is that just because one spider can index a site does not mean that other spiders can access it. Before we get into that, let’s start with the easy part.

Log on to your Google appliance (GSA) console and look at the Google Search Appliance > Crawl and Index > Crawl URLs page. I’ll assume you have the top level of your missing sites listed in the ‘Start Crawling from’ input field, and that you have also included the missing site patterns in the ‘Follow and Crawl Only’ input field. Next, look at the input field labeled ‘Do Not Crawl URLs with the Following Patterns’ and make sure that the missing sites do not match any of these excluded patterns. You can easily verify that your starting points do not match the excluded patterns by using the ‘Test these patterns’ link.

Check Your Web Server Logs

The next question to answer is this: Has the GSA ever visited the missing web sites? I suggest you go to the web server log on the missing servers. Check to see whether your GSA crawler has ever tried to fetch pages. You should see lines something like the one in Figure 1.

2008-04-27 21:17:58 - 80 GET /robots.txt - 404 gsa-crawler+(Enterprise;+S5-JD4J8XT6T4JJS;
2008-04-27 21:17:58 - 80 GET /index.html - 200 gsa-crawler+(Enterprise;+S5-JD4J8XT6T4JJS;

Figure 1: Web Server Logfile entries

When anyone views a web page, the web server log captures the details. What you want to look for in your log file is any activity from your GSA crawler, which identifies itself with a User Agent string that begins with ‘gsa-crawler’. The GSA also includes in that string a portion of the GSA license identifier and the email address of the GSA administrator. This last item lets webmasters know whom to contact if the crawler is unwanted.

The numeric value just before the Agent Name is the HTTP status code your web server returned to the GSA. A value of ‘200’ is good: that’s a normal return. Other return values can be normal too, so don’t worry too much if you see them (see the Resources section below for pointers to more information on web server status codes). The status code to watch for is ‘404’: file not found. In Figure 1, the request for robots.txt returned 404, which tells you that our internal site has no robots.txt, so the GSA goes ahead and fetches content. If your site does have a robots.txt, however, that is the next place to look for the cause of your problem.
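If your logs are large, a short script can pull out the crawler's requests for you. The Python below is only a sketch: it assumes each log entry sits on a single line laid out like Figure 1 (method, path, query, status, user agent), and the names crawler_hits and summarize are made up for this illustration. The two gsa-crawler sample lines come from Figure 1 (truncated exactly as shown there); the browser line is invented for contrast. Adjust the pattern to whatever log format your web server actually writes.

```python
import re
from collections import Counter

# Assumed field layout, modeled on Figure 1:
#   ... METHOD path query status user-agent
ENTRY = re.compile(
    r"\b(?P<method>GET|HEAD|POST)\s+(?P<path>\S+)\s+\S+\s+"
    r"(?P<status>\d{3})\s+(?P<agent>\S+)"
)


def crawler_hits(lines, agent_token="gsa-crawler"):
    """Return (path, status) pairs for requests whose user agent contains agent_token."""
    hits = []
    for line in lines:
        match = ENTRY.search(line)
        if match and agent_token in match.group("agent").lower():
            hits.append((match.group("path"), int(match.group("status"))))
    return hits


def summarize(hits):
    """Count crawler requests per HTTP status code."""
    return Counter(status for _, status in hits)


if __name__ == "__main__":
    sample = [
        # First two lines are from Figure 1, truncated as in the article.
        "2008-04-27 21:17:58 - 80 GET /robots.txt - 404 gsa-crawler+(Enterprise;+S5-JD4J8XT6T4JJS;",
        "2008-04-27 21:17:58 - 80 GET /index.html - 200 gsa-crawler+(Enterprise;+S5-JD4J8XT6T4JJS;",
        # Hypothetical browser request, which should be ignored.
        "2008-04-27 21:18:02 - 80 GET /index.html - 200 Mozilla/4.0+(compatible;+MSIE+7.0)",
    ]
    hits = crawler_hits(sample)
    if not hits:
        print("No gsa-crawler requests found: the GSA may never have reached this server.")
    for path, status in hits:
        print(status, path)
    print(dict(summarize(hits)))
```

An empty result is the interesting case: it suggests the GSA never reached the server at all, which points back to the crawl patterns or to the network rather than to robots.txt.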

Guardian at the gates: robots.txt

The GSA always looks for a robots.txt file before it proceeds: this is a friendly way to find out whether the webmaster wants to disallow any or all parts of a web site to crawlers, spiders, and other automatic gathering tools.

A simple robots.txt file is shown in Figure 2.

User-Agent: *
Disallow:

Figure 2: A Simple robots.txt

Entries in this file start with the User-Agent name or, in this case, a wildcard that matches the User Agent you saw earlier in the web server log file. The asterisk here means ‘all user agents’; we’ll get to some specific examples in a moment. After the User-Agent line come optional lists of what content is disallowed or specifically allowed for that agent. Figure 2 shows a robots.txt that allows full access to all user agents, because of the asterisk and because the Disallow field is empty. Figure 3, on the other hand, restricts all crawler access.

User-Agent: *
Disallow: /

Figure 3: Restrictive robots.txt

If your missing sites have robots.txt files, make sure to allow the GSA access. Figure 4 shows a robots.txt file that allows the GSA to access all content while disallowing all other crawlers.

User-Agent: gsa-crawler
Disallow:

User-Agent: *
Disallow: /

Figure 4: Disallow All Crawlers Except gsa-crawler
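Before you change robots.txt on a live server, you can sanity-check the rules offline. The sketch below uses the robots.txt parser from Python's standard library (urllib.robotparser) as a stand-in; it is not the GSA's own parser, so treat the answers as a close approximation rather than a guarantee. The helper name allowed and the crawler name ‘somebot’ are made up for this illustration, and the gsa-crawler record is written with an explicit empty Disallow line so that it stays separate from the catch-all record.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, user_agent, url="/index.html"):
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Everyone may crawl everything: the Disallow value is empty.
OPEN_TO_ALL = "User-Agent: *\nDisallow:\n"

# Nobody may crawl anything.
CLOSED_TO_ALL = "User-Agent: *\nDisallow: /\n"

# gsa-crawler gets its own record with nothing disallowed;
# the blank line starts a second record that blocks every other crawler.
GSA_ONLY = (
    "User-Agent: gsa-crawler\n"
    "Disallow:\n"
    "\n"
    "User-Agent: *\n"
    "Disallow: /\n"
)

if __name__ == "__main__":
    for name, text in [("open", OPEN_TO_ALL),
                       ("closed", CLOSED_TO_ALL),
                       ("gsa only", GSA_ONLY)]:
        print(name,
              "gsa-crawler:", allowed(text, "gsa-crawler"),
              "somebot:", allowed(text, "somebot"))
```

With this parser, the first file lets both agents in, the second blocks both, and the third admits gsa-crawler while turning ‘somebot’ away. Note that listing the two User-Agent lines back to back with a single Disallow: / beneath them would put gsa-crawler in the same record as the asterisk, and it would be blocked along with everyone else.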

In addition to Disallow, robots.txt supports an Allow directive as well; note that not all crawlers respect all aspects of all entries.

Why Doesn't GSA work like my Other Spiders?

The answer here is that robots.txt is a request to crawlers, not a demand. The GSA, like Google, Yahoo, and other respectable web crawlers, tends to respect robots.txt directives, although there may be some variation. But nothing requires any crawler to respect any of your robots.txt directives.

Many commercial engines, including Autonomy’s vspider and Ultraspider and FAST’s crawler, respect robots.txt by default, but they also allow you to ignore its directives. The rationale is that, since you are presumably indexing your own content with those products, you should be able to override your own directives. If you set up your previous crawler with instructions to override robots.txt, it may have fetched content with no difficulty. The GSA, on the other hand, has no option to override robots.txt, so you must set up robots.txt properly on each site you want it to crawl.

I expect the steps in this article will have solved your problem with crawling all of your internal sites. If not, email us and we can drill down on your specific issue.


Resources

For additional information on using robots.txt, visit:

HTTP status return values are discussed on these sites:

We hope this has been useful to you; feel free to contact Dr. Search directly if you have any follow-up or additional questions. Remember to send your enterprise search questions to Dr. Search. Every entry (with name and address) gets a free cup and a pen, and the thanks of Dr. Search and his readers.