Browsershots.org is a fantastic resource for web developers because it lets you test how a web page will look in many different browsers. One common problem with the service, however, is the robots.txt error people get when using it: “Browsershots was blocked by yoursite.com/robots.txt”
Here’s the short version of why you are getting this Browsershots robots.txt error and how to fix it:
Because you are developing a test site for a client, you naturally have a disallow statement in the robots.txt file covering your live development directory. Because of that disallow statement, Browsershots can’t access the directory to create the preview files. What you need to do is temporarily change the robots.txt so that Browsershots can reach the specific directory where your site is located. Once Browsershots finishes its work, you can revert your robots.txt file to its original disallow statement.
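If you just want the gist, here is a rough sketch of that temporary change. The /test-site directory name is only a placeholder for wherever your development copy actually lives:
[shell]
# Normal state: keep all crawlers out of the development directory
User-agent: *
Disallow: /test-site
[/shell]
and, only while Browsershots is taking its screenshots:
[shell]
# Temporary state: the Disallow line is swapped for an Allow
# so the directory is reachable again
User-agent: *
Allow: /test-site
[/shell]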
Below is a more detailed explanation of this concept:
Background
The robots.txt file is a proverbial “gatekeeper” that tells a search engine spider which directories it may and may not index. Not every crawler respects these instructions, but the major search engines, Google included, currently do. In general, a production website has a robots.txt file to control which parts of the site spiders are allowed to crawl and index.
In addition to “telling” web spiders what they can look at and index, the robots.txt file can also list what spiders shouldn’t see. Let’s look at an example:
Assume you have a web site with a root /, a /content directory, and a /development directory, and you only want the search engines to index the root / and /content directories, but not the /development directory.
You could write a robots.txt file that looks like this:
[shell]
User-agent: *
Allow: /
Disallow: /development
[/shell]
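Note that Allow is technically an extension to the original robots.txt convention, but the major crawlers, including Googlebot, honor it. Each Disallow line covers a single path prefix, so if you also had, say, a /staging directory (a made-up name for this example) that you wanted hidden, you would just add another line:
[shell]
User-agent: *
Allow: /
Disallow: /development
Disallow: /staging
[/shell]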
Real-World Example
Moving back to how robots.txt affects Browsershots.org and causes this error: most web developers have a domain they use for testing, so they can put development versions of websites online for clients to see. Because the test site is live on the internet, albeit on a test domain rather than the client’s actual domain, Google and other search engines may pick it up and index it, especially if the test site stays up for a while. While you eventually want your client’s site indexed, you do not want a search engine indexing a client’s development website. Imagine you are building a website called helpspa.com (what a great idea), and when someone searches for helpspa.com on Google they get sent to www.developersite.com/test/helpspa.com instead. You will not have a happy client!
Thus, to leave the test site up for the client to see while keeping Google and other search engines away from it, you create a robots.txt file that blocks the test directory. Using the example above, just substitute the directory you want to hide from search engine spiders into your own robots.txt file, and make sure that copy is the one on the server. But now when you go to browsershots.org, you will get the error about Browsershots not being able to access the site. So you go back into robots.txt and either remove the disallow statement or modify the file so that your directory is available for Browsershots to view. Then, when Browsershots is done, just remember to change the robots.txt file back to what it was, with the disallow statement.
To give you an example of how I do it: I have a test domain for my client sites. I’ll make up a name here, but the concept is the same. Here is the directory structure:
www.myfictitioustestsite.com/client1
www.myfictitioustestsite.com/client2
www.myfictitioustestsite.com/client3
www.myfictitioustestsite.com/client4
I have a robots.txt that looks like this:
[shell]
User-agent: *
Disallow: /
[/shell]
In this manner, everything is blocked from search engines while I work. If I want to use Browsershots to test the development site for client 2, I temporarily change it to something like this:
[shell]
User-agent: *
Allow: /client2
Disallow: /
[/shell]
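As an alternative sketch, if Browsershots announces itself with its own User-agent token (I’m assuming “Browsershots” here; check your server access logs to confirm what the service actually sends), you could give just that agent free rein while everything else stays blocked:
[shell]
# "Browsershots" is an assumed user-agent token -- verify it in your
# access logs before relying on this approach
User-agent: Browsershots
Disallow:

User-agent: *
Disallow: /
[/shell]
An empty Disallow value means “nothing is disallowed” for that agent, which is part of the original robots.txt convention. The upside of this approach is that Google and the other search engines see Disallow: / the whole time, so there is nothing to remember to switch back afterwards.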
So that’s the story — let me know if you have any questions.