The robots.txt file of a website is something that is rarely understood – and for good reason: the optimised robots.txt file is different for every website. Sure, you’ll find identical files here and there, but a correctly optimised robots.txt file will most likely differ from site to site.
To understand how to set up a robots.txt file correctly, we first need to understand what it is, what it does, the different elements of the file, and why and when we would use one.
The robots.txt should be thought of as a set of directions for web crawlers (AKA robots, bots, crawlers, spiders). It acts as a sort of ‘signpost’ to tell crawlers where they should and shouldn’t go on a website. Do not confuse robots.txt with a sitemap – they are completely different.
A sitemap, by contrast, tells crawlers which pages you do want them to find – but you already knew that, right?
Robots.txt is not suitable for hiding content on a website. The file is public and anyone can access it by adding /robots.txt to the end of a site’s root domain; see for yourself: https://www.alphadigital.com.au/robots.txt
Any URL added to the robots.txt file is publicly visible – which defeats the purpose of trying to hide pages in the first place.
The file is also not a ‘be all and end all’ of controlling web crawlers. Most of the ‘main’ search engines and crawlers will obey most rules in a robots.txt file, but plenty of crawlers will ignore these instructions and go anywhere they like. These tend to be malicious bots used for scraping email addresses or other such unethical practices – technically anyone can create a crawler, so resistance is futile; there’s no need to worry about the robot uprising for now though.
Typically, the main use for the robots.txt file is to save on crawl budget by telling the crawlers that they don’t need to visit every single section of your website. Back-end files and folders that hold little value to humans do not need to be crawled by search engines, so crawlers can instead focus on the pages that matter.
There are only a few elements to learn in a robots.txt file and all are fairly self-explanatory. Some are more common than others, some are not recommended – we’ll cover them all here. Note that all pages, files and folders are Allowed unless specified otherwise.
User-agent: [name of robot(s)] – Specifies which robot(s) the following instructions are for.
Disallow: [URL path] – Asks robots not to crawl the specified path.
Allow: [URL path] – Allows robots to crawl the specified path. Overwrites Disallow.
Sitemap: [sitemap location] – Specifies the sitemap location.
# – Used for commenting in the file. For humans only – ignored by robots.
* – ‘Wildcard’ character. Matches any sequence of characters, so multiple files within the same directory can be targeted without specifying each individually.
$ – Marks the end of a URL.
Crawl-delay: [time in seconds, e.g. 10] – Tells robots to wait a set amount of time between page crawls (uncommon; not honoured by Googlebot, and interpreted by Bingbot as a time window, i.e. how long to spend on the site).
Noindex: [URL path] – Tells search engines to remove the page from their index (uncommon, not recommended, and no longer supported by Google).
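You can sanity-check how the core directives interact using Python’s standard-library robots.txt parser. Note that urllib.robotparser implements the original prefix-matching spec and does not understand Google’s * and $ wildcard extensions; the file contents and example.com URLs below are made up for illustration.

```python
from urllib import robotparser

# A minimal robots.txt written inline as a string (hypothetical example).
# The Allow line comes before the Disallow so that parsers reading the
# file top to bottom reach it first.
rules = """\
User-agent: *
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-admin/
Sitemap: https://www.example.com/sitemap.xml
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Ask whether a given user-agent may fetch a given URL.
print(rp.can_fetch("*", "https://www.example.com/wp-admin/admin-ajax.php"))  # allowed
print(rp.can_fetch("*", "https://www.example.com/wp-admin/options.php"))     # blocked
print(rp.can_fetch("*", "https://www.example.com/blog/"))                    # allowed by default
```

This mirrors the rule above: everything under /wp-admin/ is disallowed except the explicitly allowed admin-ajax.php file, and any path with no matching rule defaults to allowed.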
Note that the file is case-sensitive: /Alpha-Digital/ is not the same as /alpha-digital/. Be very wary of this when writing a robots.txt file.
When a specific User-agent is specified, the instructions below it only count for that robot; other bots will ignore those instructions unless further rules are included and directed at them or at all bots.
If no Disallow is specified, all pages will be crawled.
Place Allow directives before the Disallows for bots that read the file in order of appearance. In the case of the Google and Bing bots this does not matter, however for the sake of ensuring the results are the same universally it is important to consider this.
A Sitemap directive is not necessary in robots.txt, although including one is generally seen as good practice.
For example, Disallow: /do-not-crawl/ will stop conforming crawlers from crawling anything within the /do-not-crawl/ directory and its subdirectories. Further uses will be covered later.
Note that paths are relative to the root of the site, so there is no need to include the http://www.domain.com section of the URL.
[Screenshot of the Alpha Digital robots.txt file]
The file starts with the User-agent set to the wildcard character (*), meaning the following rules apply to all conforming robots.
The /wp-admin/admin-ajax.php path is first set to Allow; this overwrites the Disallow below it and lets only that specified path be crawled. For reference: this is part of the WordPress default robots.txt file. The
/wp-admin/ area is being disallowed as it is not relevant to ‘normal’ users of the site and crawling it would be unnecessary.
The next two lines Allow the crawlers to crawl anything within the /wp-includes/ directory (and subdirectories) that ends with .js or .css. We are targeting ‘anything’ that ends with these file extensions using the wildcard character: crawlers will essentially ignore the name of the file and match any file within this directory that has these extensions.
We then Disallow the rest of the /wp-includes/ directory (and subdirectories). Again, there are a lot of files in this directory that have little value to humans and do not need to be crawled.
Next we Disallow any file ending with .php. Because we have used just a / and not specified a directory, robots will match any file across the entire site ending with .php, as opposed to files within a specified directory.
Finally, we Disallow any URL that contains a question mark (?). As these URLs can be dynamic and/or contain information that differs per user, there could be thousands of them. It is unnecessary for Google to waste budget looking for and crawling all these pages, which are most likely all the same or very similar.
This is a small and simple solution for a WordPress site that will save on crawl budget whilst ensuring the CSS and JS files can be crawled. Google has recently stated that its crawler favours being able to render a site completely and accurately, to get a better idea of how an end user will see the site.
Be aware that even when you think you’ve become a master of robots.txt files, mistakes are very easy to make and can have an incredibly detrimental effect on a site’s performance in search engines – it can even drop out of results completely.
Luckily for us, Google has been kind enough to create a tool that helps us test the functioning of a robots.txt file. Make sure you’re logged into your Search Console account and follow this link: https://www.google.com/webmasters/tools/robots-testing-tool
Here I will delve into running a couple of tests to give a practical example of how the tool works. If you don’t have a Search Console account or don’t want to run the tests, you can either read through the examples without the practical reference or skip ahead to the Meta Robots section below.
Once logged in, select an account (for testing purposes any account is fine), and paste the code we used as an example, ensuring any previous code is removed:
User-agent: *
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-admin/
Allow: /wp-includes/*.js
Allow: /wp-includes/*.css
Disallow: /wp-includes/
Disallow: /*.php
Disallow: /*?*
Using the example URIs below, paste them into the testing box at the bottom of the page one at a time, hit “Test” and see the results.
Notice that these are all made-up URIs and do not exist. This tool is for testing purposes only and is not performing a ‘live’ test, but rather simulating the result. This means we can make up URIs to test directories or file paths to ensure we get the expected results.
Notice how the first example is blocked, as we specified Disallow for anything inside the /wp-includes/ folder, yet the second example (which is also in the /wp-includes/ folder) is allowed. This is due to the Allow: /wp-includes/*.js line of the robots file. As mentioned earlier, the Allow element overwrites the Disallow element, and using the wildcard character (*) we can target any file with the .js extension (or .css, as per the third example).
Without first testing this next example (if you haven’t already) try to predict what the outcome will be:
The testing tool highlights the line that is responsible for blocking a disallowed page. For this example you will see it is the disallowing of the /wp-includes/ folder that is responsible for blocking the /wp-includes/css/do-not-crawl.php file, and not the line that targets any file with the .php extension. This is because the file is (generally) read by robots from top to bottom, as a human would typically read it. If you remove the line that disallows the /wp-includes/ folder, you will see that the URI is still blocked due to it being a .php file. Changing the extension of the URI from .php to .css will then result in the URI being allowed.
Be aware that although the robots.txt file is actioned from top to bottom, the Google and Bing crawlers take every line into account before they action anything. This means that even if you move Disallow: /*.php to the second line (just below the User-agent line), the /wp-admin/admin-ajax.php URI is still allowed.
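One way to reproduce the tool’s verdicts for these examples is a toy matcher based on Google’s documented most-specific-rule semantics: the longest matching pattern wins, with Allow winning ties. This is an illustrative sketch only – the function names and test paths are my own, and it ignores details like percent-encoding and user-agent selection.

```python
import re

def rule_matches(pattern, path):
    # Translate a robots.txt pattern into a regex: * matches any run of
    # characters and a trailing $ anchors the end of the URL path.
    regex = re.escape(pattern).replace(r"\*", ".*")
    if regex.endswith(r"\$"):
        regex = regex[:-2] + "$"
    return re.match(regex, path) is not None

def is_allowed(rules, path):
    # Longest matching pattern wins; Allow beats Disallow on a tie.
    # Paths with no matching rule are allowed by default.
    best = None
    for allowed, pattern in rules:
        if rule_matches(pattern, path):
            key = (len(pattern), allowed)
            if best is None or key > best:
                best = key
    return True if best is None else best[1]

# The example file from the walkthrough, as (allow?, pattern) pairs.
rules = [
    (True,  "/wp-admin/admin-ajax.php"),
    (False, "/wp-admin/"),
    (True,  "/wp-includes/*.js"),
    (True,  "/wp-includes/*.css"),
    (False, "/wp-includes/"),
    (False, "/*.php"),
    (False, "/*?*"),
]

print(is_allowed(rules, "/wp-admin/admin-ajax.php"))           # True (allowed)
print(is_allowed(rules, "/wp-includes/js/jquery.js"))          # True (allowed)
print(is_allowed(rules, "/wp-includes/css/do-not-crawl.php"))  # False (blocked)
print(is_allowed(rules, "/contact?session=abc"))               # False (blocked)
```

Note how the longest-match rule explains the tool’s behaviour: /wp-admin/admin-ajax.php stays allowed regardless of where Disallow: /*.php appears, because the 24-character Allow pattern beats the 6-character Disallow.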
Hopefully this practical example helps you understand what is happening and why, and how you can leverage special characters and the different elements of the robots.txt file to avoid having to specify every individual page or file that you want to Allow. Be sure to use this tool to thoroughly test a newly created robots.txt file before implementing it live on a site, to save any headaches that may occur from seemingly minor errors.
As we covered near the start of this document, robots.txt is not the correct place to block the indexing of pages. The reason is that Google indexes pages based on links to them.
For example, imagine you have specified Noindex: /do-not-index.html in the robots.txt file. When Google (or any search engine crawler) visits your site, it will first read the robots.txt file and see that you do not want the /do-not-index.html page indexed – the crawler will (usually) honour this and not index the file – great! But if an external web page links to the /do-not-index.html page, Google may index the page based on the link from the external site and ‘ignore’ your robots.txt file.
To counter this we use a page-level meta tag between the
<head></head> tags. These meta tags are much like meta titles or descriptions and share the same basic HTML markup, only they specify on a page-level how you would like search engines to handle the page. This way you can allow robots to crawl the page, follow the links on it and pass on any applicable link value, but not index the page in their listings. Here’s what the tag looks like:
<meta name="robots" content="noindex">
It’s a nice simple piece of code that can be customised based on what you would like to achieve. The above example specifies the meta name “robots” (which applies to all conforming bots, much like User-agent: * does in robots.txt), and the content (or ‘desirable action’ in this case) is noindex, meaning “please do not add this page to your index”.
As mentioned, the code can be customised in a number of ways, for example:
<meta name="Googlebot" content="noindex,nofollow">
This meta tag specifically targets the Googlebot crawler (Google’s main search crawler) and asks it not to index the page and not to follow the links on the page. As with robots.txt, any bot can be specified individually. Nofollow links are another topic in themselves and won’t be covered here, but for completeness I have included it in the example.
Note that other content values for robots meta tags are “index” and “follow”, the respective opposites of the directives above.
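When auditing a site, it can be handy to check pages for these meta tags programmatically. Here is a minimal sketch using Python’s standard html.parser module – the class name and sample HTML are my own, for illustration only.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the directives of any <meta name="robots"> tags on a page."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "meta" and attrs.get("name", "").lower() == "robots":
            content = attrs.get("content", "")
            self.directives += [d.strip().lower() for d in content.split(",")]

# A made-up page for demonstration.
page = '<html><head><meta name="robots" content="noindex, follow"></head><body>Hi</body></html>'

parser = RobotsMetaParser()
parser.feed(page)
print(parser.directives)               # ['noindex', 'follow']
print("noindex" in parser.directives)  # True
```

In a real audit you would feed the parser the fetched HTML of each page and flag any page that unexpectedly carries noindex.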
Robots.txt can be a bit tricky to get to grips with, but with a little planning and testing you can have an optimised file put together in no time.
Now, enter into the world of robots.txt and take control of the crawlers!