The robots.txt file of a website is something that is rarely understood – and for good reason. The optimised robots.txt file is different for every website. Sure, you’ll find files that are the same here and there, but a correctly optimised robots.txt file will most likely be different from site to site.

To understand how to set up a robots.txt file correctly we first need to understand what it is, what it does, the different elements of the file, and why we would use one / what we use one for.

What robots.txt is:

The robots.txt should be thought of as a set of directions for web crawlers (AKA robots, bots, crawlers, spiders). It acts as a sort of ‘signpost’ to tell crawlers where they should and shouldn’t go on a website. Do not confuse robots.txt with a sitemap – they are completely different.

What robots.txt is NOT:

A sitemap – but you already knew that, right?

Robots.txt is not suitable for hiding content on a website. The file is public and anyone can access it by adding /robots.txt to the end of a site’s root domain; see for yourself: https://www.alphadigital.com.au/robots.txt

Any URL added to the robots.txt file is therefore publicly visible – which defeats the purpose of trying to hide those pages in the first place.

The file is also not a ‘be all and end all’ of controlling web crawlers. Most of the ‘main’ search engines and crawlers will obey most rules in a robots.txt file, however there are plenty of crawlers that will ignore these instructions and go anywhere they like. These tend to be malicious bots used for scraping email addresses or other such unethical practices – technically anyone can create a crawler, so resistance is futile; there’s no need to worry about the robot uprising for now though.

Why stop crawlers from crawling pages?

Typically, the main use for the robots.txt file is to save on crawl budget by telling crawlers that they don’t need to visit every single section of your website. Back-end files and folders that hold little value to humans do not need to be crawled, so search engines can instead focus on the pages that matter.

Elements of a robots.txt file

There are only a few elements to learn in a robots.txt file and all are fairly self-explanatory. Some are more common than others, and some are not recommended – we’ll cover them all here, with a short illustrative file after the list. Note that all pages, files and folders are Allowed unless specified otherwise.

  • User-agent: [name of robot(s)] – This is used to specify which robot(s) the following instructions are for
  • Disallow: [URL path] – Suggests that robots do not crawl the specified path
  • Allow: [URL path] – Allows robots to crawl the specified path. Overrides a matching Disallow.
  • Sitemap: [Sitemap location] – Specifies the sitemap location
  • # – Used for commenting in the file. For humans only – ignored by robots.
  • * – ‘Wildcard’ character. Matches any sequence of characters, so multiple files or directories can be targeted without specifying each one individually.
  • $ – Marks the end of a URL
  • Crawl-delay: [time in seconds, e.g. 10] – Tells robots to wait a set amount of time between page crawls (uncommon, not honoured by Googlebot, interpreted as a time window by BingBot, i.e. how long to spend on the site)
  • Noindex: [URL path] – Tells search engines to remove the page from their index (uncommon, not recommended)
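
To see how a few of these elements fit together, here is a minimal illustrative file – the paths and sitemap URL are made up purely for the example:

# Rules for all conforming bots
User-agent: *
Disallow: /admin/
Allow: /admin/public-info.html
Disallow: /search-results/*

Sitemap: https://www.example.com/sitemap.xml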

Important notes:

  • URLs are case sensitive, meaning /Alpha-Digital/ is not the same as /alpha-digital/. Be very wary of this when writing a robots.txt file.
  • If a specific User-agent is named, the instructions below it apply only to that robot; other bots will ignore those instructions unless further rules are directed at them (or at all bots) – see the example after these notes.
  • Unless Disallow is specified, all pages will be crawled.
  • Always define Allow directives before the Disallows, for the benefit of bots that read the file in order of appearance. For Google and Bing this does not matter; however, for the sake of consistent results across all crawlers it is worth keeping to this order.
  • Sitemap is not necessary in robots.txt, although including it is generally seen as good practice.
  • Any comments will not be recognised by robots and are written for humans only – use them for notes about the markup, or just for fun.
  • The Wildcard character essentially ‘fills in the blanks’ of a URL. e.g.

Disallow: /do-not-crawl/*

This will disallow crawlers from crawling anything within the /do-not-crawl/ directory and subdirectories. Further uses will be covered later.

  • Each instruction MUST be on a new line.
  • Only the URI (file path) needs to be specified, not the full URL, i.e. do not include the http://www.domain.com section of the URL.
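
As a quick illustration of how User-agent grouping and the end-of-URL character work together (the bot name and paths are made up for the example):

# These rules apply to Googlebot only
User-agent: Googlebot
Disallow: /staging/

# These rules apply to every other bot
User-agent: *
Disallow: /*.pdf$

Googlebot will follow the group of rules addressed to it and ignore the group aimed at everyone else, while the $ ensures that only URLs actually ending in .pdf are blocked – a URL such as /brochure.pdf?download=true would still be crawlable.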

Example robots.txt Files

[Screenshot: the Alpha Digital robots.txt file]
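
For reference, these are the directives shown in the screenshot – the same code we will paste into Google’s testing tool later on (the sitemap URL shown here is illustrative):

User-agent: *
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-admin/
Allow: /wp-includes/*.js
Allow: /wp-includes/*.css
Disallow: /wp-includes/
Disallow: /*.php
Disallow: /*?*
Sitemap: https://www.alphadigital.com.au/sitemap.xml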

Here is what is happening in this file:

  1. All crawlers are being referenced using the wildcard character (*) meaning the following rules apply to all conforming robots.
  2. /wp-admin/admin-ajax.php is first set to Allow. The Allow overrides the Disallow below it, so this one file can be crawled even though its parent directory is blocked. For reference: this is part of the WordPress default robots.txt file. The admin-ajax.php file handles asynchronous JavaScript requests, something many WordPress themes use. It is recommended to leave this untouched.
  3. The /wp-admin/ area is being disallowed as it is not relevant to ‘normal’ users of the site and crawling it would be unnecessary.
  4. We then Allow the crawlers to crawl anything within the /wp-includes/ directory (and subdirectories) that ends with .css or .js. We are targeting ‘anything’ that ends with these file extensions using the wildcard character. Crawlers will essentially ignore the name of the file and just search for any file within this directory that has these file extensions.
  5. We then Disallow the /wp-includes/ directory (and subdirectories). Again, there are a lot of files in this directory that have little value to humans and do not need to be crawled.
  6. Disallow any file ending with .php. Because the pattern starts at the site root (/*.php) rather than within a specified directory, robots will match any file across the entire site ending with .php.
  7. Disallow any URL that contains a question mark (?). These URLs can be dynamic and / or contain information that differs per user, so there could be thousands of them. It is unnecessary for Google to waste crawl budget finding and crawling pages that are most likely all the same or very similar.
  8. Referencing the location of the sitemap.

This is a small and simple solution for a WordPress site that will save on crawl budget whilst ensuring the CSS and JS files can be crawled. Google has recently stated that their crawler favours the ability to render a site completely and accurately to get a better idea of how an end user will see the site.

Testing Tool

Be aware that even when you think you’ve become a master of robots.txt files, mistakes are very easy to make and can have an incredibly detrimental effect on a site’s performance in search engines – or even cause it to drop out of the results completely.


Luckily for us, Google have been kind enough to create a tool that helps us test the functioning of a robots.txt file. Make sure you’re logged into your search console account and follow this link: https://www.google.com/webmasters/tools/robots-testing-tool

Running tests:

Here I will delve into running a couple of tests to give a practical example of how the tool works. If you don’t have a Search Console account or don’t want to run the tests, you can either read through the examples without the practical reference or skip ahead to the Meta Robots section below.

Once logged in, select an account (for testing purposes any account is fine), and paste the code we used as an example, ensuring any previous code is removed:

 

User-agent: *
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-admin/
Allow: /wp-includes/*.js
Allow: /wp-includes/*.css
Disallow: /wp-includes/
Disallow: /*.php
Disallow: /*?*

 

Task:

Using the example URIs below, paste them into the testing box at the bottom of the page one at a time, hit “Test” and see the results.

wp-includes/this-is-a-file.html

wp-includes/important-script.js

wp-includes/style.css

 

Notice that these are all made-up URIs and do not exist. This tool is for testing purposes only; it is not performing a ‘live’ test but rather simulating the result. This means we can make up URIs to test directories or file paths and ensure we get the expected results.

Notice how the first example is blocked because we specified Disallow for anything inside the /wp-includes/ folder, yet the second example (which is also in the /wp-includes/ folder) is allowed. This is due to the Allow: /wp-includes/*.js line of the robots file. As mentioned earlier, the Allow element overrides the Disallow element, and using the wildcard character (*) we can target any file with the .js extension (or .css, as per the third example).

Without first testing this next example (if you haven’t already) try to predict what the outcome will be:

wp-includes/css/do-not-crawl.php

The testing tool highlights the line that is responsible for blocking a disallowed page. For this example you will see that it is the disallowing of the /wp-includes/ folder that blocks the /wp-includes/css/do-not-crawl.php file, and not the line that targets any file with the .php extension. This is because the file is (generally) read by robots from top to bottom, as a human would typically read it. If you remove the line that disallows the /wp-includes/ folder you will see that the URI is still blocked, this time because it is a .php file. Changing the extension of the URI from .php to .css will then result in the URI being allowed.

Be aware that although the robots.txt file is actioned from top to bottom, the Google and Bing crawlers will take into account every line before they action anything. This means that even if you move the Disallow: /*.php to the second line (just below the User-agent line) the /wp-admin/admin-ajax.php URI is still allowed.
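
In other words, for Google and Bing the reordered version below behaves exactly the same as the original – the Allow for admin-ajax.php still wins even though a Disallow now sits above it (paste it into the testing tool and re-run the tests to confirm):

User-agent: *
Disallow: /*.php
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-admin/
Allow: /wp-includes/*.js
Allow: /wp-includes/*.css
Disallow: /wp-includes/
Disallow: /*?*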

Hopefully this practical example will help you to understand what is happening and why, and how you can leverage special characters and different elements of the robots.txt file to save having to specify every individual page or file that you want to Disallow / Allow. Be sure to use this tool to thoroughly test a newly created robots.txt file before implementing it live on a site to save any headaches that may occur from seemingly minor errors.

How to Use Meta Robots to Block Page Indexing

As we covered near the start of this document, robots.txt is not the correct place to block the indexing of pages. The reason is that Google can index a page based on links pointing to it, regardless of what the robots.txt file says.

e.g.: Imagine you have specified Noindex: /do-not-index.html in the robots.txt file. When Google (or any search engine crawler) visits your site, they will first read the robots.txt file and see that you do not want to index the /do-not-index.html page – the crawlers will (usually) honour this and will not index the file – great! But if an external web page links to the /do-not-index.html page Google will index the page based on the link from the external site and ‘ignore’ your robots.txt file.

To counter this we use a page-level meta tag between the <head></head> tags. These meta tags are much like meta titles or descriptions and share the same basic HTML markup, only they specify at page level how you would like search engines to handle the page. This way you can allow robots to crawl the page, follow the links on it and pass on any applicable link value, but not index the page in their listings. Here’s what the tag looks like:

<meta name="robots" content="noindex">

It’s a nice simple piece of code that can be customised based on what you would like to achieve. The above example is specifying the meta name “robots” (which applies to all conforming bots, much like the User-agent: * does in robots.txt) and the content (or ‘desirable action’ in this case) is noindex, meaning “please do not add this page to your index”.

As mentioned, the code can be customised in a number of ways, for example:

<meta name="Googlebot" content="noindex,nofollow">

This meta tag specifically targets the Googlebot crawler (Google’s main search crawler) and asks it not to index the page and not to follow the links on the page. As with robots.txt, any bot can be specified individually. Nofollow links are another topic in themselves and won’t be covered here, but for completeness I have included the directive in the example.

Note that the other content values for robots meta tags are “index” and “follow”, the respective opposites of the values above (and the default behaviour when no robots meta tag is present).
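
For example, to achieve the scenario described earlier – let robots crawl the page and follow its links, but keep the page out of the index – the two values can be combined like so:

<meta name="robots" content="noindex,follow">

In practice “follow” is the default behaviour, so content="noindex" on its own has the same effect; spelling it out simply makes the intent explicit.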

To sum up:

Robots.txt can be a bit tricky to get to grips with, but with a little planning and testing you can have an optimised file put together in no time.

Make sure to be very careful when blocking pages, directories and subdirectories to ensure you are not blocking any CSS or JavaScript files as Google uses these to render a website in its entirety and may penalise you for restricting access – test, retest and then test again!

Now, enter into the world of robots.txt and take control of the crawlers!

