Everything You Need to Know About the Robots.txt File

When you’re building your website, you may come across a small text file called robots.txt on your web host server.

Robots.txt, when utilised correctly, can optimise your crawl frequency, which can give your SEO efforts a significant boost.

This article takes a closer look at robots.txt, how to go about using it, and some of the things you should avoid.

What is this robots.txt file?

As you know, search engines work by constantly indexing websites on the internet. To do so, they deploy crawlers that ‘read’ websites and classify them into appropriate categories.

So whenever a search engine user looks for something online, the relevant search results get retrieved.

A robots.txt file is a small but essential file that gives instructions to search engine crawlers about what to do when they arrive at specific pages of your website.

A robots.txt file uses simple syntax (called directives) to instruct crawlers on how to crawl your website.

How to create a robots.txt file?

You can create a robots.txt file using any plain text editor.

But you must place this file in the root directory of the site, and it must be named exactly “robots.txt”. Placing the file anywhere else, or giving it a different name, won’t work.

So if your domain is mywebsite.com, then the robots.txt URL should be:

http://mywebsite.com/robots.txt

Inside the file, rules are addressed to a ‘user-agent’, which refers to the thing that is sending the request. Anything that requests web pages, such as a search engine crawler or a web browser, can be a user agent.
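
Before looking at each directive in detail, here is a minimal sketch of what a complete robots.txt file might look like (the /admin/ path is purely illustrative):

User-agent: *

Disallow: /admin/

Sitemap: http://mywebsite.com/sitemap.xml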

What are the robots.txt directives?

Robots.txt files consist of one or more blocks of directives for the user agent to follow. Here are the directives you will most commonly find in a robots.txt file.

User-agent directive

In robots.txt files, the user-agent directive can be used to specify which crawler should obey a specific set of rules.

This directive can use a wildcard to indicate that the rules apply to all crawlers:

User-agent: *

Or it can be the name of the crawler:

User-agent: Bingbot

Disallow directive

There are several instances where you would want to block search engines from indexing certain pages on your website.

For example, if you have uploaded some eBooks or PDFs that are exclusive to your website members, you would want to prevent them from becoming searchable on Google.

In this type of scenario, follow the user-agent line with one or more disallow directives:

User-agent: *

Disallow: /download-page

Using ‘disallow’ will block all URLs whose path starts with “/download-page,” such as the addresses below:

http://mywebsite.com/download-page

http://mywebsite.com/download-page?filetype=0

http://mywebsite.com/download-page/guides

http://mywebsite.com/download-pages-and-how-to-keep-them-out-of-search-results

This does not block any URL whose path does not start with “/download-page.” For example, the following URL will not be blocked:

http://mywebsite.com/downloads/download-page

You need to understand that ‘disallow’ is a simple text match.

Anything that follows after the “Disallow:” is treated as a simple string of text (except for * and $, which I’ll explain below).

This string is compared to the start of the path part of the URL (everything from the first slash after the domain name to the end of the URL), which is also treated as a simple string.

If there is a match, the URL is blocked. Otherwise, nothing is blocked.

Allow directive

The Allow directive lets you add exceptions to the list of Disallow directives.

This is particularly useful if you wish to keep specific pages crawlable inside a subdirectory that is otherwise blocked from crawling.

User-agent: *

Allow: /nothing-good-in-here/except-this-one-page

Disallow: /nothing-good-in-here/

Using Allow in this manner will block the following URLs:

http://mywebsite.com/nothing-good-in-here/

http://mywebsite.com/nothing-good-in-here/somepage

http://mywebsite.com/nothing-good-in-here/otherpage

http://mywebsite.com/nothing-good-in-here/?x=y

And allow these URLs to be crawled:

http://mywebsite.com/nothing-good-in-here/except-this-one-page

http://mywebsite.com/nothing-good-in-here/except-this-one-page-in-the-email

http://mywebsite.com/nothing-good-in-here/except-this-one-page/list-of-examples

http://mywebsite.com/nothing-good-in-here/except-this-one-page?a=b&c=d

Like the Disallow directive, Allow uses a simple text match. The text after the “Allow:” is matched against the beginning of the URL path. If there is a match, the page will be allowed even when there is a disallow somewhere else that would typically block it.

Wildcard operator (*)

You can use a wildcard to make a directive match variable text. This means the directive will apply no matter what appears in the wildcard’s place. Here’s an example:

Disallow: /users/*/settings

The * (asterisk) tells the crawler to match any sequence of characters at that point. The above directive will block all of the following URLs:

http://mywebsite.com/users/john/settings

http://mywebsite.com/users/robert/settings

http://mywebsite.com/users/john/extra/directory/levels/settings

http://mywebsite.com/users/john/search?q=/settings

http://mywebsite.com/users/john/settings-for-your-table

End-of-string operator ($)

Another useful extension is the end-of-string operator:

Disallow: /useless-page$

The $ symbol indicates that the URL must end at that point. This directive will block the following URL:

http://mywebsite.com/useless-page

But it will not block these pages:

http://mywebsite.com/useless-pages-and-how-to-avoid-creating-them

http://mywebsite.com/useless-page/

http://mywebsite.com/useless-page?a=b

Blocking everything

In some situations, you would want to block off your entire website from crawlers. Perhaps your site is still under construction and not ready for the limelight.

Or maybe you have two or more websites, and one of them is meant to be a mirror site for your homepage.

In any case, using these lines will keep your entire site off-limits to crawlers, so it can’t be searched on Google:

User-agent: *

Disallow: /

Sitemap directive

This is an optional directive, but almost every robots.txt file will include a sitemap directive:

Sitemap: http://mywebsite.com/sitemap.xml

This line indicates the location of your sitemap file. Pointing crawlers to the sitemap helps them discover your pages faster, which cuts down crawling time.

This is especially useful if your website is newly constructed and has a low crawl budget, so it is an excellent directive to include if your site has an XML sitemap.

Things to avoid with robots.txt

Having a well-optimised robots.txt is all well and good. But there are some fatal mistakes that can sabotage your SEO efforts.

The worst that can happen is your website becomes invisible to Google, which is always bad news for any business in today’s world.

Let’s look at some of the common errors people make with their robots.txt:

Placing the robots.txt file in a subdirectory

If you put your robots.txt in any location other than the root of the site directory, it will not work, because crawlers only look for the file at the site root. It also means that if you don’t have access to the site root, you can’t use robots.txt.

Therefore, for situations where you only have access to a subdirectory and not the root folder, use meta tags instead.

A robots meta tag can also keep pages out of the search results and doesn’t require root access; see the example below.
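
As a brief illustration, a robots meta tag looks like this; it goes inside the <head> of each page you want to keep out of the search results:

<meta name="robots" content="noindex, nofollow" />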

Forgetting to disable site-wide blocking

After the launch of the site, some webmasters forget to delete their site-wide Disallow directive.

Always remember to double check your robots.txt file before your site goes live, or your site will disappear from Google.

Using secrecy terms in your directories

Sometimes we have pieces of information that we want to hide from prying eyes. But don’t name your subdirectories with obvious words like “secrets” or “passwords”.

Why? Search engines are not the only ones with crawlers. Spyware uses crawlers too, usually for malicious activities like stealing email addresses.

When you put “secrets” in your robots.txt, it’s like wearing gold jewellery in the wrong part of town. You’ll raise red flags for malicious crawlers and draw unwanted attention to your website.

Blocking a directory instead of a page

If your URL directives are entered wrongly, you may end up accidentally blocking off an entire section of your website instead of just a page, as the example below shows.
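
For instance, in the sketch below (the /blog paths are purely hypothetical), the first directive blocks every URL whose path starts with /blog, which takes out the entire section, while the second uses the $ operator from earlier to block only the single page /blog/old-post:

Disallow: /blog

Disallow: /blog/old-post$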

Once again, this can be avoided if you double-check the entries in your robots.txt file.

Inconsistent upper and lower case

Remember, paths are case sensitive.

For example:

Disallow: /home/

This blocks “/home/” but it doesn’t block “/Home/” or “/HOME/”.

This means that if you wish to block all three variations, you need to create a separate disallow line for each of them:

Disallow: /home/

Disallow: /Home/

Disallow: /HOME/

‘User-agent’ line mistakes

For starters, the user-agent line is vital to getting robots.txt to work. Without it, the entire file is useless.

Also, let’s say you have three directories that need to be blocked off from all crawlers, and one page inside them that you want only Bing’s crawler to reach.

The obvious (but incorrect) approach might be to try something like this:

User-agent: *

Disallow: /admin/

Disallow: /private/

Disallow: /dontcrawl/

User-agent: Bingbot

Allow: /dontcrawl/exception

This file actually allows Bing to crawl everything on the site.

Bingbot (and most other crawlers) will only obey the rules under the most specific matching user-agent line and will ignore all others.

In this example, it will obey only the rules directly under “User-agent: Bingbot” and will ignore the rules under “User-agent: *”.

To accomplish this goal, you need to repeat the same disallow rules for each user-agent block, like this:

User-agent: *

Disallow: /admin/

Disallow: /private/

Disallow: /dontcrawl/

User-agent: Bingbot

Disallow: /admin/

Disallow: /private/

Disallow: /dontcrawl/

Allow: /dontcrawl/exception

Misusing Crawl-Delay

The crawl-delay directive tells a crawler to wait a specified amount of time (in seconds) between requests.

This is useful when you are running a very large website whose server struggles under heavy crawling. Other than that, the use of crawl-delay is limited, so it should be avoided if possible.
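
As a minimal sketch, the block below asks crawlers to wait ten seconds between requests; keep in mind that not every crawler honours this directive (Google’s crawler, for instance, ignores crawl-delay):

User-agent: *

Crawl-delay: 10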

Disallow vs Allow: Which to use?

When in doubt, always stick to using disallow for your robots.txt.

Disallowing a page is the easiest way to try and prevent bots from crawling it directly.

Disallow isn’t foolproof, either: in situations where the page has been linked from an external source, search engines can still pick up and index the URL.

Also, illegitimate bots will still crawl and index the content.

Using Robots.txt to Block Private Content

Some private content, such as PDFs or ‘thank you’ pages, can remain reachable via a Google search even when you try to point crawlers away from it.

To prevent cases such as these from happening, the best and easiest method is to use the disallow directive and place all of your private content behind a login page, along the lines of the sketch below.
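
As a rough sketch (the /members/ path is purely hypothetical), the robots.txt side of that setup might look like the block below, with the login requirement handled by your website itself:

User-agent: *

Disallow: /members/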

Of course, this adds an extra step for your users, but your private content will remain protected.

Using Robots.txt to Hide Duplicate Content

Duplicate content should be avoided as much as possible. But there are occasions where you have no choice but to use the same content.

Fortunately, Google and the other search engines are usually smart enough to know when you are genuinely using duplicate content for a legitimate purpose and not to game the system.

Despite that, there is still a chance that your content may be flagged as an attempt to deceive Google.

So here are three ways to deal with this kind of content:

Rewrite the Content – Creating new and relevant content signals to the search engines that your website is a trusted source.

301 Redirect – A 301 redirect tells search engines that a page has moved to another location. Place a 301 on the duplicate page to send visitors (and crawlers) to the original content on the site.

Rel=“canonical” – This is a tag that informs Google of the original location of duplicated content (see the example below). If you are running an e-commerce website where the CMS often generates duplicate versions of the same URL, this will be very useful.
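
As a brief illustration (the URL below uses the article’s example domain with a made-up path), the canonical tag sits in the <head> of the duplicate page and points at the original version:

<link rel="canonical" href="http://mywebsite.com/original-product-page" />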

Other tips for using Robots.txt

Let’s take a look at a few more tips about robots.txt that may help you in your SEO strategy.

Misspelling directives

Nowadays, Google’s crawler is actually intelligent enough to recognise common misspellings of directives. For example, even if you spell it ‘dissallow,’ Google will still pick it up as ‘disallow’.

Nevertheless, you shouldn’t neglect your spelling, as not every crawler can detect misspelt directives.

Length of the line

When two rules conflict, Google prioritises directives based on the length of the matching path: the longer, more specific line wins. For example, take a look at this block:

User-agent: *

Allow: /userdir/homepage

Disallow: /userdir/

As the allow line is longer (and therefore more specific) than the disallow line, Google’s crawler will obey the allow line for /userdir/homepage, while the rest of /userdir/ remains blocked.

Blocking a specific query parameter

Let’s say you want to block off all URLs that include the query parameter “pd,” such as:

http://mywebsite.com/somepage?pd=123

http://mywebsite.com/somepage?a=b&pd=123

You might think of doing something like this:

Disallow: /*pd=

This will block out the URLs you want, but it will also block off all other query parameters that end with “pd”:

http://mywebsite.com/users?userpd=a0f3e8201b

http://mywebsite.com/auction?num=9172&bpd=1935.00

So how do you block “pd” without blocking “userpd” or “bpd”?

If you know “pd” will always be the first parameter, use a question mark, like this:

Disallow: /*?pd=

This directive will block:

http://mywebsite.com/somepage?pd=123

But it will not block:

http://mywebsite.com/somepage?a=b&pd=123

If you know “pd” will never be the first parameter, use an ampersand, like this:

Disallow: /*&pd=

This directive will block:

http://mywebsite.com/somepage?a=b&pd=123

But it will not block:

http://mywebsite.com/somepage?pd=123

The safest approach is to do both:

Disallow: /*?pd=

Disallow: /*&pd=

Unfortunately, there is no single way to match both cases with just one line.

Blocking URLs that contain unsafe characters

Let’s say you need to block a URL that contains characters that are not URL safe.

This can often occur when server-side template code is exposed to the web by accident.

For example:

http://mywebsite.com/search?q=<% var_name %>

When you attempt to block that URL like this, it won’t work:

User-agent: *

Disallow: /search?q=<% var_name %>

Any crawler will automatically URL-encode any characters that are not URL-safe.

Why? Because the URL the crawler actually requests is this one:

http://mywebsite.com/search?q=%3C%%20var_name%20%%3E

Characters such as less-than or greater-than signs, double-quotes, single-quotes, and non-ASCII characters will be encoded.

The right way to block a URL containing unsafe characters is to block the encoded version:

User-agent: *

Disallow: /search?q=%3C%%20var_name%20%%3E

The quickest way to get the encoded version of the URL is to copy and paste the URL into your web browser’s address bar and then copy it back out; the browser will encode the unsafe characters for you.

In conclusion

You might feel overwhelmed by the amount of code or text that is required to build your robots.txt. But don’t worry, it is not as complicated as it looks.

If you break it down, there are only a few rules that you must follow. This makes things much less confusing.

So as long as you follow the rules, you will be able to improve your SEO with your robots.txt without screwing anything up.

About Murray Dare

Murray Dare is a Marketing Consultant, Strategist and Director at Dare Media. Murray helps UK businesses find better ways to connect with their audiences through targeted content marketing strategies.