What is a robots.txt file?
A robots.txt file tells search engine crawlers which pages or files the crawler can or can't
request from your site.
What is robots.txt used for?
Robots.txt is used primarily to manage crawler traffic to your site, and occasionally to keep a
page off Google, depending on the file type. Remember that you shouldn't use robots.txt to
block access to private content: use proper authentication instead. URLs disallowed by the
robots.txt file might still be indexed without being crawled, and the robots.txt file can be
viewed by anyone, potentially disclosing the location of your private content.
Format and location rules:
● The file must be named robots.txt; the name is case-sensitive (not Robots.txt, robots.TXT, or otherwise).
● The robots.txt file must be located at the root of the website host that it applies to (for example, http://www.example.com/robots.txt).
● The /robots.txt file is a publicly available file.
● A robots.txt file can apply to subdomains (for example,
http://website.example.com/robots.txt) or to non-standard ports (for example,
http://example.com:8181/robots.txt).
● Use # to leave comments in your robots.txt file.
● robots.txt must be an ASCII or UTF-8 encoded plain text file; no other character encodings are permitted.
● A robots.txt file consists of one or more rules.
● Each rule consists of multiple directives (instructions), one directive per line.
● A rule gives the following information:
○ Who the rule applies to (the user agent)
○ Which directories or files that agent can access, and/or
○ Which directories or files that agent cannot access.
● Rules are processed from top to bottom, and a user agent can match only one rule
set, which is the first, most-specific rule that matches a given user agent.
● Structure your robots.txt properly, in this order: User-agent → Disallow → Allow → Host → Sitemap (a short sketch follows this list). This way, search engine spiders process the directives, categories, and web pages in the appropriate order.
● The default assumption is that a user agent can crawl any page or directory not
blocked by a Disallow: rule.
● Rules are case-sensitive. For instance, Disallow: /file.asp applies to
http://www.example.com/file.asp, but not http://www.example.com/FILE.asp.
● * is a wildcard that represents any sequence of characters.
● "User-agent: *" means the rule applies to all robots.
● $ matches the end of the URL.
● "Disallow: /" tells the robot that it should not visit any pages on the site.
● Create separate robots.txt files for different subdomains. For example,
“hubspot.com” and “blog.hubspot.com” have individual files with directory- and
page-specific directives.
● Don’t rely on robots.txt for security purposes. Use passwords and other security
mechanisms to protect your site from hacking, scraping, and data fraud.
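As referenced in the structure rule above, a minimal sketch of that ordering might look like this (the paths and sitemap URL are illustrative; Host is a legacy directive recognized mainly by Yandex and is usually omitted):
User-agent: *
Disallow: /private/
Allow: /private/public-page.html
# Host: www.example.com (legacy, Yandex-specific)
Sitemap: https://www.example.com/sitemap.xml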
What to Hide With Robots.txt
Robots.txt files can be used to exclude certain directories, categories, and pages from
search. Here are some pages you should hide using a robots.txt file:
● Pages with duplicate content
● Pagination pages
● Dynamic product and service pages
● Account pages
● Admin pages
● Shopping cart
● Chats
● Thank-you pages
Basically, it looks like this:
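(The following is an illustrative sketch; the account, cart, and search-parameter paths are assumptions and will differ from site to site.)
User-agent: Googlebot
Disallow: /account/
Disallow: /cart/
Disallow: /*?s=
Disallow: /*?sort=
Disallow: /*?price=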
In the example above, we instruct Googlebot to avoid crawling and indexing all pages related
to user accounts, cart, and multiple dynamic pages that are generated when users look for
products in the search bar or sort them by price, and so on.
For example:
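(An illustrative WordPress-flavored sketch; the exact paths are assumptions.)
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-login.php
Disallow: /category/uncategorized/
Allow: /wp-content/
Allow: /*.js
Allow: /*.css
Allow: /blog/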
In the above example, WordPress system pages and specific categories are disallowed, but wp-content files, JS plugins, CSS styles, and the blog are allowed. This approach ensures that spiders crawl and index useful code and categories first.
The following directives are used in robots.txt files:
User-agent:
● [Required, one or more per rule] The name of a search engine robot (web crawler
software) that the rule applies to.
● This is the first line for any rule.
● Using an asterisk (*) matches all crawlers except the various AdsBot crawlers, which
must be named explicitly.
● Google Crawlers (User Agents) -
https://support.google.com/webmasters/answer/1061943
Disallow:
● [At least one or more Disallow or Allow entries per rule]
● A directory or page, relative to the root domain, that should not be crawled by the
user agent.
● If a page, it should be the full page name; if a directory, it should end in a / mark.
● Supports the * wildcard for a path prefix, suffix, or entire string.
Allow:
● [At least one or more Disallow or Allow entries per rule]
● A directory or page, relative to the root domain, that should be crawled by the user
agent just mentioned.
● If a page, it should be the full page name
● If a directory, it should end in a / mark
● Supports the * wildcard for a path prefix, suffix, or entire string.
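A short sketch combining the User-agent, Disallow, and Allow directives (the crawler name and paths are illustrative):
User-agent: Googlebot
Disallow: /archive/
Allow: /archive/latest.html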
Crawl-delay: The number of seconds a crawler should wait before loading and crawling page content. Note that Googlebot does not acknowledge this directive, but crawl rate can be set in Google Search Console. Bing defines crawl-delay as the size of a time window (from 1 to 30 seconds) during which Bingbot will access a website only once.
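For example (Bingbot is shown because Googlebot ignores this directive; the 10-second value is arbitrary):
User-agent: Bingbot
Crawl-delay: 10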
Sitemap:
● [Optional, zero or more per file] The location of a sitemap for this website.
● Must be a fully-qualified URL; Google doesn't assume or check
http/https/www/non-www alternates.
● Sitemaps are a good way to indicate which content Google should crawl, as opposed
to which content it can or cannot crawl.
Sitemap: https://example.com/sitemap.xml
Sitemap: http://www.example.com/sitemap.xml
Robots.txt Wildcard Directive
Search engines such as Google and Bing allow the use of wildcards in robots.txt files so that
you don't have to list a multitude of URLs that contain the same characters.
Disallow: *mobile
The above directive would block crawlers from accessing any URL on your website that contains
the term 'mobile', such as:
● /mobile
● /services/mobile-optimisation
● /blog/importance-of-mobile-ppc-bidding
● /images/mobile.jpg
● /phone/mobile34565.html
Another wildcard character that you can use in your robots.txt is "$".
Disallow: *.gif$
The example directive blocks crawlers from accessing any URL that ends with ".gif". Wildcards
can be extremely powerful and should be used carefully: without the trailing $, the pattern
*.gif would also block any file path that merely contains ".gif", such as
/my-files.gif/blog-posts.
Basic format:
User-agent: [user-agent name]
Disallow: [URL string not to be crawled]
Examples of Robots.txt
A.
# Rule 1
User-agent: Googlebot
Disallow: /nogooglebot/
# Rule 2
User-agent: *
Allow: /
Sitemap: http://www.example.com/sitemap.xml
Explanation:
1. The user agent named "Googlebot" should not crawl the folder
http://example.com/nogooglebot/ or any of its subdirectories.
2. All other user agents can access the entire site. (This could have been omitted and
the result would be the same, as full access is the assumption.)
3. The site's Sitemap file is located at http://www.example.com/sitemap.xml
B. Block only Googlebot
User-agent: Googlebot
Disallow: /
C. Block Googlebot and Adsbot
User-agent: Googlebot
User-agent: AdsBot-Google
Disallow: /
D. Block all but AdsBot crawlers (AdsBot crawlers are not matched by * and must be named explicitly)
User-agent: *
Disallow: /
E.
A robots.txt file consists of one or more blocks of rules, each beginning with a User-agent
line that specifies the target of the rules. Here is a file with two rules; inline comments
explain each rule:
# Block googlebot from example.com/directory1/... and example.com/directory2/...
# but allow access to directory2/subdirectory1/...
# All other directories on the site are allowed by default.
User-agent: googlebot
Disallow: /directory1/
Disallow: /directory2/
Allow: /directory2/subdirectory1/
F. Disallow crawling of the entire website
User-agent: *
Disallow: /
G. Disallow crawling of a directory and its contents by following the directory name with a
forward slash.
User-agent: *
Disallow: /calendar/
Disallow: /junk/
H. Allow access to only a single crawler (Googlebot-News)
User-agent: Googlebot-news
Allow: /
User-agent: *
Disallow: /
I. Allow access to all but a single crawler (Googlebot-News)
User-agent: Googlebot-news
Disallow: /
User-agent: *
Allow: /
J. Disallow crawling of a single webpage by listing the page after the slash:
User-agent: *
Disallow: /private_file.html
K. Block a specific image from Google Images:
User-agent: Googlebot-Image
Disallow: /images/dogs.jpg
L. Block all images on your site from Google Images:
User-agent: Googlebot-Image
Disallow: /
M. Disallow crawling of files of a specific file type (for example, .gif):
User-agent: Googlebot
Disallow: /*.gif$
N. Disallow crawling of the entire site, but show AdSense ads on those pages (disallow all
web crawlers other than Mediapartners-Google; this implementation hides your pages from
search results, but the Mediapartners-Google web crawler can still analyze them to decide
what ads to show visitors to your site)
User-agent: *
Disallow: /
User-agent: Mediapartners-Google
Allow: /
O. Match URLs that end with a specific string (use $; for instance, the sample code blocks
any URLs that end with .xls)
User-agent: Googlebot
Disallow: /*.xls$
P. Allowing all web crawlers access to all content
User-agent: *
Disallow:
Q. Block all web crawlers from all content (keep all robots out of the website)
User-agent: *
Disallow: /
R. Keep all robots out of three directories
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
S. Keep all robots away from one specific file
User-agent: *
Disallow: /directory/file.html
T. Demonstrating how comments can be used (comments appear after the "#" symbol, either at
the start of a line or after a directive)
User-agent: * # match all bots
Disallow: / # keep them out
U. Demonstrate multiple user-agents
User-agent: googlebot # all Google services
Disallow: /private/ # disallow this directory
User-agent: googlebot-news # only the news service
Disallow: / # disallow everything
User-agent: * # any robot
Disallow: /something/ # disallow this directory
V. Blocking a specific web crawler from a specific folder
User-agent: Googlebot
Disallow: /example-subfolder/
W. Mark crawl delay for all crawlers
User-agent: *
Crawl-delay: 10
X. Allow a single robot
User-agent: Google
Disallow:
User-agent: *
Disallow: /
What Are Meta Robots Tags?
Meta robots tags (REP tags) are indexer directives that tell search engine
spiders how to crawl and index specific pages on your website. They enable SEO
professionals to target individual pages and instruct crawlers on what to crawl and index and
what to skip. There are two types of robots meta directives: those that are part of the HTML
page (like the meta robots tag) and those that the web server sends as HTTP headers (such
as the x-robots-tag).
<meta name="robots" content="[PARAMETER]">
While the general <meta name="robots" content="[PARAMETER]"> tag is standard, you can
also provide directives to specific crawlers by replacing "robots" with the name of a
specific user-agent. For example, to target a directive specifically to Googlebot, you'd use the
following code:
<meta name="googlebot" content="[DIRECTIVE]">
Want to use more than one directive on a page? As long as they're targeted to the same
"robot" (user-agent), multiple directives can be included in one meta tag; just
separate them with commas. Here's an example:
<meta name="robots" content="noimageindex, nofollow, nosnippet">
How to Use Meta Robots Tags?
There are only four major tag parameters:
● Follow
● Index
● Nofollow
● Noindex
Index, Follow: Allow search bots to index a page and follow its links
If there is no robots <META> tag, the default is "INDEX,FOLLOW", so there's no need to spell
that out.
Noindex, Nofollow: Prevent search bots from indexing a page and following its links
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
Index, Nofollow: Allow search engines to index a page but hide its links from search spiders
<META NAME="ROBOTS" CONTENT="INDEX, NOFOLLOW">
Noindex, Follow: Exclude a page from search but allow following its links (the followed links
can still pass link equity, which can help rankings)
<META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">
Block all search engine crawlers from indexing a page
<meta name="robots" content="noindex">
Block non-search crawlers, such as AdsBot-Google
<meta name="AdsBot-Google" content="noindex">)
Prevent only Googlebot from indexing your page
<meta name="googlebot" content="noindex">
Show a page in Google's web search results, but not in Google News
<meta name="googlebot-news" content="noindex">
If you need to specify multiple crawlers individually, use multiple robots meta tags
<meta name="googlebot" content="noindex">
<meta name="googlebot-news" content="nosnippet">
The following directive means NOINDEX, NOFOLLOW
<META NAME="ROBOTS" CONTENT="NONE">
Here are some of the rarely used ones:
none
noarchive
nosnippet
unavailable_after
noimageindex
nocache
noodp
notranslate
Indexation-controlling parameters:
● Noindex: Tells a search engine not to index a page. Do not show this page in search
results and do not show a "Cached" link in search results.
● Index: Tells a search engine to index a page. Note that you don’t need to add this
meta tag; it’s the default.
● Follow: Even if the page isn’t indexed, the crawler should follow all the links on a
page and pass equity to the linked pages.
● Nofollow: Tells a crawler not to follow any links on a page or pass along any link
equity.
● Noimageindex: Tells a crawler not to index any images on a page.
● None: Equivalent to using both the noindex and nofollow tags simultaneously.
● Noarchive: Search engines should not show a cached link to this page on a SERP.
● Nocache: Same as noarchive, but only used by Internet Explorer and Firefox.
● Nosnippet: Tells a search engine not to show a snippet (i.e. meta description) of this
page on a SERP.
● Notranslate: Do not offer translation of this page in search results.
● Noodp/noydir [OBSOLETE]: Prevents search engines from using a page's DMOZ
description as the SERP snippet for this page. However, DMOZ was retired in early
2017, making this tag obsolete.
● NOODP: Prevents the Open Directory Project description for the page from replacing
the description manually set for this page.
● Unavailable_after: Search engines should no longer index this page after a particular
date.
Basic Rules for Setting Up Meta Robots Tags
● Like any <META> tag, it should be placed in the HEAD section of an HTML page.
● Be consistent with case. Google and other search engines may recognize attributes, values,
and parameters in both uppercase and lowercase, and you can switch between the
two if you want. I strongly recommend that you stick to one option to improve code
readability.
● Avoid multiple <meta> robots tags; this prevents conflicts in your code. Instead, use multiple
values in a single <meta> tag, like this: <meta name="robots" content="noindex,
nofollow">.
● Don't use conflicting meta tags to avoid indexing mistakes. For example, if you have
several code lines with meta tags like this <meta name="robots" content="follow">
and this <meta name="robots" content="nofollow">, only "nofollow" will be taken
into account. This is because crawlers give priority to restrictive values.
● The basic rule here is that restrictive values take precedence. So, if you "allow" indexing of
a specific page in a robots.txt file but accidentally "noindex" it in the <meta>, spiders
won't index the page.
● Also, remember: if you want to give instructions specifically to Google, use "googlebot"
instead of "robots" as the meta name, like this: <meta name="googlebot"
content="nofollow">. It works like "robots" but applies only to Google's crawler and is ignored by the other search crawlers.
X-Robots-Tag
While the meta robots tag allows you to control indexing behavior at the page level, the
x-robots-tag can be included as part of the HTTP header to control indexing of a page as a
whole, as well as very specific elements of a page.
To use the x-robots-tag, you'll need to have access to your website's header.php,
.htaccess, or server configuration file. From there, add your x-robots-tag markup to the
server configuration, including any parameters.
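For instance, a minimal Apache sketch that could be added to .htaccess or the server configuration (this assumes the mod_headers module is enabled; the PDF pattern is only an illustration):
<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex, nofollow"
</FilesMatch>
This would tell crawlers not to index, or follow links in, any PDF file served from that location.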
You do not need to use both meta robots and the x-robots-tag on the same page – doing so
would be redundant.
Here are a few use cases for why you might employ the
x-robots-tag:
● Controlling the indexation of content not written in HTML (like Flash or video)
● Blocking indexation of a particular element of a page (like an image or video), but not
of the entire page itself
● Controlling indexation if you don’t have access to a page’s HTML (specifically, to the
<head> section) or if your site uses a global header that cannot be changed
● Adding conditional rules for whether or not a page should be indexed (e.g. if a user has
commented over 20 times, index their profile page; a sketch follows this list)
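As a sketch of that last use case, here is a hypothetical Python/Flask route (the 20-comment threshold and the get_comment_count helper are illustrative assumptions, not part of any real API):

from flask import Flask, make_response

app = Flask(__name__)

def get_comment_count(username):
    # Hypothetical lookup; replace with a query against your own data store.
    return 0

@app.route("/profile/<username>")
def profile(username):
    resp = make_response(f"Profile page for {username}")
    # Only allow indexing of profiles that belong to active commenters.
    if get_comment_count(username) < 20:
        resp.headers["X-Robots-Tag"] = "noindex"
    return resp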
Using the X-Robots-Tag HTTP header
Example of an HTTP response with an X-Robots-Tag instructing crawlers not to index a page:
HTTP/1.1 200 OK
Date: Tue, 25 May 2010 21:42:43 GMT
(…)
X-Robots-Tag: noindex
(…)
Multiple X-Robots-Tag headers can be combined within the HTTP response, or you can
specify a comma-separated list of directives.
Here's an example of an HTTP header response which has a noarchive X-Robots-Tag
combined with an unavailable_after X-Robots-Tag.
HTTP/1.1 200 OK
Date: Tue, 25 May 2010 21:42:43 GMT
(…)
X-Robots-Tag: noarchive
X-Robots-Tag: unavailable_after: 25 Jun 2010 15:00:00 PST
(…)
The following set of X-Robots-Tag HTTP headers can be used to conditionally allow showing
of a page in search results for different search engines:
HTTP/1.1 200 OK
Date: Tue, 25 May 2010 21:42:43 GMT
(…)
X-Robots-Tag: googlebot: nofollow
X-Robots-Tag: otherbot: noindex, nofollow
(…)