What is a robots.txt file?
A robots.txt file tells search engine crawlers which pages or files the crawler can or can't
request from your site.
What is robots.txt used for?
Robots.txt is used primarily to manage crawler traffic to your site, and occasionally to keep a
page off Google, depending on the file type. Remember that you shouldn't use robots.txt to
block access to private content: use proper authentication instead. URLs disallowed by the
robots.txt file might still be indexed without being crawled, and the robots.txt file can be
viewed by anyone, potentially disclosing the location of your private content.
Format and location rules:
● The file must be named robots.txt; the name is case-sensitive (not Robots.txt, robots.TXT, or otherwise).
● The robots.txt file must be located at the root of the website host that it applies to (for example, http://www.example.com/robots.txt).
● The /robots.txt file is a publicly available file.
● A robots.txt file can apply to subdomains (for example,
http://website.example.com/robots.txt) or to non-standard ports (for example,
http://example.com:8181/robots.txt).
● Use # to leave comments in your robots.txt file.
● robots.txt must be an ASCII or UTF-8 encoded plain text file; no other character encodings are permitted.
● A robots.txt file consists of one or more rules.
● Each rule consists of multiple directives (instructions), one directive per line.
● A rule gives the following information:
○ Who the rule applies to (the user agent)
○ Which directories or files that agent can access, and/or
○ Which directories or files that agent cannot access.
● Rules are processed from top to bottom, and a user agent can match only one rule
set, which is the first, most-specific rule that matches a given user agent.
● Structure your robots.txt properly, in this order: User-agent → Disallow → Allow → Host → Sitemap (a short sketch follows this list). This way, search engine spiders process the directives, categories, and web pages in the appropriate order.
● The default assumption is that a user agent can crawl any page or directory not
blocked by a Disallow: rule.
● Rules are case-sensitive. For instance, Disallow: /file.asp applies to
http://www.example.com/file.asp, but not http://www.example.com/FILE.asp.
● * is a wildcard that represents any sequence of characters.
● "User-agent: *" means the rule applies to all robots.
● $ matches the end of the URL.
● "Disallow: /" tells the robot that it should not visit any pages on the site.
● Create separate robots.txt files for different subdomains. For example,
“hubspot.com” and “blog.hubspot.com” have individual files with directory- and
page-specific directives.
● Don’t rely on robots.txt for security purposes. Use passwords and other security
mechanisms to protect your site from hacking, scraping, and data fraud.
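As referenced in the structure rule above, a minimal sketch of that ordering might look like this (the paths and sitemap URL are illustrative; Host is a legacy directive recognized mainly by Yandex and is usually omitted):
User-agent: *
Disallow: /private/
Allow: /private/public-page.html
# Host: www.example.com (legacy, Yandex-specific)
Sitemap: https://www.example.com/sitemap.xml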
What to Hide With Robots.txt
Robots.txt files can be used to exclude certain directories, categories, and pages from
search. Here are some pages you should hide using a robots.txt file:
● Pages with duplicate content
● Pagination pages
● Dynamic product and service pages
● Account pages
● Admin pages
● Shopping cart
● Chats
● Thank-you pages
Basically, it looks like this:
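(The following is an illustrative sketch; the account, cart, and search-parameter paths are assumptions and will differ from site to site.)
User-agent: Googlebot
Disallow: /account/
Disallow: /cart/
Disallow: /*?s=
Disallow: /*?sort=
Disallow: /*?price=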
In the example above, we instruct Googlebot to avoid crawling and indexing all pages related
to user accounts, cart, and multiple dynamic pages that are generated when users look for
products in the search bar or sort them by price, and so on.
For example:
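(An illustrative WordPress-flavored sketch; the exact paths are assumptions.)
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-login.php
Disallow: /category/uncategorized/
Allow: /wp-content/
Allow: /*.js
Allow: /*.css
Allow: /blog/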
In the above example, WordPress system pages and specific categories are disallowed, but wp-content files, JS plugins, CSS styles, and the blog are allowed. This approach ensures that spiders crawl and index useful code and categories first.
The following directives are used in robots.txt files:
User-agent:
● [Required, one or more per rule] The name of a search engine robot (web crawler
software) that the rule applies to.
● This is the first line for any rule.
● Using an asterisk (*) matches all crawlers except the various AdsBot crawlers, which
must be named explicitly.
● Google Crawlers (User Agents) -
https://support.google.com/webmasters/answer/1061943
Disallow:
● [At least one or more Disallow or Allow entries per rule]
● A directory or page, relative to the root domain, that should not be crawled by the
user agent.
● If a page, it should be the full page name; if a directory, it should end in a / mark.
● Supports the * wildcard for a path prefix, suffix, or entire string.
Allow:
● [At least one or more Disallow or Allow entries per rule]
● A directory or page, relative to the root domain, that should be crawled by the user
agent just mentioned.
● If a page, it should be the full page name
● If a directory, it should end in a / mark
● Supports the * wildcard for a path prefix, suffix, or entire string.
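A short sketch combining the User-agent, Disallow, and Allow directives (the crawler name and paths are illustrative):
User-agent: Googlebot
Disallow: /archive/
Allow: /archive/latest.html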
Crawl-delay: The number of seconds a crawler should wait before loading and crawling page content. Note that Googlebot does not acknowledge this directive, but crawl rate can be set in Google Search Console. Bing defines crawl-delay as the size of a time window (from 1 to 30 seconds) during which Bingbot will access a website only once.
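For example (Bingbot is shown because Googlebot ignores this directive; the 10-second value is arbitrary):
User-agent: Bingbot
Crawl-delay: 10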
Sitemap:
● [Optional, zero or more per file] The location of a sitemap for this website.
● Must be a fully-qualified URL; Google doesn't assume or check
http/https/www/non-www alternates.
● Sitemaps are a good way to indicate which content Google should crawl, as opposed
to which content it can or cannot crawl.
Sitemap: https://example.com/sitemap.xml
Sitemap: http://www.example.com/sitemap.xml
Robots.txt Wildcard Directive
Search engines such as Google and Bing allow the use of wildcards in robots.txt files so that
you don't have to list a multitude of URLs that contain the same characters.
Disallow: *mobile
The above directive would block crawlers from accessing any URL on your website that contains
the term 'mobile', such as:
● /mobile
● /services/mobile-optimisation
● /blog/importance-of-mobile-ppc-bidding
● /images/mobile.jpg
● /phone/mobile34565.html
Another wildcard character that you can use in your robots.txt is "$".
Disallow: *.gif$
The example directive blocks crawlers from accessing any URL that ends with ".gif". Wildcards
can be extremely powerful and should be used carefully: without the trailing $, the pattern
*.gif would also block any file path that merely contains ".gif", such as
/my-files.gif/blog-posts.
Basic format:
User-agent: [user-agent name]
Disallow: [URL string not to be crawled]
Examples of Robots.txt
A.
# Rule 1
User-agent: Googlebot
Disallow: /nogooglebot/
# Rule 2
User-agent: *
Allow: /
Sitemap: http://www.example.com/sitemap.xml
Explanation:
1. The user agent named "Googlebot" should not crawl the folder
http://example.com/nogooglebot/ or any of its subdirectories.
2. All other user agents can access the entire site. (This could have been omitted and
the result would be the same, as full access is the assumption.)
3. The site's Sitemap file is located at http://www.example.com/sitemap.xml
B. Block only Googlebot
User-agent: Googlebot
Disallow: /
C. Block Googlebot and Adsbot
User-agent: Googlebot
User-agent: AdsBot-Google
Disallow: /
D. Block all but AdsBot crawlers (AdsBot crawlers are not matched by * and must be named explicitly)
User-agent: *
Disallow: /
E.
A robots.txt file consists of one or more blocks of rules, each beginning with a User-agent
line that specifies the target of the rules. Here is a file with two rules; inline comments
explain each rule:
# Block googlebot from example.com/directory1/... and example.com/directory2/...
# but allow access to directory2/subdirectory1/...
# All other directories on the site are allowed by default.
User-agent: googlebot
Disallow: /directory1/
Disallow: /directory2/
Allow: /directory2/subdirectory1/
F. Disallow crawling of the entire website
User-agent: *
Disallow: /
G. Disallow crawling of a directory and its contents by following the directory name with a
forward slash.
User-agent: *
Disallow: /calendar/
Disallow: /junk/
H. Allow access to only a single crawler (Googlebot-News)
User-agent: Googlebot-news
Allow: /
User-agent: *
Disallow: /
I. Allow access to all but a single crawler (Googlebot-News)
User-agent: Googlebot-news
Disallow: /
User-agent: *
Allow: /
J. Disallow crawling of a single webpage by listing the page after the slash:
User-agent: *
Disallow: /private_file.html
K. Block a specific image from Google Images:
User-agent: Googlebot-Image
Disallow: /images/dogs.jpg
L. Block all images on your site from Google Images:
User-agent: Googlebot-Image
Disallow: /
M. Disallow crawling of files of a specific file type (for example, .gif):
User-agent: Googlebot
Disallow: /*.gif$
N. Disallow crawling of the entire site, but show AdSense ads on those pages (disallow all
web crawlers other than Mediapartners-Google; this implementation hides your pages from
search results, but the Mediapartners-Google web crawler can still analyze them to decide
what ads to show visitors to your site)
User-agent: *
Disallow: /
User-agent: Mediapartners-Google
Allow: /
O. Match URLs that end with a specific string (use $; for instance, the sample code blocks
any URLs that end with .xls)
User-agent: Googlebot
Disallow: /*.xls$
P. Allowing all web crawlers access to all content
User-agent: *
Disallow:
Q. Block all web crawlers from all content (keep all robots out of the website)
User-agent: *
Disallow: /
R. Keep all robots out of three directories
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
S. Keep all robots away from one specific file
User-agent: *
Disallow: /directory/file.html
T. Demonstrating how comments can be used (comments appear after the "#" symbol, either at
the start of a line or after a directive)
User-agent: * # match all bots
Disallow: / # keep them out
U. Demonstrate multiple user-agents
User-agent: googlebot # all Google services
Disallow: /private/ # disallow this directory
User-agent: googlebot-news # only the news service
Disallow: / # disallow everything
User-agent: * # any robot
Disallow: /something/ # disallow this directory
V. Blocking a specific web crawler from a specific folder
User-agent: Googlebot
Disallow: /example-subfolder/
W. Mark crawl delay for all crawlers
User-agent: *
Crawl-delay: 10
X. Allow a single robot
User-agent: Google
Disallow:
User-agent: *
Disallow: /
What Are Meta Robots Tags?
Meta robots tags (REP tags) are indexer directives that tell search engine
spiders how to crawl and index specific pages on your website. They enable SEO
professionals to target individual pages and instruct crawlers on what to crawl and index and
what to skip. There are two types of robots meta directives: those that are part of the HTML
page (like the meta robots tag) and those that the web server sends as HTTP headers (such
as the x-robots-tag).
<meta name="robots" content="[PARAMETER]">
While the general <meta name="robots" content="[PARAMETER]"> tag is standard, you can
also provide directives to specific crawlers by replacing "robots" with the name of a
specific user-agent. For example, to target a directive specifically to Googlebot, you'd use the
following code:
<meta name="googlebot" content="[DIRECTIVE]">
Want to use more than one directive on a page? As long as they're targeted to the same
"robot" (user-agent), multiple directives can be included in one meta tag; just
separate them with commas. Here's an example:
<meta name="robots" content="noimageindex, nofollow, nosnippet">
How to Use Meta Robots Tags?
There are only four major tag parameters:
● Follow
● Index
● Nofollow
● Noindex
Index, Follow: Allow search bots to index a page and follow its links
If there is no robots <META> tag, the default is "INDEX,FOLLOW", so there's no need to spell
that out.
Noindex, Nofollow: Prevent search bots from indexing a page and following its links
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
Index, Nofollow: Allow search engines to index a page but hide its links from search spiders
<META NAME="ROBOTS" CONTENT="INDEX, NOFOLLOW">
Noindex, Follow: Exclude a page from search but allow following its links (the followed links
can still pass link equity, which can help rankings)
<META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">
Block all search engine crawlers from indexing a page
<meta name="robots" content="noindex">
Block non-search crawlers, such as AdsBot-Google
<meta name="AdsBot-Google" content="noindex">)
Prevent only Googlebot from indexing your page
<meta name="googlebot" content="noindex">
Show a page in Google's web search results, but not in Google News
<meta name="googlebot-news" content="noindex">
If you need to specify multiple crawlers individually, use multiple robots meta tags
<meta name="googlebot" content="noindex">
<meta name="googlebot-news" content="nosnippet">
The following directive means NOINDEX, NOFOLLOW
<META NAME="ROBOTS" CONTENT="NONE">
Here are some of the rarely used ones:
none
noarchive
nosnippet
unavailable_after
noimageindex
nocache
noodp
notranslate
Indexation-controlling parameters:
● Noindex: Tells a search engine not to index a page. Do not show this page in search
results and do not show a "Cached" link in search results.
● Index: Tells a search engine to index a page. Note that you don’t need to add this
meta tag; it’s the default.
● Follow: Even if the page isn’t indexed, the crawler should follow all the links on a
page and pass equity to the linked pages.
● Nofollow: Tells a crawler not to follow any links on a page or pass along any link
equity.
● Noimageindex: Tells a crawler not to index any images on a page.
● None: Equivalent to using both the noindex and nofollow tags simultaneously.
● Noarchive: Search engines should not show a cached link to this page on a SERP.
● Nocache: Same as noarchive, but only used by Internet Explorer and Firefox.
● Nosnippet: Tells a search engine not to show a snippet (i.e. meta description) of this
page on a SERP.
● Notranslate: Do not offer translation of this page in search results.
● Noodp/noydir [OBSOLETE]: Prevents search engines from using a page's DMOZ
description as the SERP snippet for this page. However, DMOZ was retired in early
2017, making this tag obsolete.
● NOODP: Prevents the Open Directory Project description for the page from replacing
the description manually set for this page.
● Unavailable_after: Search engines should no longer index this page after a particular
date.
Basic Rules for Setting Up Meta Robots Tags
● Like any <META> tag, it should be placed in the HEAD section of an HTML page.
● Be consistent with case. Google and other search engines may recognize attributes, values,
and parameters in both uppercase and lowercase, and you can switch between the
two if you want. I strongly recommend that you stick to one option to improve code
readability.
● Avoid multiple <meta> robots tags; this prevents conflicts in your code. Instead, use multiple
values in a single <meta> tag, like this: <meta name="robots" content="noindex,
nofollow">.
● Don't use conflicting meta tags to avoid indexing mistakes. For example, if you have
several code lines with meta tags like this <meta name="robots" content="follow">
and this <meta name="robots" content="nofollow">, only "nofollow" will be taken
into account. This is because crawlers give priority to restrictive values.
● The basic rule here is that restrictive values take precedence. So, if you "allow" indexing of
a specific page in a robots.txt file but accidentally "noindex" it in the <meta>, spiders
won't index the page.
● Also, remember: if you want to give instructions specifically to Google, use "googlebot"
instead of "robots" as the meta name, like this: <meta name="googlebot"
content="nofollow">. It works like "robots" but applies only to Google's crawler and is ignored by the other search crawlers.
X-Robots-Tag
While the meta robots tag allows you to control indexing behavior at the page level, the
x-robots-tag can be included as part of the HTTP header to control indexing of a page as a
whole, as well as very specific elements of a page.
To use the x-robots-tag, you'll need to have access to your website's header.php,
.htaccess, or server configuration file. From there, add your x-robots-tag markup to the
server configuration, including any parameters.
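For instance, a minimal Apache sketch that could be added to .htaccess or the server configuration (this assumes the mod_headers module is enabled; the PDF pattern is only an illustration):
<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex, nofollow"
</FilesMatch>
This would tell crawlers not to index, or follow links in, any PDF file served from that location.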
You do not need to use both meta robots and the x-robots-tag on the same page – doing so
would be redundant.
Here are a few use cases for why you might employ the
x-robots-tag:
● Controlling the indexation of content not written in HTML (like Flash or video)
● Blocking indexation of a particular element of a page (like an image or video), but not
of the entire page itself
● Controlling indexation if you don’t have access to a page’s HTML (specifically, to the
<head> section) or if your site uses a global header that cannot be changed
● Adding conditional rules for whether or not a page should be indexed (e.g. if a user has
commented over 20 times, index their profile page; a sketch follows this list)
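As a sketch of that last use case, here is a hypothetical Python/Flask route (the 20-comment threshold and the get_comment_count helper are illustrative assumptions, not part of any real API):

from flask import Flask, make_response

app = Flask(__name__)

def get_comment_count(username):
    # Hypothetical lookup; replace with a query against your own data store.
    return 0

@app.route("/profile/<username>")
def profile(username):
    resp = make_response(f"Profile page for {username}")
    # Only allow indexing of profiles that belong to active commenters.
    if get_comment_count(username) < 20:
        resp.headers["X-Robots-Tag"] = "noindex"
    return resp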
Using the X-Robots-Tag HTTP header
Example of an HTTP response with an X-Robots-Tag instructing crawlers not to index a page:
HTTP/1.1 200 OK
Date: Tue, 25 May 2010 21:42:43 GMT
(…)
X-Robots-Tag: noindex
(…)
Multiple X-Robots-Tag headers can be combined within the HTTP response, or you can
specify a comma-separated list of directives.
Here's an example of an HTTP header response which has a noarchive X-Robots-Tag
combined with an unavailable_after X-Robots-Tag.
HTTP/1.1 200 OK
Date: Tue, 25 May 2010 21:42:43 GMT
(…)
X-Robots-Tag: noarchive
X-Robots-Tag: unavailable_after: 25 Jun 2010 15:00:00 PST
(…)
The following set of X-Robots-Tag HTTP headers can be used to conditionally allow showing
of a page in search results for different search engines:
HTTP/1.1 200 OK
Date: Tue, 25 May 2010 21:42:43 GMT
(…)
X-Robots-Tag: googlebot: nofollow
X-Robots-Tag: otherbot: noindex, nofollow
(…)