Robots.txt disallow
It’s very important to know that the “Disallow” directive in your WordPress robots.txt file doesn’t work exactly the same way as the noindex meta tag in a page’s header. Robots.txt blocks crawling, but not necessarily indexing, with the exception of website files such as images and documents. Search engines can still index your “disallowed” pages if they’re linked to from elsewhere.
That’s why Prevent Direct Access Gold no longer uses robots.txt Disallow rules to block your website pages from search indexing. Instead, we use the noindex meta tag, which also helps Google and other search engines correctly distribute your content’s inbound link value across your website.
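For illustration only, a noindex rule is typically applied either as a meta tag inside a page’s head or, for non-HTML files such as PDFs, as an X-Robots-Tag HTTP header. The snippet below is a generic sketch (the Apache part assumes mod_headers is enabled), not the exact markup or configuration the plugin produces:
<!-- In the page's <head>: ask search engines not to index this page -->
<meta name="robots" content="noindex, follow">
# Apache (.htaccess): send a noindex header for PDF files
<FilesMatch "\.pdf$">
Header set X-Robots-Tag "noindex"
</FilesMatch>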
What to include in your WordPress robots.txt?
Yoast suggests keeping your robots.txt clean and not blocking anything, including any of the following:
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-content/plugins/
Disallow: /wp-includes/
WordPress also agrees, saying the ideal robots.txt shouldn’t disallow anything at all. As a matter of fact, the /wp-content/plugins/ and /wp-includes/ directories contain images, JavaScript and CSS files that your themes and plugins most likely use to display your website correctly. Blocking these directories means all scripts, styles and images that come with your plugins and WordPress are blocked as well, making it harder for Google and other search engines’ crawlers to analyze and understand your website content. Likewise, you shouldn’t block your /wp-content/themes/ directory either.
In short, disallowing your WordPress resources, uploads and plugins directories, which many claim protects your website against anyone targeting vulnerable plugins, probably does more harm than good, especially in terms of SEO. You shouldn’t install vulnerable plugins in the first place.
That’s why we’ve removed these rules from your robots.txt by default. If you still want them, you can add them back with our WordPress Robots.txt Integration extension.
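For reference, a minimal “clean” robots.txt in the spirit of these recommendations simply allows everything and points crawlers to your sitemap. The domain and sitemap path below are placeholders:
User-agent: *
Disallow:
Sitemap: https://example.com/sitemap_index.xml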
Sitemap XML
While Yoast highly recommends that you submit your XML sitemap directly to Google Search Console and Bing Webmaster Tools, you may still want to include a Sitemap directive in your robots.txt as a quick alternative that tells other search engines where your sitemap is.
Sitemap: http://preventdirectaccess.com/post-sitemap.xml
Sitemap: http://preventdirectaccess.com/page-sitemap.xml
Sitemap: http://preventdirectaccess.com/author-sitemap.xml
Sitemap: http://preventdirectaccess.com/offers-sitemap.xml
Block access to readme.html, license.txt and wp-config-sample.php files
Security-wise, it’s recommended that you block access to your WordPress readme.html, license.txt and wp-config-sample.php files so that unauthorized people can’t find out which version of WordPress you’re using.
User-agent: *
Disallow: /readme.html
Disallow: /license.txt
Disallow: /wp-config-sample.php
You may also use robots.txt to block specific bots from crawling your website content or specify different rules for different types of bots.
# block Googlebot from crawling the entire website
User-agent: Googlebot
Disallow: /
# block Bingbot from crawling the /refer/ directory
User-agent: Bingbot
Disallow: /refer/
This is how you can stop bots from crawling your WordPress search results:
User-agent: *
Disallow: /?s=
Disallow: /search/
Host and Crawl-delay are other, albeit less popular, robots.txt directives that you may consider using. The Host directive lets you specify the preferred domain of your website (www or non-www); note that only some search engines, most notably Yandex, have ever supported it:
User-agent: *
# we prefer the non-www domain
Host: preventdirectaccess.com
The latter tells crawl-hungry bots from various search engines to wait a number of seconds before each crawl request.
User-agent: *
# please wait 8 seconds before the next crawl
Crawl-delay: 8
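Putting it together, a robots.txt combining the rules discussed above might look like the sketch below. Keep only the lines that apply to your site; the sitemap URL is a placeholder:
User-agent: *
Disallow: /readme.html
Disallow: /license.txt
Disallow: /wp-config-sample.php
Disallow: /?s=
Disallow: /search/
Crawl-delay: 8
Sitemap: https://example.com/post-sitemap.xml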