WP SEO Tip: robots.txt

Last Update: October 28, 2010


I was submitting sitemap links in Google Webmaster Central and ran into a problem that brought to my attention what a poor job WordPress does with its robots.txt handling.

*  If you have your blog set to public and do not choose the option of adding your Google sitemap URL to the robots file, WordPress will not create a robots.txt file at all.

*  If you choose to add the sitemap URL, WP creates a "virtual" robots.txt file that looks like this:

User-agent: *
Disallow:

Sitemap: YourDomainURL/sitemap.xml.gz

This allows all robots, the good and the bad, to crawl your site, and lets them know where your zipped Google sitemap can be found.

*  If you choose to make the site private while construction is going on, and you add the sitemap URL, then WP creates a "virtual" robots.txt file that looks like this:

User-agent: *
Disallow: /

Sitemap: YourRootDomain/sitemap.xml.gz

Because of the "/" after Disallow, this blocks all robots from the entire site: "/" matches every path, so nothing is left open to crawl. Therefore, you will not be indexed by Google or any other search engine.
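The whole difference is one character. Here are the two rules side by side, with explanatory comments (robots.txt treats anything after a # as a comment):

# Nothing after Disallow: nothing is blocked, everyone may crawl
User-agent: *
Disallow:

# "/" matches every path on the site, so everything is blocked
User-agent: *
Disallow: /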

So, for a standard WordPress install, those are your only choices:

1- No robots.txt file
2- A virtual robots file allowing everyone to crawl
3- A virtual robots file shutting the doors and locking everyone out!

Now, the robots.txt file is an important SEO feature not to be overlooked.

It can block unwanted "Bad" robots from crawling your site looking for email addresses and information you would not want known.

It can allow the "Good" robots to have access to your site.
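Blocking a specific bad robot just means naming it in its own User-agent section. For example, this pair of lines (it also appears in the full file at the end of this post) shuts out one of the known e-mail harvesters:

User-agent: EmailCollector
Disallow: /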

However, do you want the Good Guys to have access to everything?

On a standard WP blog that allows posters, a person's post will also show up on a category page, a tag page, an author page, and so on. Google sees all of this, and from what I have read, it can cause some duplicate content issues.

There are also a lot of items Google or any of the other Good Guys have no need to see, such as the login and registration files, just as an example.
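A few Disallow lines like these, taken straight from the full file at the end of this post, keep the bots away from the login, registration, tag, and author pages:

User-agent: *
Disallow: /wp-login.php
Disallow: /wp-register.php
Disallow: /tag
Disallow: /author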

So a properly constructed robots.txt file can add to your goal of good SEO for your site and help it achieve better rankings.

There are free WP plugins that can handle this for you (they are free from WP; this is not a sales pitch).

The one I use is the PC Robots.txt plugin.

This plugin lets me set up a virtual robots file that I can view and edit easily from the plugin settings page in my Admin dashboard.

It comes with a default setup already typed into the robots.txt file, disallowing a large number of known bad bots and specifically giving the Google bots permission. (This is confusing, because the rule says Disallow, but with nothing after the Disallow statement it is actually saying "allowed".)
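For example, this section from the default file looks like a ban, but it is actually an open door for Googlebot:

User-agent: Googlebot
Disallow: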

It also allows any bot not included in the Disallow: / list.

And it asks the good bots not to crawl certain directories and files.

This is an important feature you should look into.

I am going to close here. I hope this info is of help. Below is what the current robots.txt file looks like for one of my sites. It was created by the PC Robots.txt plugin and then edited by me to include some parameters I found while researching articles on the best robots.txt setups for WP blogs.

Take care! (It is a long list of bad boys, then three direct invites to Google, then instructions for everyone else at the bottom. It is very long; compare it to what WP gives you as a default, LOL.)
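
# Bad bots: each one is blocked from the entire site with "Disallow: /"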

User-agent: Alexibot
Disallow: /

User-agent: Aqua_Products
Disallow: /

User-agent: asterias
Disallow: /

User-agent: b2w/0.1
Disallow: /

User-agent: BackDoorBot/1.0
Disallow: /

User-agent: BlowFish/1.0
Disallow: /

User-agent: Bookmark search tool
Disallow: /

User-agent: BotALot
Disallow: /

User-agent: BotRightHere
Disallow: /

User-agent: BuiltBotTough
Disallow: /

User-agent: Bullseye/1.0
Disallow: /

User-agent: BunnySlippers
Disallow: /

User-agent: CheeseBot
Disallow: /

User-agent: CherryPicker
Disallow: /

User-agent: CherryPickerElite/1.0
Disallow: /

User-agent: CherryPickerSE/1.0
Disallow: /

User-agent: Copernic
Disallow: /

User-agent: CopyRightCheck
Disallow: /

User-agent: cosmos
Disallow: /

User-agent: Crescent Internet ToolPak HTTP OLE Control v.1.0
Disallow: /

User-agent: Crescent
Disallow: /

User-agent: DittoSpyder
Disallow: /

User-agent: EmailCollector
Disallow: /

User-agent: EmailSiphon
Disallow: /

User-agent: EmailWolf
Disallow: /

User-agent: EroCrawler
Disallow: /

User-agent: ExtractorPro
Disallow: /

User-agent: FairAd Client
Disallow: /

User-agent: Flaming AttackBot
Disallow: /

User-agent: Foobot
Disallow: /

User-agent: Gaisbot
Disallow: /

User-agent: GetRight/4.2
Disallow: /

User-agent: Harvest/1.5
Disallow: /

User-agent: hloader
Disallow: /

User-agent: httplib
Disallow: /

User-agent: HTTrack 3.0
Disallow: /

User-agent: humanlinks
Disallow: /

User-agent: InfoNaviRobot
Disallow: /

User-agent: Iron33/1.0.2
Disallow: /

User-agent: JennyBot
Disallow: /

User-agent: Kenjin Spider
Disallow: /

User-agent: Keyword Density/0.9
Disallow: /

User-agent: larbin
Disallow: /

User-agent: LexiBot
Disallow: /

User-agent: libWeb/clsHTTP
Disallow: /

User-agent: LinkextractorPro
Disallow: /

User-agent: LinkScan/8.1a Unix
Disallow: /

User-agent: LinkWalker
Disallow: /

User-agent: LNSpiderguy
Disallow: /

User-agent: lwp-trivial/1.34
Disallow: /

User-agent: lwp-trivial
Disallow: /

User-agent: Mata Hari
Disallow: /

User-agent: Microsoft URL Control - 5.01.4511
Disallow: /

User-agent: Microsoft URL Control - 6.00.8169
Disallow: /

User-agent: Microsoft URL Control
Disallow: /

User-agent: MIIxpc/4.2
Disallow: /

User-agent: MIIxpc
Disallow: /

User-agent: Mister PiX
Disallow: /

User-agent: moget/2.1
Disallow: /

User-agent: moget
Disallow: /

User-agent: Mozilla/4.0 (compatible; BullsEye; Windows 95)
Disallow: /

User-agent: MSIECrawler
Disallow: /

User-agent: NetAnts
Disallow: /

User-agent: NICErsPRO
Disallow: /

User-agent: Offline Explorer
Disallow: /

User-agent: Openbot
Disallow: /

User-agent: Openfind data gatherer
Disallow: /

User-agent: Openfind
Disallow: /

User-agent: Oracle Ultra Search
Disallow: /

User-agent: PerMan
Disallow: /

User-agent: ProPowerBot/2.14
Disallow: /

User-agent: ProWebWalker
Disallow: /

User-agent: psbot
Disallow: /

User-agent: Python-urllib
Disallow: /

User-agent: QueryN Metasearch
Disallow: /

User-agent: Radiation Retriever 1.1
Disallow: /

User-agent: RepoMonkey Bait & Tackle/v1.01
Disallow: /

User-agent: RepoMonkey
Disallow: /

User-agent: RMA
Disallow: /

User-agent: searchpreview
Disallow: /

User-agent: SiteSnagger
Disallow: /

User-agent: SpankBot
Disallow: /

User-agent: spanner
Disallow: /

User-agent: suzuran
Disallow: /

User-agent: Szukacz/1.4
Disallow: /

User-agent: Teleport
Disallow: /

User-agent: TeleportPro
Disallow: /

User-agent: Telesoft
Disallow: /

User-agent: The Intraformant
Disallow: /

User-agent: TheNomad
Disallow: /

User-agent: TightTwatBot
Disallow: /

User-agent: toCrawl/UrlDispatcher
Disallow: /

User-agent: True_Robot/1.0
Disallow: /

User-agent: True_Robot
Disallow: /

User-agent: turingos
Disallow: /

User-agent: TurnitinBot/1.5
Disallow: /

User-agent: TurnitinBot
Disallow: /

User-agent: URL Control
Disallow: /

User-agent: URL_Spider_Pro
Disallow: /

User-agent: URLy Warning
Disallow: /

User-agent: VCI WebViewer VCI WebViewer Win32
Disallow: /

User-agent: VCI
Disallow: /

User-agent: Web Image Collector
Disallow: /

User-agent: WebAuto
Disallow: /

User-agent: WebBandit/3.50
Disallow: /

User-agent: WebBandit
Disallow: /

User-agent: WebCapture 2.0
Disallow: /

User-agent: WebCopier v.2.2
Disallow: /

User-agent: WebCopier v3.2a
Disallow: /

User-agent: WebCopier
Disallow: /

User-agent: WebEnhancer
Disallow: /

User-agent: WebSauger
Disallow: /

User-agent: Website Quester
Disallow: /

User-agent: Webster Pro
Disallow: /

User-agent: WebStripper
Disallow: /

User-agent: WebZip/4.0
Disallow: /

User-agent: WebZIP/4.21
Disallow: /

User-agent: WebZIP/5.0
Disallow: /

User-agent: WebZip
Disallow: /

User-agent: Wget/1.5.3
Disallow: /

User-agent: Wget/1.6
Disallow: /

User-agent: Wget
Disallow: /

User-agent: wget
Disallow: /

User-agent: WWW-Collector-E
Disallow: /

User-agent: Xenu's Link Sleuth 1.1c
Disallow: /

User-agent: Xenu's
Disallow: /

User-agent: Zeus 32297 Webster Pro V2.9 Win32
Disallow: /

User-agent: Zeus Link Scout
Disallow: /

User-agent: Zeus
Disallow: /
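
# Google bots: explicitly allowed (an empty Disallow means full access)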

User-agent: Adsbot-Google
Disallow:

User-agent: Googlebot
Disallow:

User-agent: Mediapartners-Google
Disallow:
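
# Everyone else: allowed to crawl, but kept out of the directories and files below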

User-agent: *
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/cache/
Disallow: /wp-content/themes/
Disallow: /wp-login.php
Disallow: /wp-register.php
Disallow: /tag
Disallow: /author
Disallow: /wget/
Disallow: /httpd/
Disallow: /i/
Disallow: /f/
Disallow: /t/
Disallow: /c/
Disallow: /j/
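# Blocks any URL containing a query string (the * and ? wildcards are honored by Googlebot and some other crawlers, but not all)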
Disallow: /*?

Sitemap: http://YourSiteName/sitemap.xml.gz

Join the Discussion
All this is a foreign language for me. In the courtroom I've known statements to be disallowed. On the golf course I've had a few strokes disallowed. At least with website stuff that baffles me I always know you are around to lend a hand, and for that I'm very grateful.
Ben G.
Thanks Pixote! I am giving you some gold! I have just one question, though. Should I toggle off the "Add sitemap URL to the virtual robots.txt file" option in the XML-Sitemap Generator plugin settings? It says: "The virtual robots.txt generated by WordPress is used. A real robots.txt file must NOT exist in the blog directory!" Does this PC Robots.txt plugin create a real robots.txt file that might conflict with the XML Sitemap Generator plugin?

Pixote
When you click that box in the XML sitemap generator, that is what causes WP to build the virtual robots.txt. Just uncheck it; the PC Robots.txt plugin will add the sitemap URL to the new robots file.

There will be no conflict. The WP robots file will not be generated; it is replaced by the PC Robots.txt file.