Saturday, December 19, 2009

How to decide the robots.txt for Wordpress blogs

Robots.txt implements the Robots Exclusion Protocol for a website. It tells robots, bots, and web crawlers which parts of the site they may visit. Put simply, a well-behaved crawler visiting a website first checks the file /robots.txt in the site root, which defines the exclusion rules for that bot. A common example of a robots.txt file is shown below:
User-agent: *
Disallow: /

Where
User-agent: names the bots that a group of rules applies to, and Disallow: gives a URL path that those bots should not crawl.
User-agent: * means all types of bots.
Disallow: / means every file and page on the website is excluded.
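
One way to see how a crawler would read these two lines is Python's standard urllib.robotparser module. The sketch below is only an illustration: the helper is_allowed, the bot name ExampleBot, and the example.com URLs are made-up placeholders, not anything used on this blog.

```python
from urllib.robotparser import RobotFileParser

# The two-line example from above: every bot, every path excluded.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url under the given rules."""
    parser = RobotFileParser()
    # parse() takes the file as a sequence of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "ExampleBot" and example.com are hypothetical placeholders.
    for url in ("http://example.com/", "http://example.com/any/page.html"):
        print(url, is_allowed(ROBOTS_TXT, "ExampleBot", url))
```

Because the rule is a plain prefix match on "/", every URL on the site is reported as not allowed.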

A typical robots.txt file, the one used at honeytechblog, is listed below:
Sitemap: http://www.honeytechblog.com/sitemap.xml

User-Agent: *
Disallow: */?mobi*
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/
Disallow: /wp-
Disallow: /*.css$
Disallow: */forums/bb-login.php?*
Allow: twitter.honeytechblog.com

User-agent: Googlebot-Image
disallow:

User-agent: Mediapartners-Google*
disallow:

Explanations of the common exclusions and user agents used in robots.txt files

1.Sitemap: http://www.honeytechblog.com/sitemap.xml

Tells bots where the sitemap is located, which makes it easier for search bots to discover your new pages.

2.User-Agent: *

Already described above

3.Disallow: */?mobi*

Excludes URLs containing “/?mobi”. I use this rule to avoid the duplicate-content issues caused by the pages generated for mobile users. (You probably do not need it.)

4.Disallow: /wp-admin/

Excludes the WordPress admin pages from search engines. This keeps hack-prone pages and error pages out of the search listings.

5.Disallow: /wp-includes/

Excludes the WordPress includes folder, which should also be kept away from search bots. When WordPress runs into a plugin or update problem, it can emit serious error messages, and these can easily be indexed by search bots or found by hackers.

6.Disallow: /wp-content/

Again, there is no need to index all the files in wp-content.

7.Disallow: /wp-

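For security purposes it is better to hide all the core files and pages. Because rules are matched as prefixes, this one covers every path that begins with /wp-, including the three folders listed above.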

8.Disallow: /*.css$

Excludes all style sheets. (Use it if you also want to keep your CSS files away from bots.)
Note:
Disallow: /*.“file extension”$ excludes every URL that ends in the given file extension from the reach of bots, where “file extension” can be any extension you want, for example /*.txt$, /*.php$, /*.jsp$, or /*.jpg$.

9.Disallow: /*?

Disallows all URLs that contain “?”. (Keeps duplicate content, tracking URLs, and custom features out of the reach of bots.)

10.Disallow: /name/

Disallows any directory, folder, or category. For example, to disallow the “admin” folder you can simply use “Disallow: /admin/”; to disallow a category named “download” you can use “Disallow: /category/download*”; and for the uncategorized category you can use “Disallow: /category/uncategorized*”.
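
The * and $ characters in the rules above are wildcard extensions honored by the major search engines rather than part of the original protocol, and Python's built-in urllib.robotparser only does plain prefix matching. The sketch below shows one way such patterns could be checked by hand, assuming Google-style semantics: * matches any run of characters, a trailing $ anchors the end of the URL, and everything else is a prefix match. The function names rule_to_regex and rule_matches are my own hypothetical helpers, and the test paths are invented.

```python
import re


def rule_to_regex(rule: str) -> "re.Pattern[str]":
    """Translate a robots.txt path rule into a compiled regular expression."""
    anchored = rule.endswith("$")          # trailing $ means "URL must end here"
    body = rule[:-1] if anchored else rule
    # Escape the literal pieces and join them with ".*" where the rule had "*".
    pattern = ".*".join(re.escape(piece) for piece in body.split("*"))
    return re.compile("^" + pattern + ("$" if anchored else ""))


def rule_matches(rule: str, path: str) -> bool:
    """Return True if the rule applies to the URL path (including any query)."""
    if not rule:
        return False                       # an empty Disallow value excludes nothing
    return rule_to_regex(rule).match(path) is not None


if __name__ == "__main__":
    # Hypothetical paths, chosen only to exercise the rules discussed above.
    checks = [
        ("/*.css$", "/wp-content/themes/a/style.css"),
        ("/*.css$", "/style.css?ver=2"),
        ("/*?", "/2009/12/post/?replytocom=5"),
        ("/admin/", "/administrator/"),
        ("/category/download*", "/category/downloads/page/2"),
    ]
    for rule, path in checks:
        print(f"{rule!r} vs {path!r}: {rule_matches(rule, path)}")
```

Note that under prefix matching a trailing * adds nothing: “/category/download” and “/category/download*” cover the same URLs.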

Extra:

To allow image bots to crawl and index all the images of the website or blog (shown here for Google's image bot):

User-agent: Googlebot-Image
disallow:
Allow: /*.png$
Allow: /*.jpg$
Allow: /*.gif$
Allow: /*.jpeg$
Allow: /*.ico$
Allow: /images/

To let the AdSense bot crawl the entire site with ease:

User-agent: Mediapartners-Google*
disallow:
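
To check that bot-specific groups like this really override the general User-agent: * group, the rules can be fed to Python's standard urllib.robotparser. This is a rough sketch under a few assumptions: it uses a wildcard-free subset of the honeytechblog file shown earlier (the parser does not understand * or $ in paths), it drops the trailing * from Mediapartners-Google because the parser compares agent names literally, the bot name SomeOtherBot is invented, and site_maps() needs Python 3.8 or newer.

```python
from urllib.robotparser import RobotFileParser

# Wildcard-free subset of the honeytechblog robots.txt (an assumption made
# so that the standard-library parser can handle every line).
ROBOTS_TXT = """\
Sitemap: http://www.honeytechblog.com/sitemap.xml

User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/

User-agent: Googlebot-Image
Disallow:

User-agent: Mediapartners-Google
Disallow:
"""

ADMIN_URL = "http://www.honeytechblog.com/wp-admin/"

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse from memory, no network access


def report(user_agent: str) -> bool:
    """Print and return whether user_agent may fetch the admin URL."""
    allowed = parser.can_fetch(user_agent, ADMIN_URL)
    print(f"{user_agent}: {'allowed' if allowed else 'blocked'}")
    return allowed


if __name__ == "__main__":
    # "SomeOtherBot" is a hypothetical crawler that falls back to the * group.
    for agent in ("Googlebot-Image", "Mediapartners-Google", "SomeOtherBot"):
        report(agent)
    print("Sitemaps:", parser.site_maps())
```

A bot with its own group follows only that group, so the empty disallow: line leaves the whole site open to it, while every other bot still falls under the User-agent: * rules.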
