A robots.txt file can be used to keep search engines from crawling, and so from indexing, parts of your website. Sometimes there are pages you would rather not have indexed, and that is where a robots.txt file is useful. It is also a good way to avoid duplicate content penalties from search engines.
As the name suggests, robots.txt is a plain text file. It should be reachable at a path like this:
http://www.example.com/robots.txt or http://blog.example.com/robots.txt
Always put the robots.txt file in the root directory, not in a subdirectory. When a robot visits your site, it first looks for robots.txt in the root directory; if the file is not there, the robot simply assumes that you want all of your pages indexed.
Let’s look at the syntax of a robots.txt file.
If you want to allow all robots to index all of your pages, use this robots.txt file:
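Because the file always sits at the root of the host, its address can be worked out from any page URL on that host. A small Python sketch of that rule; the helper name robots_url and the sample page URLs are made up for illustration:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host (subdomain and port included),
    # and replace the path, query and fragment with /robots.txt.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://www.example.com/blog/2010/post.html"))
# http://www.example.com/robots.txt
print(robots_url("http://blog.example.com/category/seo/?page=2"))
# http://blog.example.com/robots.txt
```

Note that each subdomain gets its own robots.txt, which is why the two example hosts produce two different addresses.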
User-agent: *
Disallow:
Here, User-agent: * means the record applies to every robot that visits your site to crawl its pages, and the empty Disallow: line means nothing is off limits.
If you want to ban all robots from indexing your site, use this robots.txt file:
User-agent: *
Disallow: /
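If you want to see the difference between an empty Disallow: and Disallow: / without waiting for a crawler, Python's standard urllib.robotparser module can evaluate the rules locally. This is only a sketch: the helper name allowed, the robot name AnyBot and the page URL are assumptions for the example, and real crawlers may parse the file differently from this module.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, agent, url):
    """Parse robots.txt text and report whether agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


ALLOW_ALL = "User-agent: *\nDisallow:\n"
BAN_ALL = "User-agent: *\nDisallow: /\n"
PAGE = "http://www.example.com/about.html"  # hypothetical page

print(allowed(ALLOW_ALL, "AnyBot", PAGE))  # True
print(allowed(BAN_ALL, "AnyBot", PAGE))    # False
```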
To ban a specific robot from indexing your pages, use a record like this:
User-agent: Googlebot
Disallow: /
To ban part of the site, such as the /Category directory and all of its subpages:
User-agent: *
Disallow: /Category/
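A Disallow path works as a prefix, and the match is case-sensitive. The sketch below checks a few paths against the rule above with Python's standard urllib.robotparser; the robot name AnyBot and the sample paths are invented for the example, and the results describe this module's behaviour only.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /Category/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def check(path):
    """Return True if a robot may fetch the given path on example.com."""
    return parser.can_fetch("AnyBot", "http://www.example.com" + path)


print(check("/Category/"))               # False: the directory itself
print(check("/Category/seo/post.html"))  # False: everything below it
print(check("/category/seo/post.html"))  # True: lowercase path does not match
print(check("/about.html"))              # True: outside the directory
```

So if your real URLs use /category/ in lowercase, the Disallow line has to be written in lowercase too.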
To allow only a specific robot to index your pages:
User-agent: Googlebot
Disallow:
User-agent: *
Disallow: /
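A common worry here is the order of the records. The same rules with the catch-all record first, i.e.
User-agent: *
Disallow: /
User-agent: Googlebot
Disallow:
should not ban Googlebot. Under the Robots Exclusion Protocol (RFC 9309), a robot obeys the record that names it and falls back to the * record only when no record matches its name, so Googlebot stays allowed here and every other robot stays banned. Not every robot is guaranteed to parse the file that carefully, though, so listing the specific record first, as in the previous example, is the safer habit.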
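Whether the order of the records matters depends on how a given robot parses the file. As one data point, here is a quick check of both orderings with Python's standard urllib.robotparser. The helper allowed, the robot name OtherBot and the page URL are assumptions for the example, and the result says nothing about robots that use a different parser.

```python
from urllib.robotparser import RobotFileParser

SPECIFIC_FIRST = """\
User-agent: Googlebot
Disallow:
User-agent: *
Disallow: /
"""

CATCH_ALL_FIRST = """\
User-agent: *
Disallow: /
User-agent: Googlebot
Disallow:
"""


def allowed(robots_txt, agent, url="http://www.example.com/index.html"):
    """Parse robots.txt text and report whether agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


for label, rules in [("specific first", SPECIFIC_FIRST),
                     ("catch-all first", CATCH_ALL_FIRST)]:
    print(label, allowed(rules, "Googlebot"), allowed(rules, "OtherBot"))
# specific first True False
# catch-all first True False
```

With this parser both files behave the same way: Googlebot is allowed and every other robot is banned.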
Many sites run into duplicate content penalties. If the same content is reachable from two or more different URLs, it is considered duplicate content. If you have /category or /archives directories, you risk a duplicate content penalty. Either show only post excerpts on the category and archive pages, or use robots.txt to ban indexing of those pages. Your robots.txt file will then look like this:
User-agent: *
Disallow: /category/
Disallow: /archives/
See the complete list of robots, further guides on robots.txt files, and tools to validate a robots.txt file.
Keep in mind that robots are not obliged to follow the robots.txt file. Don’t rely on robots.txt if you have very sensitive data that must not be indexed under any circumstances. Use other means, such as password-protected files.
Also, banning most of your website is not a good idea for SEO. Search engines can only list pages they are allowed to crawl, so a site with most of its pages banned has little to offer them.
One last important thing: if you use a sitemap.xml file to submit your sitemap to search engines, make sure that every URL listed in sitemap.xml is crawlable and that you have not accidentally banned any of them in robots.txt.
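This cross-check is easy to script. The sketch below pulls the URLs out of a sitemap and lists the ones that robots.txt would block, using only Python's standard library. The sample sitemap and rules, the helper names sitemap_urls and blocked_urls, and the default robot name are assumptions for the example; in practice you would load your own two files.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

# XML namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

ROBOTS_TXT = """\
User-agent: *
Disallow: /category/
Disallow: /archives/
"""

SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/</loc></url>
  <url><loc>http://www.example.com/archives/2010/</loc></url>
  <url><loc>http://www.example.com/posts/robots-txt.html</loc></url>
</urlset>
"""


def sitemap_urls(sitemap_xml):
    """Return the <loc> values of a (non-index) sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip()
            for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
            if loc.text]


def blocked_urls(robots_txt, urls, agent="Googlebot"):
    """Return the URLs that robots_txt forbids for the given robot."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(agent, url)]


print(blocked_urls(ROBOTS_TXT, sitemap_urls(SITEMAP_XML)))
# ['http://www.example.com/archives/2010/']
```

An empty list means nothing in the sitemap is banned, at least as far as this parser is concerned.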
It’s always a good idea to use Google Webmaster Tools to validate your robots.txt file. There you can check whether specific URLs are allowed or banned for robots. (Google has added features to its robots.txt tool to report syntax errors. You can also list your sitemap.xml file in the robots.txt file.)
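Listing the sitemap in robots.txt is done with a Sitemap: line that holds the full URL of the sitemap file. A minimal sketch of what that looks like and how to read it back, assuming Python 3.8 or later (where RobotFileParser.site_maps() is available); the sitemap URL itself is a made-up example:

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /category/
Sitemap: http://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns the listed sitemap URLs, or None if there are none.
sitemaps = parser.site_maps()
print(sitemaps)  # ['http://www.example.com/sitemap.xml']
```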
Great article. I’ll definitely be back. All the best, Leonel
Thank you for this post, it was very informative, even two years later. I’m interested to know about the effect of disallow. Do you know if the robots will stop coming back to check on your URL if they find a robots.txt like:
User-agent: *
Disallow: /
I am trying to come up with a beta release strategy. I’d like to prevent indexing during the private beta phase, then turn indexing on when the site goes public. I’d love to hear your thoughts on this matter.
Is it OK if we just add this code to our robots.txt file? (For WordPress)
User-agent: *
Disallow: /
Or do we have to add some more tags?