Monday 29 August 2011

All you want to know about Robots.txt


Remove your website from search engines

If you wish to exclude your website from search engines' indexes, you can place a file called robots.txt at the root of your server. This is the standard protocol that most web crawlers observe for excluding a web server or directory from an index. Please note that Googlebot does not interpret a 401/403 response ("Unauthorized"/"Forbidden") to a robots.txt fetch as a request not to crawl any pages on the site.
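If you want to double-check that the file is reachable, a quick fetch of the robots.txt URL shows exactly what crawlers will see. The following is a minimal Python sketch; yourserver.com is a placeholder for your own host:

import urllib.error
import urllib.request

url = "http://yourserver.com/robots.txt"  # placeholder host
try:
    with urllib.request.urlopen(url) as response:
        print(response.status)  # 200 means the file is being served
        print(response.read().decode("utf-8", errors="replace"))
except urllib.error.HTTPError as err:
    # As noted above, Googlebot does not treat a 401/403 here
    # as a request not to crawl the site.
    print("robots.txt fetch failed with HTTP status", err.code)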
To remove your site from search engines and prevent all robots from crawling it in the future, place the following robots.txt file in your server root:
User-agent: *
Disallow: /
To remove your site from Google only and prevent just Googlebot from crawling it in the future, place the following robots.txt file in your server root:

User-agent: Googlebot
Disallow: /
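You can sanity-check rules like these with Python's standard urllib.robotparser module. This is just an illustrative sketch that parses the Googlebot-only rule above directly from strings (yourserver.com is a placeholder):

from urllib.robotparser import RobotFileParser

# The Googlebot-only rule from above; the "User-agent: *" variant
# would block every compliant crawler instead.
rules = RobotFileParser()
rules.parse(["User-agent: Googlebot", "Disallow: /"])

print(rules.can_fetch("Googlebot", "http://yourserver.com/page.html"))      # False
print(rules.can_fetch("SomeOtherBot", "http://yourserver.com/page.html"))   # True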
Each port must have its own robots.txt file. In particular, if you serve content via both http and https, you'll need a separate robots.txt file for each of these protocols. For example, to allow Googlebot to index all http pages but no https pages, you'd use the robots.txt files below.

For your http protocol (http://yourserver.com/robots.txt):
User-agent: *
Allow: /
For the https protocol (https://yourserver.com/robots.txt):
User-agent: *
Disallow: /
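To verify that each protocol really serves the rules you expect, you can point one parser at each robots.txt URL. A minimal sketch, assuming yourserver.com is your own host and both URLs are reachable over the network:

from urllib.robotparser import RobotFileParser

# Each scheme serves its own robots.txt, so check them separately.
for scheme in ("http", "https"):
    rules = RobotFileParser()
    rules.set_url(scheme + "://yourserver.com/robots.txt")
    rules.read()  # fetches the file over the network
    allowed = rules.can_fetch("Googlebot", scheme + "://yourserver.com/page.html")
    print(scheme, "allows Googlebot:", allowed)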


Google will continue to exclude your site or directories from successive crawls as long as the robots.txt file exists in the web server root. If you do not have access to the root level of your server, you may place a robots.txt file at the same level as the files you want to remove. Doing this and submitting via the automatic URL removal system will cause a temporary, 180-day removal of your site from the Google index, regardless of whether you remove the robots.txt file after your request is processed. (Keeping the robots.txt file at the same level would require you to return to the URL removal system every 180 days to reissue the removal.)

Remove part of your website

Option 1: Robots.txt
To remove directories or individual pages of your website, you can place a robots.txt file at the root of your server. For information on how to create a robots.txt file, see the Robot Exclusion Standard. When creating your robots.txt file, please keep the following in mind: when deciding which pages to crawl on a particular host, Googlebot will obey the first record in the robots.txt file with a User-agent starting with "Googlebot". If no such entry exists, it will obey the first entry with a User-agent of "*". Additionally, Google has introduced increased flexibility to the robots.txt standard through the use of asterisks: Disallow patterns may include "*" to match any sequence of characters, and patterns may end in "$" to indicate the end of a name.
To remove all pages under a particular directory (for example, lemurs), you'd use the following robots.txt entry:
User-agent: Googlebot
Disallow: /lemurs
To remove all files of a specific file type (for example, .gif), you'd use the following robots.txt entry:

User-agent: Googlebot
Disallow: /*.gif$
To remove dynamically generated pages, you'd use this robots.txt entry:

User-agent: Googlebot
Disallow: /*?
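The "*" and "$" extensions are easiest to understand as simple regular expressions anchored at the start of the URL path. The sketch below is a rough Python approximation of the matching logic, not Google's actual implementation:

import re

def robots_pattern_to_regex(pattern):
    # "*" matches any run of characters; a trailing "$" anchors the
    # match at the end of the path; everything else is a literal prefix.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in pattern)
    return re.compile(regex + ("$" if anchored else ""))

for disallow, path in [("/lemurs", "/lemurs/tail.html"),
                       ("/*.gif$", "/images/dogs.gif"),
                       ("/*?", "/search?q=lemurs")]:
    blocked = robots_pattern_to_regex(disallow).match(path) is not None
    print(disallow, path, "blocked" if blocked else "allowed")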
Option 2: Meta tags

Another standard, which can be more convenient for page-by-page use, involves adding a <META> tag to an HTML page to tell robots not to index the page. This standard is described at http://www.robotstxt.org/wc/exclusion.html#meta.
To prevent all robots from indexing a page on your site, you'd place the following meta tag into the <HEAD> section of your page:
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
To allow other robots to index the page on your site, preventing only Google's robots from indexing it, you'd use the following tag:
<META NAME="GOOGLEBOT" CONTENT="NOINDEX, NOFOLLOW">
To allow robots to index the page on your site but instruct them not to follow outgoing links, you'd use the following tag:
<META NAME="ROBOTS" CONTENT="NOFOLLOW">
Note: If you believe your request is urgent and cannot wait until the next time Google crawls your site, use the automatic URL removal system. In order for this automated process to work, the webmaster must first insert the appropriate meta tags into the page's HTML code. Doing this and submitting via the automatic URL removal system will cause a temporary, 180-day removal of these pages from the Google index, regardless of whether you remove the robots.txt file or meta tags after your request is processed.

Remove an image from Google's Image Search

To remove an image from Google's image index, add a robots.txt file to the root of the server. (If you can't put it in the server root, you can put it at directory level.)
Example: If you want Google to exclude the dogs.jpg image that appears on your site at www.yoursite.com/images/dogs.jpg, create a file at www.yoursite.com/robots.txt and add the following text:
User-agent: Googlebot-Image
Disallow: /images/dogs.jpg
To remove all the images on your site from our index, place the following robots.txt file in your server root:

User-agent: Googlebot-Image
Disallow: /
This is the standard protocol that most web crawlers observe for excluding a web server or directory from an index. More information on robots.txt is available at http://www.robotstxt.org/wc/norobots.html.
Additionally, Google has introduced increased flexibility to the robots.txt standard through the use of asterisks: Disallow patterns may include "*" to match any sequence of characters, and patterns may end in "$" to indicate the end of a name. To remove all files of a specific file type (for example, to keep .jpg images but exclude .gif images), you'd use the following robots.txt entry:
User-agent: Googlebot-Image
Disallow: /*.gif$
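As a quick check, Python's urllib.robotparser can confirm the literal-path rule from the dogs.jpg example above. Note that the standard-library parser does not understand Google's "*" and "$" extensions, so only the exact path is tested in this sketch:

from urllib.robotparser import RobotFileParser

# The single-image rule from the dogs.jpg example.
rules = RobotFileParser()
rules.parse(["User-agent: Googlebot-Image", "Disallow: /images/dogs.jpg"])

print(rules.can_fetch("Googlebot-Image", "http://www.yoursite.com/images/dogs.jpg"))  # False
print(rules.can_fetch("Googlebot-Image", "http://www.yoursite.com/images/cats.jpg"))  # True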


