How to Prevent Pages Being Indexed by Search Engines

Keep those robots at bay

As much as webmasters are desperate for search engine spiders to crawl their pages, there are often times when you want a page or section to be ignored. They might be non-content pages such as terms and conditions or a privacy policy. They could be test pages or just private pages that you don’t want to share.

This article will show you all the different ways to keep search engines away from your content. The most appropriate method to use for your own situation is up to you.

Robots.txt

The standard way to prevent search engines from accessing a page is to block it with a robots.txt file, which goes in the base (root) directory of your website.

You use the following syntax:

User-agent: *
Disallow: /sessions/
Disallow: /cgi-bin/

The * after User-agent means that the rules apply to all spiders. Each Disallow line names a directory that should be excluded.
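One way to check that a set of rules does what you expect is to run it through a robots.txt parser before uploading it. The following is a minimal sketch using Python's standard urllib.robotparser module; the bot name ExampleBot and the example.com URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The same rules as in the example above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /sessions/
Disallow: /cgi-bin/
"""


def build_parser(robots_txt):
    """Parse robots.txt text held in a string (no network access needed)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "ExampleBot" is a hypothetical spider name; it is matched by the * rule.
for path in ("/index.html", "/sessions/abc123", "/cgi-bin/form.pl"):
    url = "http://example.com" + path
    print(path, "allowed" if parser.can_fetch("ExampleBot", url) else "blocked")
```

This only tells you how a well-behaved spider should interpret the file; it cannot tell you whether a particular spider will respect it.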

You can do other things with robots.txt:

User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /

The rules in this example mean that Googlebot is allowed to access the entire site (note that there is nothing after Disallow), while every other user-agent is blocked. There are more advanced examples of how to build a robots.txt file at robotstxt.org.
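The per-spider behaviour of these rules can be confirmed in the same way. This is a rough sketch, again using Python's urllib.robotparser; allowed_agents is a helper written for this example, and ExampleBot stands in for any spider other than Googlebot.

```python
from urllib.robotparser import RobotFileParser

# Googlebot may crawl everything; every other spider is shut out.
GOOGLEBOT_ONLY = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""


def allowed_agents(robots_txt, agents, url):
    """Map each user-agent name to True if it may fetch url, else False."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = allowed_agents(
    GOOGLEBOT_ONLY,
    ["Googlebot", "ExampleBot"],  # "ExampleBot" is a hypothetical spider
    "http://example.com/page.html",
)
print(result)  # {'Googlebot': True, 'ExampleBot': False}
```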

There are two important things to remember about using robots.txt. The first is that a spider can ignore it. This is particularly true of robots that harvest email addresses or scrape your content. The second is that anybody can read the file. Don’t use robots.txt to hide secret information, as a quick glance at the file will tell anyone where all those secret directories are.

You should also be aware that even though search engines won’t spider a blocked page, its URL can still appear in search engine result lists, especially if the page is linked to from another site.

META Tags

You can use meta tags to control how spiders index specific pages. They look something like this:

<meta name="robots" content="noindex,nofollow" />

The options that you have concerning robots are:

  • “index” – tells the robot to index that page
  • “noindex” – tells the robot not to index that page
  • “follow” – the robot should follow all the links on that page
  • “nofollow” – the robot should not spider any links from that page

The default is “index,follow” so if you want your page spidered and the links followed then you can omit the tag.

When you use a meta tag to “noindex” a page, you will not have the problem of the URL appearing in the search engine results; it will be omitted completely. For this to work the spider must be able to read the tag, so do not also block the page in robots.txt. Remember that a page that is not indexed will still be crawled and, unless you have also specified “nofollow” in your tag, the links on that page will be spidered too.
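To see how a spider might read these options, here is a small sketch in Python using the standard html.parser module. It pulls the robots meta tag out of a page and reports whether the page may be indexed and its links followed, falling back to the “index,follow” default when no tag is present. The class and function names are invented for this example, and real search engines recognise more options than the four listed above.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives from any <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        for directive in attr.get("content", "").split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_policy(html):
    """Return the index/follow policy, defaulting to index,follow."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return {
        "index": "noindex" not in parser.directives,
        "follow": "nofollow" not in parser.directives,
    }


page = '<html><head><meta name="robots" content="noindex,nofollow" /></head></html>'
print(robots_policy(page))  # {'index': False, 'follow': False}
```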

Nofollow Links

You can use “nofollow” on a specific hyperlink to stop search engine spiders from following that link and crawling the page. It is supported by Google, Yahoo! and MSN/Live. You use the following syntax:

<a href="http://example.com/page.html" rel="nofollow">Example.com</a>

This syntax was originally created due to the problem of comment spam on guestbooks and blogs. The original idea was that a site could be linked to without passing any PageRank or link influence.

While this method is useful to prevent a spider following a link to a page, remember that it only affects the specific hyperlink it is attached to. If another link without nofollow points to the same page then that link will be followed and the page will still be indexed.
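The limitation is easy to demonstrate. The sketch below, written in Python with the standard html.parser module, collects the links on a page and returns those a spider is still free to follow. The class name, function name and sample markup are all made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Records every link as a (href, is_nofollow) pair."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if href is None:
            return
        # rel can hold several space-separated values, e.g. "external nofollow".
        rel_tokens = (attr.get("rel") or "").lower().split()
        self.links.append((href, "nofollow" in rel_tokens))


def followable_targets(html):
    """Return the set of URLs reachable through at least one followed link."""
    collector = LinkCollector()
    collector.feed(html)
    return {href for href, nofollow in collector.links if not nofollow}


sample = """
<a href="http://example.com/page.html" rel="nofollow">Example.com</a>
<a href="http://example.com/other.html" rel="nofollow">Other</a>
<a href="http://example.com/page.html">Example.com again</a>
"""
print(followable_targets(sample))  # {'http://example.com/page.html'}
```

Here page.html is still reachable because the second link to it has no nofollow, while other.html is only ever linked with nofollow.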

Password protected pages

A common way to keep private pages secure is to protect them with a password. Since the page cannot be accessed without the proper credentials, search engine spiders cannot index the page. As with some of the other examples, the URL may still appear in search engine results if a link to the page is found.

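The following code shows one way to protect a page with a username and password using PHP and basic HTTP authentication. The $validUser and $validPass values are placeholders for illustration; a real site would load its credentials from somewhere secure:

<?php
// Placeholder credentials for illustration only.
$validUser = 'admin';
$validPass = 'change-me';

$user = $_SERVER['PHP_AUTH_USER'] ?? null;
$pass = $_SERVER['PHP_AUTH_PW'] ?? '';

// Ask for credentials if none were sent or if they do not match.
if ($user === null || !hash_equals($validUser, $user) || !hash_equals($validPass, $pass)) {
    header('WWW-Authenticate: Basic realm="My Realm"');
    header('HTTP/1.0 401 Unauthorized');
    echo 'Text to send if user hits Cancel button';
    exit;
}

// Escape the username before output, and never echo the password back.
echo 'Hello ' . htmlspecialchars($user) . '.';
?>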

To read more about authentication in PHP, visit php.net.
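To make clear why a spider cannot get past this kind of protection, here is a rough Python sketch of the check a server performs on the Authorization header of each request. It is written in Python only so that it can be run on its own. The function name and the credentials are invented for the example, and it is not a substitute for your web server's own authentication support.

```python
import base64
import hmac


def basic_auth_status(authorization, valid_user, valid_password):
    """Return the HTTP status a protected page would send for this header."""
    if not authorization or not authorization.startswith("Basic "):
        return 401  # no credentials sent, which is what a spider does
    try:
        raw = base64.b64decode(authorization[len("Basic "):], validate=True)
        decoded = raw.decode("utf-8")
    except ValueError:  # covers bad base64 and bad UTF-8
        return 401
    user, separator, password = decoded.partition(":")
    if not separator:
        return 401
    # compare_digest avoids leaking information through timing differences.
    user_ok = hmac.compare_digest(user.encode(), valid_user.encode())
    password_ok = hmac.compare_digest(password.encode(), valid_password.encode())
    return 200 if user_ok and password_ok else 401


# Hypothetical credentials, for illustration only.
good_header = "Basic " + base64.b64encode(b"admin:change-me").decode("ascii")

print(basic_auth_status(None, "admin", "change-me"))         # 401
print(basic_auth_status(good_header, "admin", "change-me"))  # 200
```

A spider sends no Authorization header, so it only ever receives the 401 response and never sees the page content.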

Offline Pages

It may seem simple but keeping your web pages offline is the perfect way to prevent them being indexed. Maybe they can be put on an intranet rather than fully online. If collaboration or viewing over the Internet is essential then this method isn’t suitable, but on many occasions it is.

Remove URLs From Search Index

Google and Yahoo! both have tools that allow you to remove specific URLs from their search index relatively quickly.

Google Webmaster Tools

If you have registered with Google Webmaster Tools then you can use this to remove a URL.

  1. From the dashboard click the website that contains the URL.
  2. Then click “Tools”, followed by “Remove URLs.” Google will give you some tips about getting your URL removed and keeping it out of the index.
  3. The URL you want to remove should be blocked from further spidering. If you don’t ensure this then the page will simply be indexed again at some point in the future. Once you are certain this is the case, click “New Removal Request.”
  4. You then choose what you want to remove. You can remove an individual URL, an entire directory, a whole site or the Google cached copy of a page. Make your selection and click “Next.”
  5. Enter the URL to be removed and click “Submit Removal Request.”

Yahoo! Site Explorer

To use Yahoo! Site Explorer you need to verify your site.

  1. From the Site Explorer homepage either search for the URL you want to remove or click “Explore” and locate it manually.
  2. Click [Delete URL/Path] next to the URL you wish to remove. Any URLs below that particular folder will also be deleted.
  3. You will be presented with a confirmation page which will list all pages to be removed. You can edit this list to keep certain URLs. Click “Update” to create a new list.
  4. Click “Yes” to confirm your deletions.

A note on removing URLs with webmaster tools: both Google’s and Yahoo!’s tools require you to verify your site, and the deletions only stay in effect for as long as your site is verified. If at any point your site becomes unverified, your deleted URLs will return to the index.

Another important note. When you remove a URL, make sure it is blocked from further crawling via one of the other methods listed above. Otherwise your deleted URLs will be reindexed soon after they are removed.

Discussion

5 comments for “How to Prevent Pages Being Indexed by Search Engines”

  1. As a webmaster, you definitely should use user-agent headers to manage server traffic. But understand that this is purely a pragmatic tactic and not a serious security measure.

    I wrote more about this here:

    Webmaster Tips: Blocking Selected User-Agents
    http://faseidl.com/public/item/213126

    Posted by F. Andy Seidl | September 20, 2008, 10:12 pm
  2. Thanks! This helped me a lot. I just needed one page to not get indexed, but I needed my visitors to see it – so I opted for the ‘nofollow’ method.

    The robots.txt thing is really confusing – although, everywhere else it seems to be the only option? It seems pretty useless if the spiders can still get through?

    When you are talking about the user-agent in the robots.txt file, you mention putting it in the website’s main directory – wouldn’t that make the entire website prone to not getting indexed? How would you use that method for a specific webpage?

    For example, I want ‘domain.com/love.html’ indexed, but I don’t want ‘domain.com/love/me.html’ indexed … where would the robots.txt go? In the ‘Love’ Directory?

    Thanks again!

    Posted by Andrew | February 18, 2010, 6:08 pm
  3. Spiders can still get through as it is only designed for honest spiders such as search engines.

    There should only be a single robots.txt file in the root directory.

    By default, domain.com/love.html will be indexed, but if you want domain.com/love/me.html not to be indexed you can use:

    User-agent: *
    Disallow: /love/me.html

    Posted by corbyboy | February 18, 2010, 6:55 pm