• How do I prevent search engines from listing my pages?
• How can I make sure the search engines include my pages?
How to prevent the search engines from including your pages in their index
There are three main ways to accomplish this.
Meta tags
Most search engines will honor the “robots” meta-tag. This is an HTML tag that’s inserted into the HEAD section of your Web page. The tag looks like this:
<meta name="robots" content="noindex,nofollow">
When the search engine sees “noindex”, it leaves the page out of its index. “nofollow” tells it not to follow any links on that page to other pages. You can use either directive by itself, or combine them as shown above.
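For example, to keep a page out of the index while still letting the spider follow its links, you could use “noindex” alone:

<meta name="robots" content="noindex">

Likewise, “nofollow” by itself leaves the page in the index but stops the spider from following its links:

<meta name="robots" content="nofollow">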
The robots.txt file
The major search engine providers have agreed on a standard, known as the Robots Exclusion Standard, for excluding Web pages from their indexes. You implement it by creating a robots.txt file in the root directory of your Web site. When a search engine spider first visits a site, it looks for this file.
The structure of a robots.txt file is pretty simple. It contains a list of spiders (identified by “User-agent”), each followed by a list of what that spider shouldn’t index. Since most people don’t need different rules for different spiders, you can use “*” to match all of them. Comments begin with a pound sign, “#”.
User-agent: *          # Rules apply to all search engines
Disallow: /temp        # Exclude anything starting with "/temp"
                       #   (can be directories or files)
Disallow: /private/    # Exclude the entire directory "/private"
Disallow: /secret.html # Don't index the file secret.html
The advantage of using robots.txt is that one rule can cover a whole directory, instead of adding meta tags to each file. It also makes it easier to track exactly what is, and isn’t, being excluded from the search engines. But since the file must live in your Web server’s root directory, you will have to ask the Webmaster to make additions or changes – search engines generally do not honor robots.txt files found anywhere else.
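If you do need different rules for different spiders, list each one under its own “User-agent” line. A sketch (the spider name “BadBot” is just a placeholder for illustration):

User-agent: BadBot     # This spider may index nothing at all
Disallow: /

User-agent: *          # All other spiders: everything except /private/
Disallow: /private/

A spider reads the first block whose User-agent matches it, so the more specific entries should come before the “*” catch-all.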
Access control
Directories that are accessible only to certain users cannot be indexed by the search engines, since their spiders cannot get into the directory to read the files.
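How you set this up depends on your server. As one sketch, on an Apache server that allows .htaccess overrides, you could require a password with HTTP Basic authentication; the file paths below are only examples, and the password file would be created with Apache’s htpasswd utility:

# .htaccess placed in the directory to protect
# (paths here are examples -- adjust for your server)
AuthType Basic
AuthName "Private area"
AuthUserFile /home/example/.htpasswd
Require valid-user

Spiders that request pages in this directory receive a “401 Authorization Required” response instead of the content, so nothing there can be indexed.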
How to get the search engines to index your pages
In theory, the major search engines will automatically find (and index) your pages if a link to them exists from another page – so you normally shouldn’t have to do anything. On rare occasions, however, you may find you need to specifically tell them about your page.
Each of the search engines has a form you can use for this purpose. You just have to visit the site, and find the link called “add your URL”, “index your pages”, or something similar. The procedures (and URLs) each engine uses change from time to time, so we’re not including links to these pages – but normally you can find links to them from the search engines’ front pages.