Phpnuke Database - Controlling msn bot

Controlling msn bot

The MSN Search web crawler, MSNBot, lets you control which pages MSN Search indexes and how often MSNBot accesses your website.

You can prevent (the pain in the butt) MSNBot and other standards-compliant crawlers from crawling a server, or from collecting information and links from specific pages on your website, by using a robots.txt file and/or meta tags.
This is a popular request, because not everybody is happy with the annoying bot.

Note
If other sites link to your site, your site's URL and any text you include in HTML anchor tags may still be added to our index.
However, your site content is not added to the index.
Use the robots.txt file to control access to your website or part of the server.

To control how and when your website is crawled, create a robots.txt file in the top-level (root) directory of your website.
In the robots.txt file, you can specify which web crawlers to allow or block.
Note that while MSNBot complies with the standards for robots.txt, not all web crawlers comply.

To conform to the Robots Exclusion Standard, MSNBot searches for robots.txt.
When you create the file, make sure that the file is named robots.txt.
Crawling and indexing restrictions may not work correctly if you name the file robot.txt.

Each time MSNBot crawls your website, it looks in your web server's root directory for a robots.txt file.
If the file exists, MSNBot checks to see if MSNBot is an allowed user agent, and if any crawling or indexing restrictions have been set.

To set which web crawlers can access your website, use the syntax in the table below for your robots.txt file.
MSN Search also includes image searching provided by Picsearch.
If you do not want your images indexed, you can block the Picsearch crawler, Psbot, as described in the following table.

Text strings in the robots.txt file are not case-sensitive.

To do this:	Use this syntax (one directive per line; separate records with a blank line):
Allow all robots full access and prevent "file not found: robots.txt" errors	Create an empty robots.txt file
Allow all robots complete access	User-agent: *
	Disallow:
Allow only MSNBot access	User-agent: msnbot
	Disallow:
	
	User-agent: *
	Disallow: /
Exclude all robots from the entire server	User-agent: *
	Disallow: /
Exclude only MSNBot	User-agent: msnbot
	Disallow: /
Exclude only Psbot (Picsearch)	User-agent: psbot
	Disallow: /
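
One way to sanity-check rules like these before uploading them is Python's standard-library urllib.robotparser module. The sketch below feeds it the "Allow only MSNBot access" rules from the table and asks whether a given user agent may fetch a page. The helper name can_fetch and the example.com URL are placeholders chosen for illustration, and this parser only approximates how MSNBot itself reads the file.

```python
from urllib.robotparser import RobotFileParser

# The "Allow only MSNBot access" rules from the table, one directive per line.
ALLOW_ONLY_MSNBOT = """\
User-agent: msnbot
Disallow:

User-agent: *
Disallow: /
"""


def can_fetch(rules: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())  # parse in-memory text, no network access
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    url = "http://example.com/index.html"  # placeholder URL
    for agent in ("msnbot", "psbot"):
        print(agent, can_fetch(ALLOW_ONLY_MSNBOT, agent, url))
```

Under these rules the check should report that msnbot is allowed and psbot is not. Note that urllib.robotparser does plain prefix matching, so it is not suitable for testing wildcard rules.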

Restrict indexing and link crawling within your website

You can block MSNBot from crawling specific file types linked to your website by specifying MSNBot as the user-agent for a Disallow tag that specifies the file types to exclude.

To do this:	Use this syntax:	Examples
Restrict MSNBot from indexing specific file types	User-agent: msnbot	User-agent: msnbot
	Disallow: /*.[file extension]$	Disallow: /*.PDF$
	(the "$" is required)	Disallow: /*.jpeg$
		Disallow: /*.exe$
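
To see which paths a rule of the form /*.[file extension]$ would cover, the pattern can be translated into a regular expression. This is a minimal sketch under two assumptions: that "*" matches any run of characters and a trailing "$" anchors the end of the URL path, and that matching is case-insensitive (as stated above for text strings in robots.txt). The function names are hypothetical, and MSNBot's real matching logic may differ.

```python
import re


def rule_to_regex(rule: str) -> "re.Pattern[str]":
    """Translate a Disallow pattern with '*' and a trailing '$' into a regex."""
    anchored = rule.endswith("$")
    body = rule[:-1] if anchored else rule
    # Escape literal pieces and join them with '.*' where the rule had '*'.
    pattern = ".*".join(re.escape(piece) for piece in body.split("*"))
    if anchored:
        pattern += "$"
    return re.compile(pattern, re.IGNORECASE)


def is_blocked(path: str, rules: list) -> bool:
    """Return True if any rule matches the path from its first character."""
    return any(rule_to_regex(rule).match(path) for rule in rules)


if __name__ == "__main__":
    rules = ["/*.PDF$", "/*.jpeg$", "/*.exe$"]
    for path in ("/docs/report.pdf", "/index.html", "/files/setup.exe"):
        print(path, is_blocked(path, rules))
```

Because of the "$" anchor, a path such as /docs/report.pdf.html is not matched by /*.PDF$ in this sketch; only paths that end in the extension are.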

Use metadata tags to control page indexing and link crawling

You can allow MSNBot to crawl your website and still restrict access to specific web pages and documents by using the noindex and nofollow meta tags within the page code.
The noindex tag allows the web page to be retrieved by MSNBot, but blocks indexing of its content.
The nofollow tag blocks the web crawler from following links in the web page that go to other web pages or documents.
Note that not all web crawling robots obey these tags.

To apply these restrictions to MSNBot only, use msnbot as the NAME value in the tag syntax examples below; to apply them to all robots, use robots or "*" instead. You can use each tag alone or combine both tags into a single meta tag.

To do this:	Add this to the page header:
Restrict MSNBot from indexing a page	<META NAME="msnbot" CONTENT="noindex" />
Restrict all robots from indexing a page	<META NAME="*" CONTENT="noindex" />
Restrict MSNBot from following links on a page	<META NAME="msnbot" CONTENT="nofollow" />
Restrict all robots from following links on a page	<META NAME="robots" CONTENT="nofollow" />
Block MSNBot from both indexing and following links	<META NAME="msnbot" CONTENT="noindex,nofollow" />
Prevent MSNBot from caching a page	<META NAME="msnbot" CONTENT="nocache" /> or <META NAME="msnbot" CONTENT="noarchive" />
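
To confirm that a page actually carries the tags you intended, you can extract them with Python's standard-library html.parser. The sketch below collects the CONTENT values of meta tags whose NAME is a given user agent, "robots", or "*". The class and function names are hypothetical, and the sketch says nothing about how MSNBot itself prioritizes conflicting tags.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect comma-separated CONTENT values of <meta> tags, keyed by NAME."""

    def __init__(self):
        super().__init__()
        self.directives = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        content = attr.get("content") or ""
        if name:
            values = {v.strip().lower() for v in content.split(",") if v.strip()}
            self.directives.setdefault(name, set()).update(values)


def directives_for(html: str, agent: str) -> set:
    """Return the directives that apply to `agent` (including 'robots' and '*')."""
    parser = RobotsMetaParser()
    parser.feed(html)
    result = set()
    for name in ("robots", "*", agent.lower()):
        result |= parser.directives.get(name, set())
    return result


if __name__ == "__main__":
    page = '<html><head><META NAME="msnbot" CONTENT="noindex,nofollow" /></head></html>'
    print(sorted(directives_for(page, "msnbot")))
```

For the sample page, the msnbot lookup should return both noindex and nofollow, while a lookup for another agent such as psbot returns nothing, because the tag names msnbot specifically.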

Limit crawl frequency

If you occasionally get high traffic from MSNBot, you can add a crawl-delay parameter to the robots.txt file to set the minimum time, in seconds, between MSNBot requests to your website.
To do this, add this syntax to your robots.txt file: