April 27, 2006

Robots, Spiders, Crawlers…What the?

Richard Lee @ 11:44 am

Sometimes referred to as ‘Spiders’ or ‘Web Crawlers’, Web Robots are automated programs which traverse the Web’s hypertext structure, retrieving a document and then recursively retrieving all documents referenced within it.

In terms of web development, we’re most concerned with Indexing Robots, but there are plenty of proprietary and non-proprietary robots out there (see “what kind of robots are there?” - robotstxt.org). In fact, there’s nothing stopping you from developing your own.

Indexing Robots, such as the Googlebot, index web pages by using the following methods: reading HTML titles, reading the first few paragraphs within a page, parsing the entire HTML contents and weighting keywords, or reading META tags. FYI, it has been indicated that very few robots index META tags effectively.

Ultimately a Search Engine will search through the databases of HTML documents indexed by its robot to recall a list of relevant web pages based on a user query.

Why might Web Robots be considered bad?

Understandably the directory operation provided by Indexing Robots is an important service. But there are a number of reasons why a developer may consider Web Robots evil:

Certain robot implementations can overload (and in the past have overloaded) networks and servers.

As Robots are largely self-sufficient, a badly written robot can cause the same error many times over before it is noticed, possibly consuming many resources as well.

Robots are operated by humans, who make mistakes in configuration, or simply don’t consider the implications of their actions.

An example may be the indexing of copyrighted, explicit and otherwise protected content on a website - Google has recently been involved in a number of lawsuits over this.

Web-wide indexing robots build a central database of documents, which doesn’t scale too well to millions of documents on millions of sites.

Obviously there is an upper limit to how much can be indexed, which is why Indexing Robots such as the Googlebot employ very intelligent algorithms to decide whether a particular page should be indexed or not. However, these algorithms are kept secret, and getting a high Google ranking, for example, is more experience and luck than an exact science!

So how can you prevent bots from indexing your site?

During 1993 and 1994 there were a number of reports of robots crawling WWW servers when they weren’t welcome for various reasons. In some cases these reports were robot specific, e.g. certain robots swamped servers with rapid-fire requests, or retrieved the same files repeatedly. In other situations robots traversed parts of WWW servers that weren’t suitable, e.g. very deep virtual trees, duplicated information, temporary information, or cgi-scripts with side-effects (such as voting). These incidents indicated the need for a standard which would allow WWW servers to indicate to robots which parts of their server should not be accessed. This standard essentially came in the form of a plaintext file ‘robots.txt’, which specified an access policy for robots visiting the site.

How to use robots.txt

NB: Sometimes it may not be possible for you to create a robots.txt file because you do not administer the server. There are alternatives you can use within your own pages; see “robots.txt Alternatives” below.

The format and semantics of the “/robots.txt” file (as of writing this post):

The file consists of one or more records separated by one or more blank lines (terminated by CR, CR/NL, or NL). Each record contains lines of the form “<field>:<optionalspace><value><optionalspace>”. The field name is case insensitive.

Comments can be included in the file using UNIX Bourne shell conventions: the ‘#’ character is used to indicate that preceding space (if any) and the remainder of the line up to the line termination is discarded. Lines containing only a comment are discarded completely and therefore do not indicate a record boundary.

The record starts with one or more User-agent lines, followed by one or more Disallow lines, as detailed below. Unrecognised headers are ignored.
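To make the record rules above concrete, here is a minimal sketch in Python. It is not any real robot's parser, just one reading of the rules: comments are stripped, comment-only lines do not end a record, blank lines do, and unrecognised fields are ignored. The function name parse_records and the "Crawl-speed" field are made up for illustration.

```python
KNOWN_FIELDS = {"user-agent", "disallow"}

def parse_records(text):
    """Split robots.txt text into records, each a list of (field, value) pairs."""
    records, current = [], []
    for raw in text.splitlines():
        if raw.strip() == "":
            # A genuinely blank line closes the current record (if any).
            if current:
                records.append(current)
                current = []
            continue
        # Drop the comment and any space before it.
        line = raw.split("#", 1)[0].strip()
        if not line:
            # Comment-only line: discarded, and NOT a record boundary.
            continue
        field, sep, value = line.partition(":")
        field = field.strip().lower()  # field names are case insensitive
        if not sep or field not in KNOWN_FIELDS:
            continue  # unrecognised headers are ignored
        current.append((field, value.strip()))
    if current:
        records.append(current)
    return records

sample = """\
# robots.txt for http://www.example.com/

User-agent: *
Disallow: /cyberworld/map/ # This is an infinite virtual URL space
# a comment-only line does not split the record
Disallow: /tmp/

User-agent: cybermapper
Crawl-speed: fast
Disallow:
"""

records = parse_records(sample)
for record in records:
    print(record)
```

Run against the sample, this yields two records: one for ‘*’ with two Disallow values, and one for cybermapper with an empty Disallow.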

User-agent

The value of this field is the name of the robot the record is describing access policy for. If more than one User-agent field is present the record describes an identical access policy for more than one robot. At least one field needs to be present per record.

The robot should be liberal in interpreting this field. A case insensitive substring match of the name without version information is recommended.

If the value is ‘*‘, the record describes the default access policy for any robot that has not matched any of the other records. It is not allowed to have multiple such records in the “/robots.txt” file.
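As a rough sketch of the matching described above (the function name is hypothetical, and it assumes version information follows a ‘/’, as in “CyberMapper/1.2”), a robot could decide whether a User-agent value applies to it like this:

```python
def record_applies(record_agent, robot_name):
    """Return True if a User-agent value from robots.txt applies to robot_name."""
    if record_agent.strip() == "*":
        # The default policy; a robot should only fall back to it
        # when no other record has matched.
        return True
    # Drop version information, assumed here to follow a "/".
    name_only = robot_name.split("/", 1)[0]
    # Liberal, case insensitive substring match.
    return record_agent.strip().lower() in name_only.lower()

print(record_applies("cybermapper", "CyberMapper/1.2"))
print(record_applies("googlebot", "CyberMapper/1.2"))
```

The first call matches and the second does not; real robots may be stricter or looser than this.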

Disallow

The value of this field specifies a partial URL that is not to be visited. This can be a full path, or a partial path; any URL that starts with this value will not be retrieved. For example, Disallow: /help disallows both /help.html and /help/index.html, whereas Disallow: /help/ would disallow /help/index.html but allow /help.html. An empty value indicates that all URLs can be retrieved. At least one Disallow field needs to be present in a record.
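The /help example boils down to a plain “starts with” test. A small Python sketch, with the hypothetical helper is_allowed standing in for whatever a real robot does internally:

```python
def is_allowed(path, disallow_values):
    """Return False if path starts with any non-empty Disallow value."""
    for prefix in disallow_values:
        # An empty value disallows nothing, so skip it.
        if prefix and path.startswith(prefix):
            return False
    return True

for rule in ("/help", "/help/", ""):
    for path in ("/help.html", "/help/index.html"):
        print(repr(rule), path, is_allowed(path, [rule]))
```

With ‘/help’ both paths are refused, with ‘/help/’ only /help/index.html is refused, and with an empty value both are allowed.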

The presence of an empty “/robots.txt” file has no explicit associated semantics; it will be treated as if it was not present, i.e. all robots will consider themselves welcome.

Examples:

The following example “/robots.txt” file specifies that no robots should visit any URL starting with “/cyberworld/map/” or “/tmp/”, or the file “/foo.html”:

# robots.txt for http://www.example.com/
 
User-agent: *
Disallow: /cyberworld/map/ # This is an infinite virtual URL space
Disallow: /tmp/ # these will soon disappear
Disallow: /foo.html

This example “/robots.txt” file specifies that no robots should visit any URL starting with “/cyberworld/map/”, except the robot called “cybermapper”:

# robots.txt for http://www.example.com/
 
User-agent: *
Disallow: /cyberworld/map/ # This is an infinite virtual URL space
 
# Cybermapper knows where to go.
User-agent: cybermapper
Disallow:

This example indicates that no robots should visit this site further:

# go away
User-agent: *
Disallow: /
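If you want to check how a file like these will be read before uploading it, Python's standard library ships a parser in urllib.robotparser. The sketch below feeds it the cybermapper example; “otherbot” is a made-up robot name, and other robots may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
# robots.txt for http://www.example.com/

User-agent: *
Disallow: /cyberworld/map/ # This is an infinite virtual URL space

# Cybermapper knows where to go.
User-agent: cybermapper
Disallow:
"""

parser = RobotFileParser()
# parse() takes the file's lines, so no network access is needed here.
parser.parse(ROBOTS_TXT.splitlines())

map_url = "http://www.example.com/cyberworld/map/index.html"
home_url = "http://www.example.com/index.html"

cybermapper_on_map = parser.can_fetch("cybermapper", map_url)
otherbot_on_map = parser.can_fetch("otherbot", map_url)
otherbot_on_home = parser.can_fetch("otherbot", home_url)

print("cybermapper may fetch the map:", cybermapper_on_map)
print("otherbot may fetch the map:", otherbot_on_map)
print("otherbot may fetch the home page:", otherbot_on_home)
```

Here cybermapper is let into the map, while otherbot is kept out of it but can still fetch the home page.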

robots.txt Alternatives

All is not lost: there is a standard for using HTML META tags to keep robots out of your documents.

The basic idea is that if you include a tag like:

<meta name="ROBOTS" content="NOINDEX" />

in your HTML document, that document won’t be indexed.

If you do:

<meta name="ROBOTS" content="NOFOLLOW" />

the links in that document will not be parsed by the robot.
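For the curious, this is roughly how a robot might read those tags. It is only a sketch using Python's built-in html.parser; the class and function names are invented for the example, and it assumes several values may be combined with commas, as in content="NOINDEX, NOFOLLOW".

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the values of any <meta name="robots"> tags in a document."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        for part in content.split(","):
            if part.strip():
                self.directives.add(part.strip().lower())

def robots_directives(html):
    """Return the set of lower-cased robots directives found in html."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives

page = '<html><head><meta name="ROBOTS" content="NOINDEX, NOFOLLOW" /></head></html>'
found = robots_directives(page)
print("index this page:", "noindex" not in found)
print("follow its links:", "nofollow" not in found)
```

A well-behaved robot would skip indexing when it sees noindex and skip the page's links when it sees nofollow; nothing forces a badly behaved one to do either.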

Final thoughts

Just remember what you’re disallowing. Listing sensitive files within your robots.txt file is not a good idea, as an ill-willed robot may still traverse the relevant file or directory, or worse, a malicious user could use your robots.txt to find where your sensitive files are! A better solution to prevent robots from crawling sensitive areas is to create a ‘/norobots/’ directory, make it un-listable on your server, THEN just add the ‘/norobots/’ directory name to your robots.txt:

User-Agent: *
Disallow: /norobots/

1 Comment »

  1. Careful not to list too specifically about private bits of information, such as Disallow: /myprivate/stuff.html or /newsletter/subscribers.csv as a hacker will be able to see your robots.txt and may hunt it for information about your site - not that you would leave something as open as that, but just remember to use .htaccess to make areas of your site forbidden. Just be conscious of the information you are making publicly available about your site.

    Comment by Cameron Manderson — April 27, 2006 @ 12:39 pm
