NAME

WWW::Crawler::RobotRules - Deals with robot rules (robots.txt) for WWW::Crawler


SYNOPSIS

    package My::Crawler;
    use WWW::Crawler::LWP;
    use WWW::Crawler::RobotRules;
    use vars qw(@ISA);

    # We want inheritance to be as follows:
    # WWW::Crawler::RobotRules -> WWW::Crawler::LWP -> WWW::Crawler
    # so that fetched() is called properly.  YUCK!
    BEGIN {@WWW::Crawler::RobotRules::ISA=qw(WWW::Crawler::LWP);}

    @ISA=qw(WWW::Crawler::RobotRules);
    
    sub fetched 
    {
        my($self, $page)=@_;
        # only need this if we need to overload fetched()... which an
        # application rarely wants to anyway
        return $self->SUPER::fetched($page) if $page->{robots_host};
        # ....
    }
    sub error
    {
        my($self, $page, $error)=@_;
        warn "BOO HOO!  Can't fetch $page->{uri}\n";
        $self->SUPER::error($page, $error);
    }

    sub include 
    {
        my($self, $uri)=@_;
        return unless $::LINK eq lc substr($uri, 0, length($::LINK));
        return $self->SUPER::include($uri);
    }

    package main;

    use vars qw($LINK);
    $LINK="http://www.yahoo.com/";;

    my $crawler=My::Crawler->new();
    $crawler->schedule_link($LINK);
    $crawler->run();


DESCRIPTION

WWW::Crawler::RobotRules is a subclass of WWW::Crawler that adds robots.txt processing to your crawler. It "piggybacks" on your object hierarchy's fetch()/fetched() methods, so you might have to play with various @ISA variables to get the inheritance just right.


METHODS


error($self, $page, $error)

If an error occurred while fetching robots.txt, a permissive rule set (WWW::Crawler::RobotRules::YesMan) that allows all URIs is installed for that host, and all pending URIs for the host are then rescheduled. If you want better handling (for example, dropping all URIs for a host whose robots.txt returns a 500 or is unreachable), overload this method. NB: check $page->{checked_host} to make sure you are dealing with a fetched robots.txt, not with a regular fetch.
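
For example, a subclass might prefer to drop every pending URI for a host whose robots.txt came back with a server error, rather than crawl it under the permissive default. The following is only a sketch: it assumes the error string carries the HTTP status code and that $self->{PENDING} is keyed by bare host name as described under new(); check the source for the exact conventions.

    # In your crawler subclass (My::Crawler in the SYNOPSIS above):
    use URI;

    sub error
    {
        my($self, $page, $error)=@_;

        # Regular pages: fall through to the normal error handling.
        return $self->SUPER::error($page, $error)
            unless $page->{checked_host};

        # A robots.txt fetch failed.  If the server itself looks broken,
        # forget the pending URIs for that host instead of letting the
        # default handler reschedule them under a permissive rule set.
        if($error =~ /\b500\b/) {   # assumes $error carries the HTTP status
            my $host=URI->new($page->{uri})->host;
            my $n=@{ $self->{PENDING}{$host} || [] };
            warn "Dropping $n pending URIs for $host\n";
            delete $self->{PENDING}{$host};
            return;
        }

        # Any other failure gets the default behaviour: a YesMan rule set
        # and a reschedule of the pending URIs.
        return $self->SUPER::error($page, $error);
    }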


fetched($self, $page)

Parses robots.txt with WWW::RobotRules, using $self->{UA}->agent() as the robot name if it can. Then reschedules any pending URIs for that host, now that we have valid rules.
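
WWW::RobotRules has a small interface, and it helps to see it outside of the crawler. A stand-alone sketch (the agent name, host and robots.txt content are made up):

    use WWW::RobotRules;

    # The agent name decides which User-agent sections of robots.txt apply.
    my $rules=WWW::RobotRules->new('MyCrawler/1.0');

    # Feed it the URL the rules came from, plus the raw robots.txt text.
    my $robots_txt="User-agent: *\nDisallow: /private/\n";
    $rules->parse('http://www.example.com/robots.txt', $robots_txt);

    # From then on, candidate URIs on that host can be checked:
    print "allowed\n" if $rules->allowed('http://www.example.com/index.html');
    print "blocked\n" unless $rules->allowed('http://www.example.com/private/x.html');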


include($self, $uri)

If we have rules for a host, the URI is accepted. If not, the URI is put in a pending list and a subrequest is sent for the robots.txt of that host.
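
One practical consequence: any filtering a subclass does should happen before delegating here, so the crawler never issues a robots.txt subrequest for a host it will never visit. A sketch of that ordering (the site restriction is just an example):

    sub include
    {
        my($self, $uri)=@_;
        # Reject foreign URIs first; otherwise SUPER::include() would queue
        # a robots.txt subrequest for every host we stumble across.
        return unless $uri =~ m!^http://www\.example\.com/!i;
        # Now let WWW::Crawler::RobotRules decide: accept if rules are known,
        # otherwise park the URI and fetch robots.txt for that host.
        return $self->SUPER::include($uri);
    }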


new($package)

Default constructor requires no parameters.

Creates the following members:

PENDING
    Hashref of host => arrayref of pending URIs on that host.

RULES
    Hashref of host => object encapsulating the robots.txt rules for that
    host. Note that only allowed() is called on these objects, so you
    don't have to use WWW::RobotRules.
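
For instance, a duck-typed rule set that refuses everything for a host (the mirror image of the YesMan class mentioned under error()) could be as small as this hypothetical sketch:

    package My::Rules::NoWay;

    # Anything with an allowed() method can sit in $self->{RULES}{$host};
    # this one simply denies every URI.
    sub new     { my($package)=@_; return bless {}, $package; }
    sub allowed { return 0; }

    # Hypothetical use from a subclass, once a host is deemed off limits:
    #   $self->{RULES}{$host}=My::Rules::NoWay->new();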


next_link

Overloaded so that the crawler keeps going while there are still pending URIs (URIs waiting for a host's robots.txt).


schedule_pending($self, $host)

Reschedules all the pending URIs for $host. Pending URIs are set in include(). Generally called from fetched() or error().


AUTHOR

Philip Gwyn <perl@pied.nu>


SEE ALSO

WWW::Crawler, perl(1).