WWW::Crawler::RobotRules - Deals with robot rules (robots.txt) for WWW::Crawler
    package My::Crawler;
    use WWW::Crawler::LWP;
    use WWW::Crawler::RobotRules;
    use vars qw(@ISA);

    # We want inheritance to be as follows
    #   WWW::Crawler::RobotRules -> WWW::Crawler::LWP -> WWW::Crawler
    # so that fetched() is called properly.  YUCK!
    BEGIN { @WWW::Crawler::RobotRules::ISA = qw(WWW::Crawler::LWP); }
    @ISA = qw(WWW::Crawler::RobotRules);

    sub fetched {
        my($self, $page) = @_;
        # only need this if we want to overload fetched()... which an
        # application rarely wants to do anyway
        return $self->SUPER::fetched($page) if $page->{robots_host};
        # ....
    }

    sub error {
        my($self, $page, $error) = @_;
        warn "BOO HOO!  Can't fetch $page->{uri}\n";
        $self->SUPER::fetched($page);
    }

    sub include {
        my($self, $uri) = @_;
        return unless $::LINK eq lc substr($uri, 0, length($::LINK));
        return $self->SUPER::include($uri);
    }

    package main;

    use vars qw($LINK);
    $LINK = "http://www.yahoo.com/";

    my $crawler = My::Crawler->new();
    $crawler->schedule_link($LINK);
    $crawler->run();
WWW::Crawler::RobotRules is a subclass of WWW::Crawler that adds robots.txt processing to your crawler. It "piggybacks" on your object hierarchy's fetch()/fetched() functions, so you might have to play with the various @ISA variables to get the inheritance just right.
If an error occurred while fetching robots.txt, a bogus RobotRules object (WWW::Crawler::RobotRules::YesMan) is installed that allows all URIs to be fetched, and all pending URIs for that host are then rescheduled. If you want better handling (for example, dropping all URIs for a host whose robots.txt returns a 500 or is otherwise off the air), you should overload this method. NB: check $page->{checked_host} to make sure you are dealing with a fetched robots.txt, not with a regular fetch.
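For example, a stricter override could refuse everything for a host whose robots.txt could not be fetched, instead of falling back to the permissive YesMan rules. This is only a sketch: the RULES member name and the My::Crawler::NoMan helper class are made up for illustration; only $page->{checked_host} is documented above.

    package My::Crawler::NoMan;
    # Deny-all stand-in; only allowed() is ever called on a rules object.
    sub new     { return bless {}, shift }
    sub allowed { return 0 }

    package My::Crawler;

    sub error {
        my($self, $page, $error) = @_;

        if ($page->{checked_host}) {
            # A robots.txt subrequest failed: refuse the whole host.
            warn "robots.txt for $page->{checked_host} failed: $error\n";
            $self->{RULES}{ $page->{checked_host} } = My::Crawler::NoMan->new();
            return;
        }

        # A regular page failed; let the parent class deal with it.
        return $self->SUPER::error($page, $error);
    }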
Parses robots.txt with WWW::RobotRules, using $self->{UA}->agent() as the robot name (if it can). Then reschedules any pending URIs for that page's host, now that we have valid rules.
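In other words, the parsing step amounts to roughly the following. This is a sketch, not the module's actual code: WWW::RobotRules is a real CPAN module, but the RULES member, the content key, and the helper name below are assumptions.

    use WWW::RobotRules;

    sub _parse_robots {                 # hypothetical helper, for illustration
        my($self, $page) = @_;

        my $name  = $self->{UA} ? $self->{UA}->agent() : 'WWW::Crawler';
        my $rules = WWW::RobotRules->new($name);

        # $page->{uri} is the robots.txt URL; $page->{content} is assumed
        # to hold its body
        $rules->parse($page->{uri}, $page->{content});

        # remember the rules; later checks call $rules->allowed($uri)
        $self->{RULES}{ $page->{robots_host} } = $rules;
    }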
If we already have rules for the URI's host, the URI is checked against them (via allowed()). If not, the URI is put on a pending list and a subrequest is sent for that host's robots.txt.
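A rough sketch of that accept-or-defer logic, assuming the rules and pending lists live in RULES and PENDING members (neither name is documented) and approximating the subrequest with schedule_link():

    use URI;

    sub include {
        my($self, $uri) = @_;
        my $host = URI->new($uri)->host;

        if (my $rules = $self->{RULES}{$host}) {
            # Rules already known for this host: honour them.
            return $rules->allowed($uri);
        }

        # No rules yet: park the URI and fetch robots.txt first.
        push @{ $self->{PENDING}{$host} }, $uri;
        $self->schedule_link("http://$host/robots.txt");
        return;
    }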
Default constructor requires no parameters.
Creates the following members:
A hashref mapping each host to an object that encapsulates that host's robots.txt. Note that only allowed() is called on these objects, so you don't have to use WWW::RobotRules.
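Any object with an allowed() method will do. A permissive stand-in could be as small as this (the class name and the RULES member below are illustrative only):

    package My::AllowEverything;
    sub new     { return bless {}, shift }
    sub allowed { return 1 }            # every URI on this host is fair game

    # ...installed by hand for a host you trust:
    # $crawler->{RULES}{'www.example.com'} = My::AllowEverything->new();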
Makes sure we keep going if we have pending URIs.
Reschedules all the pending URIs for $host. Pending URIs are set in include(). Generally called from fetched() or error().
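What that rescheduling amounts to, roughly (the method name and the PENDING member are placeholders, not documented internals):

    sub reschedule_pending {
        my($self, $host) = @_;

        # hand every parked URI back to the crawler's queue
        my $pending = delete($self->{PENDING}{$host}) || [];
        $self->schedule_link($_) for @$pending;
    }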
Philip Gwyn <perl@pied.nu>
WWW::Crawler, perl(1).