WWW::Crawler::LWP - A web crawler that uses LWP and HTML::LinkExtor
    package My::Crawler;
    use WWW::Crawler::LWP;
    our @ISA = qw(WWW::Crawler::LWP);   # package variable, so inheritance actually works

    sub new
    {
        my $package = shift @_;
        my $self = $package->SUPER::new(@_);
        # Identify ourselves to the servers we visit
        $self->{UA}->agent("My::Crawler 0.1 (Mozilla;1;Linux)");
        return $self;
    }

    sub parse
    {
        my($self, $page) = @_;
        my $data = $self->SUPER::parse($page);
        # Also pull the page title out of the raw document
        $data->{title} = $1 if $page->{document} =~ m(<title>(.+?)</title>)i;
        return $data;
    }

    sub error
    {
        my($self, $page, $response) = @_;
        print "$page->{uri} wasn't fetched: ".$response->code."\n";
    }

    sub process
    {
        my($self, $page) = @_;
        print "Doing something to $page->{parsed}{title}\n";
    }

    sub include
    {
        my($self, $uri) = @_;
        # Only follow links that start with $main::LINK
        return unless $::LINK eq lc substr($uri, 0, length($::LINK));
        return $self->SUPER::include($uri);
    }

    package main;

    use vars qw($LINK);
    $LINK = "http://www.yahoo.com/";

    my $crawler = My::Crawler->new();
    $crawler->schedule_link($LINK);
    $crawler->run();
WWW::Crawler::LWP is a barebones subclass of WWW::Crawler. It should be subclassed for each application.

NOTE: it does not respect robots.txt, it does not restrict its activity to a single server, and it makes no attempt to be kind to any given server.
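If an application needs to be better behaved, the hooks shown in the SYNOPSIS are enough to add that behaviour yourself. Below is a minimal sketch (not part of the module) that keeps the crawl on one host, honours robots.txt via WWW::RobotRules, and wraps fetched() to pause between successful fetches; the package name, the host, the seen_robots member and the two-second delay are all assumptions made for illustration.

    package My::PoliteCrawler;
    use strict;
    use URI;
    use WWW::Crawler::LWP;
    use WWW::RobotRules;
    our @ISA = qw(WWW::Crawler::LWP);

    my $rules = WWW::RobotRules->new("My::PoliteCrawler 0.1");

    sub include
    {
        my($self, $uri) = @_;
        my $u = URI->new($uri);

        # Stay on a single server
        return unless $u->can('host') and $u->host eq 'www.example.com';

        # Fetch and parse the site's robots.txt the first time we see this host
        unless ($self->{seen_robots}{$u->host_port}++) {
            my $robots_url = $u->scheme . '://' . $u->host_port . '/robots.txt';
            my $response   = $self->{UA}->get($robots_url);
            $rules->parse($robots_url, $response->content) if $response->is_success;
        }
        return unless $rules->allowed($uri);

        return $self->SUPER::include($uri);
    }

    sub fetched
    {
        my $self = shift @_;
        sleep 2;            # crude politeness delay between successive pages
        return $self->SUPER::fetched(@_);
    }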
error($page, $response)

Called when an error occurs while fetching a URI. $response is the HTTP::Response object for the failed request. The default is to do nothing; overload this method if you want to report errors somewhere.
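For instance, a subclass that wants failures reported on STDERR with the full status line could do something like this (a sketch, not part of the module):

    sub error
    {
        my($self, $page, $response) = @_;
        warn scalar(localtime), ": $page->{uri} failed: ", $response->status_line, "\n";
    }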
parse($page)

Uses HTML::LinkExtor to extract all the links from a page.
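For reference, the kind of extraction HTML::LinkExtor performs looks roughly like this when used on its own (a standalone sketch, not the module's source; the sample document and base URL are made up):

    use HTML::LinkExtor;
    use URI;

    my $base = "http://www.example.com/";
    my $html = '<html><body><a href="/about.html">About</a></body></html>';

    my @links;
    my $extor = HTML::LinkExtor->new(sub {
        my($tag, %attr) = @_;
        return unless $tag eq 'a' and defined $attr{href};
        # Turn relative links into absolute ones before collecting them
        push @links, URI->new_abs($attr{href}, $base)->as_string;
    });
    $extor->parse($html);
    $extor->eof;

    print "$_\n" for @links;    # prints http://www.example.com/about.html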
fetch($page)

Uses LWP::UserAgent to fetch a page. Deals with error conditions, calling error() on failure and fetched() if there wasn't an error.
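The control flow amounts to roughly the following (a sketch rather than the module's actual source; the document member name comes from the SYNOPSIS, the rest of the attribute handling is an assumption):

    sub fetch
    {
        my($self, $page) = @_;
        my $response = $self->{UA}->get($page->{uri});
        if ($response->is_success) {
            $page->{document} = $response->content;   # raw HTML, later used by parse()
            $self->fetched($page);
        }
        else {
            $self->error($page, $response);
        }
    }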
new()

Constructor; requires no parameters. Creates the following members:

    UA - the LWP::UserAgent object used to fetch pages.
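Because UA is a plain LWP::UserAgent object, the calling code (or a subclass constructor, as in the SYNOPSIS) can tune it after construction; the settings below are only examples:

    my $crawler = My::Crawler->new();
    $crawler->{UA}->timeout(30);                    # give slow servers 30 seconds
    $crawler->{UA}->from('webmaster@example.com');  # advertise a contact address
    $crawler->{UA}->env_proxy;                      # honour http_proxy and friends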
Philip Gwyn <perl@pied.nu>
See also: WWW::Crawler, LWP::UserAgent.