WWW::Crawler::LWP - A web crawler that uses LWP and HTML::LinkExtor
  package My::Crawler;
  use WWW::Crawler::LWP;
  use vars qw(@ISA);
  @ISA = qw(WWW::Crawler::LWP);

  sub new {
      my $package = shift @_;
      my $self = $package->SUPER::new(@_);
      $self->{UA}->agent("My::Crawler 0.1 (Mozilla;1;Linux)");
      return $self;
  }

  sub parse {
      my($self, $page) = @_;
      my $data = $self->SUPER::parse($page);
      $data->{title} = $1 if $page->{document} =~ m(<title>(.+?)</title>)i;
      return $data;
  }

  sub error {
      my($self, $page, $response) = @_;
      print "$page->{uri} wasn't fetched: ".$response->code."\n";
  }

  sub process {
      my($self, $page) = @_;
      print "Doing something to $page->{parsed}{title}\n";
  }

  sub include {
      my($self, $uri) = @_;
      return unless $::LINK eq lc substr($uri, 0, length($::LINK));
      return $self->SUPER::include($uri);
  }

  package main;

  use vars qw($LINK);
  $LINK = "http://www.yahoo.com/";

  my $crawler = My::Crawler->new();
  $crawler->schedule_link($LINK);
  $crawler->run();
WWW::Crawler::LWP is a barebones subclass of WWW::Crawler. It should be subclassed for each application.

NOTE: it does not respect robots.txt, nor does it restrict its activity to one server, nor does it act kindly toward a given server.
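If polite behaviour is needed, the subclass has to supply it. One possible approach, assuming the crawler only calls standard LWP::UserAgent methods on its UA member, is to swap that member for an LWP::RobotUA, which fetches and obeys robots.txt and pauses between requests to the same host. The My::PoliteCrawler package name and the agent/contact strings below are purely illustrative:

  package My::PoliteCrawler;
  use strict;
  use warnings;
  use WWW::Crawler::LWP;
  use LWP::RobotUA;

  use vars qw(@ISA);
  @ISA = qw(WWW::Crawler::LWP);

  sub new {
      my $package = shift @_;
      my $self = $package->SUPER::new(@_);
      # LWP::RobotUA is a drop-in LWP::UserAgent subclass that honours
      # robots.txt and rate-limits requests to each host.
      $self->{UA} = LWP::RobotUA->new("My-PoliteCrawler/0.1", 'crawler@example.com');
      $self->{UA}->delay(1/60);    # wait at least one second between requests
      return $self;
  }

  1;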
The error() method is called when an error occurs while fetching a URI; $response is the HTTP::Response object for the failed request. The default is to do nothing, so overload this method if you want errors reported somewhere.
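For example, a subclass might simply log the failing URI together with the HTTP status line; the warn-based reporting below is just one possibility:

  # In a subclass of WWW::Crawler::LWP.
  sub error {
      my($self, $page, $response) = @_;
      # $response is an HTTP::Response, so status_line() gives "404 Not Found" etc.
      warn "Could not fetch $page->{uri}: ", $response->status_line, "\n";
  }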
The parse() method uses HTML::LinkExtor to extract all the links from a page.
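As a rough illustration of what HTML::LinkExtor-based extraction looks like (a sketch only, not the module's actual implementation; the $html and $base values are assumed):

  use strict;
  use warnings;
  use HTML::LinkExtor;
  use URI;

  my $base = "http://example.com/";                        # assumed base URL
  my $html = '<a href="/a.html">a</a> <img src="b.png">';  # assumed document

  my @links;
  my $extor = HTML::LinkExtor->new(sub {
      my($tag, %attrs) = @_;
      push @links, values %attrs;    # collect raw href/src/... values
  });
  $extor->parse($html);
  $extor->eof;

  # Resolve relative links against the base URL.
  @links = map { URI->new_abs($_, $base)->as_string } @links;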
Uses LWP::UserAgent to fetch a page, deals with error conditions, and calls fetched() if there wasn't an error.
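The flow is roughly the shape sketched below; this is illustrative only, not the module's actual source, and the fetch_one name is hypothetical (the $page fields and the fetched()/error() calls follow the descriptions in this document):

  use LWP::UserAgent;

  # Hypothetical sketch of the fetch-and-dispatch pattern.
  sub fetch_one {
      my($self, $page) = @_;
      my $response = $self->{UA}->get($page->{uri});
      if ($response->is_success) {
          $page->{document} = $response->content;   # body, later handed to parse()
          $self->fetched($page);                    # success: continue the pipeline
      }
      else {
          $self->error($page, $response);           # failure: report via error()
      }
  }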
The new() constructor requires no parameters. It creates the UA member, which holds the LWP::UserAgent object used to fetch pages.
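For example, since the UA member is an ordinary LWP::UserAgent, it can be configured after construction (the timeout and agent values below are arbitrary):

  use WWW::Crawler::LWP;

  my $crawler = WWW::Crawler::LWP->new();
  $crawler->{UA}->timeout(30);              # give slow servers 30 seconds
  $crawler->{UA}->agent("ExampleBot/0.1");  # identify the crawler to servers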
Philip Gwyn <perl@pied.nu>
WWW::Crawler, LWP::UserAgent.