WWW::Crawler::LWP - A web crawler that uses LWP and HTML::LinkExtor


    package My::Crawler;
    use WWW::Crawler::LWP;
    # @ISA must be a package variable; a lexical "my @ISA" sets up no inheritance
    our @ISA=qw(WWW::Crawler::LWP);

    sub new {
        my $package=shift @_;
        my $self=$package->SUPER::new(@_);
        $self->{UA}->agent("My::Crawler 0.1 (Mozilla;1;Linux)");
        return $self;
    }

    # Add the page title to whatever the parent class parsed
    sub parse {
        my($self, $page)=@_;
        my $data=$self->SUPER::parse($page);
        $data->{title}=$1 if $page->{document} =~ m(<title>(.+?)</title>)i;
        return $data;
    }

    sub error {
        my($self, $page, $response)=@_;
        print "$page->{uri} wasn't fetched: ".$response->code."\n";
    }

    sub process {
        my($self, $page)=@_;
        print "Doing something to $page->{parsed}{title}\n";
    }

    # Only follow links that start with $::LINK (which must be lower case)
    sub include {
        my($self, $uri)=@_;
        return unless $::LINK eq lc substr($uri, 0, length($::LINK));
        return $self->SUPER::include($uri);
    }

    package main;

    use vars qw($LINK);

    my $crawler=My::Crawler->new();


WWW::Crawler::LWP is a bare-bones subclass of WWW::Crawler. It should be subclassed for each application.

NOTE: it does not respect robots.txt, it does not restrict its activity to one server, and it does not throttle its requests to any given server.
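Because of this, a subclass will usually want to narrow what include() accepts. The following is a minimal sketch of a same-host check built on the URI module. The helper name same_host is hypothetical and is not part of WWW::Crawler::LWP; it only illustrates one way to keep a crawl on a single server.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper (not part of WWW::Crawler::LWP): true when $uri is
# an http or https link on the same host as $start.
sub same_host {
    my($start, $uri)=@_;
    my $s=URI->new($start)->canonical;
    my $u=URI->new($uri)->canonical;
    foreach my $x ($s, $u) {
        # Relative links and schemes such as mailto: have no usable host
        return 0 unless $x->scheme and $x->scheme =~ /^https?$/;
        return 0 unless defined $x->host;
    }
    return lc($s->host) eq lc($u->host) ? 1 : 0;
}
```

An include() override could then begin with "return unless same_host($::LINK, $uri);" before deferring to SUPER::include().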


error($self, $page, $response)

Called when an error occurs while fetching a URI. $response is the HTTP::Response object. The default implementation does nothing. Override this method if you want to report errors somewhere.
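As an illustration of what an override might report, here is a small sketch that formats a message from the two arguments error() receives. The helper name error_report and the example URI are assumptions made for the example; status_line() is a standard HTTP::Response method.

```perl
use strict;
use warnings;
use HTTP::Response;

# Hypothetical helper: build a one-line report from the page hash and the
# HTTP::Response object that error() is given.
sub error_report {
    my($page, $response)=@_;
    return sprintf("%s wasn't fetched: %s",
                   $page->{uri}, $response->status_line);
}

# Stand-in response; in a crawler it would come from the failed fetch
my $response=HTTP::Response->new(404, "Not Found");
print error_report({uri=>"http://example.com/missing"}, $response), "\n";
```

Inside a subclass, error() could pass the returned string to warn(), a log file, or whatever reporting channel the application uses.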

extract_links($self, $page)

Uses HTML::LinkExtor to extract all the links from a page.
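For readers unfamiliar with HTML::LinkExtor, the sketch below shows the module on its own. It is not the code used by extract_links(); the helper name a_links and the sample HTML are made up for the example. When a base URI is supplied, HTML::LinkExtor returns absolute URI objects.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Hypothetical helper: return the absolute href of every <a> tag in $html.
sub a_links {
    my($html, $base)=@_;
    # No callback is given, so the links accumulate inside the object
    my $extor=HTML::LinkExtor->new(undef, $base);
    $extor->parse($html);
    $extor->eof;
    my @found;
    foreach my $link ($extor->links) {
        my($tag, %attr)=@$link;          # [tag, attr => url, ...]
        next unless $tag eq 'a' and defined $attr{href};
        push @found, "$attr{href}";      # stringify the URI object
    }
    return @found;
}

my $html='<a href="/about.html">About</a> <img src="logo.png"> '
        .'<a href="http://example.org/">Elsewhere</a>';
print "$_\n" for a_links($html, "http://example.com/index.html");
```

Here the image link is skipped and the relative href is resolved against the base URI.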

fetch($self, $page)

Uses LWP::UserAgent to fetch a page. Deals with error conditions and calls fetched() if there wasn't an error.
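The success/failure split described here follows the usual LWP::UserAgent pattern. The sketch below illustrates that pattern under stated assumptions; it is not the module's own fetch(). The helper name fetch_page and its two callback parameters are hypothetical stand-ins for fetched() and error().

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: fetch $uri with $ua and hand the result to one of
# two callbacks, depending on whether the request succeeded.
sub fetch_page {
    my($ua, $uri, $on_success, $on_error)=@_;
    my $response=$ua->get($uri);
    if($response->is_success) {
        # Pass the decoded body on for parsing
        return $on_success->($uri, $response->decoded_content);
    }
    # Anything else (4xx, 5xx, connection failure) goes to the error hook
    return $on_error->($uri, $response);
}
```

A caller would pass an LWP::UserAgent object, the URI to fetch, and two code references, for example one that stores the document and one that prints $response->status_line.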


The constructor requires no parameters.

Creates the following members:

UA

An LWP::UserAgent object. You can call methods on this member to get or set any parameters you need, such as from(), cookie_jar() and credentials().
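The following sketch shows the kind of configuration this allows. A standalone LWP::UserAgent stands in for the crawler's UA member so the example runs on its own; the e-mail address, host, realm, user name and password are placeholders.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Stand-in for $crawler->{UA}; in a subclass you would use $self->{UA}
my $ua=LWP::UserAgent->new;

$ua->agent("My::Crawler 0.1");             # User-Agent header
$ua->from('crawler-admin@example.com');    # From header (placeholder address)
$ua->timeout(30);                          # seconds before a request is abandoned
$ua->cookie_jar({});                       # keep cookies in memory for this run
# Credentials for one host:port and authentication realm (all placeholders)
$ua->credentials("www.example.com:80", "Members", "user", "secret");
```

In a subclass, the natural place for these calls is new(), right after SUPER::new() returns.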


Philip Gwyn


WWW::Crawler, LWP::UserAgent.