NAME

WWW::Crawler::LWP - A web crawler that uses LWP and HTML::LinkExtor


SYNOPSIS

    package My::Crawler;
    
    use WWW::Crawler::LWP;
    our @ISA=qw(WWW::Crawler::LWP);

    sub new
    {
        my $package=shift @_;
        my $self=$package->SUPER::new(@_);
        $self->{UA}->agent("My::Crawler 0.1 (Mozilla;1;Linux)");
        return $self;
    }

    sub parse 
    {
        my($self, $page)=@_;
        my $data=$self->SUPER::parse($page);
        $data->{title}=$1 if $page->{document} =~ m(<title>(.+?)</title>)i;
        return $data;
    }

    sub error
    {
        my($self, $page, $response)=@_;
        print "$page->{uri} wasn't fetched: ".$response->code."\n";
    }

    sub process 
    {
        my($self, $page)=@_;
        print "Doing something to $page->{parsed}{title}\n";
    }

    sub include 
    {
        my($self, $uri)=@_;
        return unless $::LINK eq lc substr($uri, 0, length($::LINK));
        return $self->SUPER::include($uri);
    }

    package main;

    use vars qw($LINK);
    $LINK="http://www.yahoo.com/";;

    my $crawler=My::Crawler->new();
    $crawler->schedule_link($LINK);
    $crawler->run();


DESCRIPTION

WWW::Crawler::LWP is a barebones subclass of WWW::Crawler. It should be subclassed for each application.

NOTE: it does not respect robots.txt, nor does it restrict its activity to one server, nor does it act kindly toward any given server.


METHODS


error($self, $page, $response)

Called when an error occurs while fetching a URI. $response is the HTTP::Response object. The default is to do nothing. You should overload this if you want to report errors somewhere.
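
For example, a subclass might log failures like this (a minimal sketch; where the report goes is up to you):

    sub error
    {
        my($self, $page, $response)=@_;
        # status_line() gives text like "404 Not Found"
        warn "Failed to fetch $page->{uri}: ", $response->status_line, "\n";
    }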


extract_links($self, $page)

Uses HTML::LinkExtor to extract all the links from a page.
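
Roughly speaking, the extraction works along these lines (an illustration only, not the module's actual code):

    use HTML::LinkExtor;
    use URI;

    my @links;
    my $extor=HTML::LinkExtor->new(sub {
        my($tag, %attr)=@_;
        push @links, values %attr;      # href, src, etc.
    });
    $extor->parse($page->{document});

    # resolve relative links against the page's own URI
    @links=map { URI->new_abs($_, $page->{uri})->as_string } @links;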


fetch($self, $page)

Uses LWP::UserAgent to fetch a page. Deals with error conditions and calls fetched() if the fetch succeeded.
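
The logic is roughly as follows (illustrative only; the exact arguments passed to fetched() may differ):

    sub fetch
    {
        my($self, $page)=@_;
        my $response=$self->{UA}->get($page->{uri});
        if($response->is_success) {
            $page->{document}=$response->content;
            $self->fetched($page);          # hand the document back to the framework
        }
        else {
            $self->error($page, $response);
        }
    }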


new($package)

Constructor requires no parameters.

Creates the following members:

UA
    An LWP::UserAgent object. You can call methods on this member to get/set any parameters you need, such as from(), cookie_jar() and credentials().
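
For instance, you might configure the user agent right after construction (the host, realm and credentials below are placeholders):

    my $crawler=My::Crawler->new();
    $crawler->{UA}->from('webmaster@example.com');        # sets the From: header
    $crawler->{UA}->cookie_jar({ file => 'crawler-cookies.txt' });
    $crawler->{UA}->credentials('www.example.com:80', 'Some Realm',
                                'username', 'password');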


AUTHOR

Philip Gwyn <perl@pied.nu>


SEE ALSO

WWW::Crawler, LWP::UserAgent.