NAME

WWW::Crawler - Unified framework for web crawlers


SYNOPSIS

    package My::Crawler;
    use WWW::Crawler;
    use LWP::Simple qw(get);            # provides the get() used in fetch()
    our @ISA=qw(WWW::Crawler);          # My::Crawler inherits from WWW::Crawler

    sub fetch
    {
        my($self, $page)=@_;
        $page->{document}=get($page->{uri});
        $self->fetched($page);
    }
    
    sub parse
    {
        my($self, $page)=@_;
        my %data;
        $data{links}=[$page->{document} =~ m(href="(.+?)")ig];
        $data{title}=$1 if $page->{document} =~ m(<title>(.+?)</title>)i;
        return \%data;
    }
    
    sub extract_links
    {
        my($self, $page)=@_;
        return @{$page->{parsed}{links}};
    }

    sub process
    {
        my($self, $page)=@_;
        print "Doing something to $page->{parsed}{title}\n";
    }

    package main;
    
    my $crawler=My::Crawler->new();

    $crawler->schedule_link("http://www.yahoo.com/");
    $crawler->run;

Obviously, this example is very bad. It doesn't respect robots.txt, nor does it check to make sure you are only crawling one host, or anything else. Running it would be very bad.


DESCRIPTION

WWW::Crawler is intended as a unified framework for web crawlers. It should be subclassed for each application.


METHODS


cannonical

Turns a URI into its canonical form. Known host equivalences (localhost is the same as localhost.localdomain; www.slashdot.org and slashdot.org are the same) should be dealt with here.

The default method simply removes internal anchors (page.html#foo is in fact page.html) and URI parameters (page.html?foo=bar is in fact page.html).
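As an illustration, the default behaviour could be written like this with the URI module (a sketch, not necessarily the module's exact code):

    use URI;

    sub cannonical
    {
        my($self, $uri)=@_;
        $uri=URI->new($uri);
        $uri->fragment(undef);      # page.html#foo is in fact page.html
        $uri->query(undef);         # page.html?foo=bar is in fact page.html
        return $uri->canonical->as_string;
    }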


error($self, $page, $error)

Called when an error occurs while fetching a URI. $error is whatever fetch() sets it to. The default is to do nothing. You should overload this if you want to report errors somewhere. Having a generalised error mechanism like this allows things like WWW::Crawler::RobotsRules to cooperate with various fetch() routines cleanly.
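A minimal overload that just reports errors on STDERR might look like this:

    sub error
    {
        my($self, $page, $error)=@_;
        warn "Error fetching $page->{uri}: $error\n";
    }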


extract_links($self, $page)

Returns an array of absolute URIs of all the links contained in a given page. URIs should be in full form (i.e. http://server.com/yadda/yadda/yadda.html) or URI objects. Use $page->{uri} as a base URI for relative links. We can't do this in cannonical(), because it doesn't know the base URI a link was extracted from.

Must be overloaded.
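Building on the parse() from the SYNOPSIS (which stores raw links in $page->{parsed}{links}), an overload could resolve relative links like this:

    use URI;

    sub extract_links
    {
        my($self, $page)=@_;
        # Resolve each link against the URI of the page it was found on.
        return map { URI->new_abs($_, $page->{uri}) } @{$page->{parsed}{links}};
    }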


fetch($self, $page)

Should fetch the requested URI ($page->{uri}), set $page->{header} (if applicable and needed) and $page->{document}, then call $self->fetched($page). If there was an error, you should call $self->fetched($page, $error), where $error is something describing the error (see error() above).

Must be overloaded.
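One possible overload, sketched with LWP::UserAgent (any HTTP client would do; WWW::Crawler::LWP presumably does something along these lines):

    use LWP::UserAgent;

    sub fetch
    {
        my($self, $page)=@_;
        my $ua=LWP::UserAgent->new;
        my $resp=$ua->get($page->{uri});
        if($resp->is_success) {
            $page->{header}=$resp->headers;     # an HTTP::Headers object
            $page->{document}=$resp->content;
            $self->fetched($page);
        } else {
            $self->fetched($page, $resp->status_line);
        }
    }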


fetched($self, $page)

This is where the document is processed, links are extracted and so on. $page must contain the following members: document and uri.
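Pieced together from fetch(), error() and the pseudo-code at the end of this document, the flow is roughly this (a sketch, not the module's actual source; the error dispatch in particular is an assumption):

    sub fetched
    {
        my($self, $page, $error)=@_;
        return $self->error($page, $error) if defined $error;  # assumption
        $page->{parsed}=$self->parse($page);
        $self->process($page);
        $self->seen($page->{uri});
        $self->schedule_link($_) foreach $self->extract_links($page);
    }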


include($self, $uri)

Returns true if the $uri should be scheduled.
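A typical overload cooperates with seen() and restricts the crawl, say to a single (hypothetical) host:

    sub include
    {
        my($self, $uri)=@_;
        return 0 if $self->{ALREADY}{$uri};     # seen() has marked it visited
        return 0 unless $uri =~ m(^http://www\.example\.com/);
        return 1;
    }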


new($package)

Constructor. Overload as needed. Please call SUPER::new() as well if you are using the default schedule_link()/next_link()/include(), because they need the object members listed below. A subclass constructor is sketched after the list.

Default constructor requires no parameters.

Creates the following members:

ALREADY

Hashref of URIs that have already been visited.

TODO

Arrayref (FIFO) of URIs that need to be processed.
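A subclass constructor might look like this (the host parameter and HOST member are hypothetical):

    sub new
    {
        my($package, %params)=@_;
        my $self=$package->SUPER::new();    # creates ALREADY and TODO
        $self->{HOST}=$params{host};        # hypothetical application member
        return $self;
    }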


next_link($self)

Returns a URI that should be fetched and processed. Returning an empty string means no more URIs are currently known, but we still want to keep going. Returning undef() means all the work has been done and now we go home.


parse($self, $page)

Parses an HTML document ($page->{document}) and returns a hashref of extracted data, which ends up in $page->{parsed} to be used later by process() and/or extract_links().


process($self, $page)

This is where an application does its own work. All members of $page should be set by this point.

Must be overloaded.


run($self)

Main processing loop. Does not exit until next_link() returns undef().

Overload this method to fit it into your own event loop.
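For reference, the default behaviour amounts to something like this sketch (the one-second nap on an empty string is an assumption):

    sub run
    {
        my($self)=@_;
        while(defined(my $uri=$self->next_link())) {
            if($uri eq '') {    # nothing ready yet, but not done either
                sleep 1;
                next;
            }
            $self->fetch({uri=>$uri});
        }
    }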


schedule_link($self, $uri)

Add $uri to the todo list. Must cooperate with next_link() and add_link() to get the job done. If you wanted to go easy on a server's bandwidth, this is where you'd put the logic. Something like:

    use URI;

    sub schedule_link
    {
        my($self, $uri)=@_;

        $uri=$self->cannonical($uri);
        return unless $self->include($uri);
        my $host=URI->new($uri)->host();

        # Queue the URI per host, and note when we may next hit that host.
        $self->{SERVERS_TIME}{$host}||=time;
        push @{$self->{SERVERS}{$host}}, $uri;
    }

    sub next_link
    {
        my($self)=@_;
        my $now=time;
        # For each host whose time has come, move one URI onto the TODO
        # list and push that host's next slot back by a second.
        foreach my $host (grep {$self->{SERVERS_TIME}{$_} <= $now}
                                keys %{$self->{SERVERS_TIME}}) {

            if(@{$self->{SERVERS}{$host}}) {
                push @{$self->{TODO}}, shift @{$self->{SERVERS}{$host}};
                $self->{SERVERS_TIME}{$host}=$now+1;
            } else {
                delete $self->{SERVERS}{$host};
                delete $self->{SERVERS_TIME}{$host};
            }
        }
        my $next=shift @{$self->{TODO}};

        # Empty string: nothing ready right now, but hosts are still pending.
        return '' if not $next and keys %{$self->{SERVERS_TIME}};
        return $next;
    }


seen($self, $uri)

seen() is called for each URI that is being processed. This method should cooperate with include() to avoid fetching the same URI twice.
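With the default object members, a minimal seen() could simply mark the URI in the ALREADY hash that include() consults (a sketch, not necessarily the default implementation):

    sub seen
    {
        my($self, $uri)=@_;
        $self->{ALREADY}{$uri}=1;   # include() can now reject this URI
    }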


$page

$page is a hashref that is passed to many routines. It contains various information about a given page.

uri

URI of the page. Set by run().

header

HTTP header. Set by parse() and/or fetch().

document

Document contents. Set by fetch().

parsed

Contains the data returned by parse().


OVERLOADING

The following methods must be overloaded: fetch(), process(), extract_links().

Object members should be created in new() and documented in the POD.


Whoa, I'm confused.

So am I!

Anyway, here is a pseudo-code version of what is going on:

    schedule_link(with a given URI) # prime the pump

    run {
        while(next_link() returns defined) {
            fetch($link) {
                fetched($page now has document (and maybe header))
                parse($page)
                process($page)
                seen($page)
                foreach (extract_links($page)) {
                    schedule_link($new_link) {
                        cannonical($new_link)
                        if(include($link)) {
                            # add to the todo list, so next_link() sees it
                        }     
                    }
                }
            }
        }
    }


AUTHOR

Philip Gwyn <perl@pied.nu>


SEE ALSO

WWW::Crawler::RobotsRules, WWW::Crawler::Slower, WWW::Crawler::LWP, WWW::Crawler::POE, perl(1).