WWW::Crawler - Unified framework for web crawlers
    package My::Crawler;
    use WWW::Crawler;
    use LWP::Simple qw(get);            # get() is used in fetch() below

    our @ISA = qw(WWW::Crawler);        # subclass the framework

    sub fetch {
        my($self, $page)=@_;
        $page->{document}=get($page->{uri});
        $self->fetched($page);
    }

    sub parse {
        my($self, $page)=@_;
        my %data;
        $data{links}=[$page->{document} =~ m(href="(.+?)")ig];
        $data{title}=$1 if $page->{document} =~ m(<title>(.+?)</title>)i;
        return \%data;
    }

    sub extract_links {
        my($self, $page)=@_;
        return @{$page->{parsed}{links}};
    }

    sub process {
        my($self, $page)=@_;
        print "Doing something to $page->{parsed}{title}\n";
    }

    package main;

    my $crawler=My::Crawler->new();
    $crawler->schedule_link("http://www.yahoo.com/");
    $crawler->run;
Obviously, this example is very bad. It doesn't respect robots.txt, nor does it check to make sure you are only crawling one host or anything. Running it would be very bad.
WWW::Crawler is intended as a unified framework for web crawlers. It should be subclassed for each application.
cannonical() turns a URI into its canonical form. Known host equivalents (localhost is the same as localhost.localdomain, or www.slashdot.org and slashdot.org are the same) should be dealt with here.
The default method simply removes internal anchors (page.html#foo is in fact page.html) and URI parameters (page.html?foo=bar is in fact page.html).
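If your application needs the host-folding described above, you could overload cannonical() along these lines. This is only a sketch: the %ALIAS table and the use of the URI module are illustrative assumptions, not part of the framework.

    use URI;

    # Known host aliases; purely illustrative.
    my %ALIAS = (
        'localhost.localdomain' => 'localhost',
        'www.slashdot.org'      => 'slashdot.org',
    );

    sub cannonical {
        my($self, $uri) = @_;
        $uri = $self->SUPER::cannonical($uri);   # default: drop #anchors and ?parameters
        my $u = URI->new($uri);
        if($u->can('host') and $u->host and exists $ALIAS{lc $u->host}) {
            $u->host($ALIAS{lc $u->host});       # fold equivalent hosts together
        }
        return "$u";                             # hand back a plain string
    }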
Called when an error occurs while fetching a URI. $error is whatever fetch() sets it to. The default is to do nothing. You should overload this if you want to report errors somewhere. Having a generalised error mechanism like this allows things like WWW::Crawler::RobotsRules to cooperate with various fetch() routines cleanly.
extract_links() returns an array of absolute URIs of all the links contained in a given page. URIs should be in full form (i.e. http://server.com/yadda/yadda/yadda.html) or URI objects. Use $page->{uri} as a base URI for relative links. We can't do this in cannonical(), because it doesn't know the base URI a link was extracted from.
Must be overloaded.
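Here is a sketch of an extract_links() that resolves raw href values against $page->{uri}. It assumes parse() stored the hrefs in $page->{parsed}{links} as in the SYNOPSIS above; that layout is a convention of the example, not something the framework mandates.

    use URI;

    sub extract_links {
        my($self, $page) = @_;
        # turn each (possibly relative) href into an absolute URI string
        return map { URI->new_abs($_, $page->{uri})->as_string }
               @{ $page->{parsed}{links} || [] };
    }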
fetch() should fetch the requested URI ($page->{uri}), set $page->{header} (if applicable and needed) and $page->{document}, then call $self->fetched($page). If there was an error, you should call $self->fetched($page, {...something to do with the error}) instead.
Must be overloaded.
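As an illustration of that contract only (the bundled WWW::Crawler::LWP presumably does the real work), a fetch() built on LWP::UserAgent might look like this. The UA member is an assumption of the sketch.

    use LWP::UserAgent;

    sub fetch {
        my($self, $page) = @_;
        my $ua   = $self->{UA} ||= LWP::UserAgent->new;   # reuse one user agent
        my $resp = $ua->get($page->{uri});
        if($resp->is_success) {
            $page->{header}   = $resp->headers;           # HTTP::Headers object
            $page->{document} = $resp->content;
            $self->fetched($page);
        }
        else {
            # pass along something to do with the error, as described above
            $self->fetched($page, { code => $resp->code, message => $resp->message });
        }
    }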
fetched() is where the document is processed, links are extracted and so on. The page must contain the following members: document and uri.
include() returns true if the $uri should be scheduled.
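A possible include() for the single-host, no-repeats policy the SYNOPSIS warns is missing. HOST and SEEN are illustrative members, set up in the constructor sketch a little further down; nothing here is part of WWW::Crawler itself.

    use URI;

    sub include {
        my($self, $uri) = @_;
        return 0 if $self->{SEEN}{$uri};              # already scheduled once
        my $u = URI->new($uri);
        return 0 unless $u->can('host') and $u->host; # skip mailto:, etc.
        return lc($u->host) eq lc($self->{HOST});     # stay on one host
    }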
Constructor. Overload as needed. Please call SUPER::new() as well if you are using the default schedule_link/next_link/include, because they need package members.
Default constructor requires no parameters.
Creates the following members:
TODO - an arrayref used as a FIFO of URIs that need to be processed.
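For instance, a subclass constructor might look like the sketch below: it calls SUPER::new() so the default schedule_link()/next_link()/include() still get their package members, then adds the illustrative HOST and SEEN members used by the include() sketch above.

    sub new {
        my($package, %params) = @_;
        my $self = $package->SUPER::new();   # default members (TODO, etc.)
        $self->{HOST} = $params{host};       # the one host we allow ourselves to crawl
        $self->{SEEN} = {};                  # URIs we have already scheduled/processed
        return $self;
    }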
next_link() returns a URI that should be fetched and processed. Returning an empty string means no more URIs are known, but we still want to keep going. Returning undef() means all the work has been done and now we go home.
parse() parses an HTML document ($page->{document}) and sets various members of $page to be used later by process() and/or extract_links().
process() is where an application does its own work. All members of $page should be set.
Must be overloaded.
run() is the main processing loop. It does not exit until next_link() returns undef(). Overload this method to fit it into your own event loop.
schedule_link() adds $uri to the todo list. It must cooperate with next_link() and add_link() to get their job done.
If you wanted to go easy on a server's bandwidth, this is where you'd put the logic. Something like:
    sub schedule_link {
        my($self, $uri)=@_;
        $uri=$self->cannonical($uri);
        return unless $self->include($uri);
        my $host=URI->new($uri)->host();
        $self->{SERVERS_TIME}{$host} ||= time;
        push @{$self->{SERVERS}{$host}}, $uri;
    }

    sub next_link {
        my($self)=@_;
        my $now=time;
        foreach my $host (grep { $self->{SERVERS_TIME}{$_} <= $now }
                               keys %{$self->{SERVERS_TIME}}) {
            if(@{$self->{SERVERS}{$host}}) {
                # this host's turn: move one URI to the todo list and
                # leave the host alone for another second
                push @{$self->{TODO}}, shift @{$self->{SERVERS}{$host}};
                $self->{SERVERS_TIME}{$host}=$now+1;
            } else {
                # nothing left for this host; forget about it
                delete $self->{SERVERS}{$host};
                delete $self->{SERVERS_TIME}{$host};
            }
        }
        my $next=shift @{$self->{TODO}};
        # empty string means "nothing right now, but keep running"
        return '' if not $next and keys %{$self->{SERVERS_TIME}};
        return $next;
    }
seen() is called for each URI that is being processed. This method should cooperate with include() to avoid fetching the same URI twice.
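Continuing the illustrative HOST/SEEN example from the include() and new() sketches above, a matching seen() could simply record the URI so include() refuses it from then on. The pseudo-code at the end of this document suggests seen() receives the $page hashref.

    sub seen {
        my($self, $page) = @_;
        $self->{SEEN}{ $page->{uri} } = 1;   # include() now returns false for this URI
    }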
$page is a hashref that is passed to many routines. It contains various information about a given page.
uri - URI of the page. Set by run().
header - HTTP header. Set by parse() and/or fetch().
document - Document contents. Set by fetched().
parsed - Contains the data returned by parse().
The following methods must be overloaded: fetch(), process(), extract_links(). Object members should be created in new() and documented in the POD.
So am I!
Anyway, here is a pseudo-code version of what is going on:
    schedule_link(with a given URI)     # prime the pump

    run {
        while(next_link() returns defined) {
            fetch($link) {
                fetched($page now has document (and maybe header))
                    parse($page)
                    process($page)
                    seen($page)
                    foreach (extract_links($page)) {
                        schedule_link($new_link) {
                            cannonical($new_link)
                            if(include($link)) {
                                # add to the todo list, so next_link() sees it
                            }
                        }
                    }
            }
        }
    }
Philip Gwyn <perl@pied.nu>
WWW::Crawler::RobotsRules, WWW::Crawler::Slower,
WWW::Crawler::LWP, WWW::Crawler::POE, perl(1).