Updated: 2007/01/11
The code below is an application of the latest version of my walker module. Walker.pm is an object-oriented framework for a web robot in Perl. A short manual for using the module follows.
- Walker is still under development -
SYNOPSIS
use strict;
use walker;

my $robot = walker->new(URL => 'http://www.example.com/page/', MAX_REDIRECT => 1);
$robot->agent('Mozilla/5.0');
$robot->walk();
die "Error\n" if ($robot->is_error);

#Get the content
my $html = $robot->get_contents;

#Fetch next site
$robot->site('http://www.example.com/nextpage/');
$robot->walk();

#Get the last-modified date
my $date = $robot->header('last-modified');

#Print all the header fields
my $href = $robot->header;
foreach my $field (keys %{$href}) {
    print '[Field] ', $field, ': ', $href->{$field}, "\n";
}
WALKER.PM
Walker.pm is a class implementing a web robot. Walker supports HTTP/1.1, redirects (301/302/303/307), gzip-compressed pages and cookies (partially). This is a beta version.
CONSTRUCTOR
$robot = walker->new( %options );
This method constructs a new walker object and returns it. Key/value pair arguments may be provided to set up the initial state. The following options correspond to attribute methods described below:
URL - The URL you want to crawl, default empty;
AGENT - The user agent string sent as the "User-Agent" header in the requests;
MAX_FILESIZE - The maximum number of bytes the robot may transfer, default 262144;
MAX_REDIRECT - The number of times a 301, 302, 303 or 307 redirect may be followed automatically by the bot, default 1;
HTTP_REFERER - The initial referrer sent as the "Referer" header in the first request. By default, no referrer is sent.
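
As an illustration of how key/value options can override documented defaults, here is a minimal sketch. It is not the walker.pm source: the package name WalkerSketch, the internal hash layout and the placeholder AGENT string are assumptions made for this example only.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the constructor; NOT the real walker.pm code.
package WalkerSketch;

# Defaults as documented above. The AGENT value is a placeholder, since the
# real default lives in the HTTP_USER_AGENT constant inside walker.pm.
my %DEFAULTS = (
    URL          => '',
    AGENT        => 'walker-sketch/0.1',
    MAX_FILESIZE => 262144,
    MAX_REDIRECT => 1,
    HTTP_REFERER => '',
);

sub new {
    my ($class, %options) = @_;
    # Options passed by the caller override the defaults.
    my $self = { %DEFAULTS, %options };
    return bless $self, $class;
}

package main;

my $robot = WalkerSketch->new(
    URL          => 'http://www.example.com/page/',
    MAX_REDIRECT => 3,
);
print "$_ = $robot->{$_}\n" for sort keys %$robot;
```

Any option that is not passed keeps its default, so in this sketch MAX_FILESIZE stays at 262144 while MAX_REDIRECT becomes 3.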
METHODS
site
$robot->site;
$robot->site( $url );
Get the current URL or set the URL to crawl.
agent
$robot->agent;
$robot->agent( $product_id );
Get/set the product token that is used to identify the user agent on the network. The agent value is sent as the "User-Agent" header in the requests. If $product_id is undef the default is used. The default is the constant HTTP_USER_AGENT specified in the walker.pm file.
cookie
$robot->cookie;
$robot->cookie( $value );
Get the current value of the Set-Cookie response header or set the Cookie request header for a crawl. If the response contains duplicate Set-Cookie fields, the multiple values are separated by '; '.
referer
$robot->referer;
$robot->referer( $url );
The referer method lets you get or set the value that is sent as the "Referer" header in the requests. (Note: the header field name is a misspelling of "referrer"; see RFC 2616, section 14.36.)
max_redirect
$robot->max_redirect;
$robot->max_redirect( $value );
Get/set the maximum number of times a 301, 302, 303 or 307 redirect may be followed automatically by the bot. A value of 0 means no redirects will be followed. The method is_redirect indicates whether a redirect was followed during the crawl. If the maximum limit is reached, walker gives a "Redirect limit exceeded" error.
walk
$robot->walk();
Performs a GET request based on the given URL and the initial AGENT and HTTP_REFERER values. A maximum of MAX_FILESIZE bytes will be transferred.
If the HTTP status code is a 301, 302, 303 or 307 redirect and MAX_REDIRECT > 0, the redirect will be followed automatically.
The method is_redirect indicates if redirection occurred during the crawl.
The method is_error indicates if the response was an error.
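
The redirect handling described above can be sketched as a small loop. This is only an approximation under stated assumptions: the canned %FAKE_WEB table replaces real HTTP traffic, walk_sketch and its return structure are hypothetical names, and treating an unfollowed redirect (MAX_REDIRECT = 0) as a non-error is a guess, not documented walker behaviour.

```perl
use strict;
use warnings;

# Canned responses standing in for real HTTP traffic (hypothetical data).
my %FAKE_WEB = (
    'http://www.example.com/a' => { status => 301, location => 'http://www.example.com/b' },
    'http://www.example.com/b' => { status => 302, location => 'http://www.example.com/c' },
    'http://www.example.com/c' => { status => 200 },
);

# Status codes that count as a redirect, as listed in the manual.
my %IS_REDIRECT = map { $_ => 1 } (301, 302, 303, 307);

# Follow at most $max_redirect redirects starting at $url.
# Returns a hash ref with the final url, status, redirect flag and error text.
sub walk_sketch {
    my ($url, $max_redirect) = @_;
    my $followed = 0;
    while (1) {
        # Unknown URLs are treated as 404 in this simulation.
        my $res = $FAKE_WEB{$url} || { status => 404 };
        if ($IS_REDIRECT{ $res->{status} } && $max_redirect > 0) {
            if ($followed >= $max_redirect) {
                return { url => $url, status => $res->{status},
                         is_redirect => 1, error => 'Redirect limit exceeded' };
            }
            $url = $res->{location};
            $followed++;
            next;
        }
        my $error = $res->{status} >= 400 ? "HTTP $res->{status}" : '';
        return { url => $url, status => $res->{status},
                 is_redirect => ($followed > 0 ? 1 : 0), error => $error };
    }
}

my $result = walk_sketch('http://www.example.com/a', 2);
print "final url: $result->{url}, status: $result->{status}\n";
```

With a limit of 2 the sketch ends at the final page; with a limit of 1 it stops at the second redirect and reports "Redirect limit exceeded".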
is_error
$robot->is_error;
Returns true if an error occurred during the last walk, like a 404 response.
is_redirect
$robot->is_redirect;
Returns true if a redirect was followed during the last walk. This method is useful for detecting internal URL changes when redirects are followed automatically. Call the $robot->site method to get the new URL.
get_status
$robot->get_status;
Returns the HTTP status code. The code is a three-digit number that encodes the overall outcome of an HTTP response.
get_message
$robot->get_message;
Returns the HTTP status message. The message is a short human readable single line string that explains the response code.
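
To make the relation between status code and status message concrete, the sketch below splits an HTTP status line into both parts. The helper name parse_status_line is hypothetical, and whether walker.pm parses the line this way is an assumption.

```perl
use strict;
use warnings;

# Split an HTTP status line into its three-digit code and reason phrase.
# Returns an empty list if the line does not look like a status line.
sub parse_status_line {
    my ($line) = @_;
    if ($line =~ m{^HTTP/\d+\.\d+\s+(\d{3})(?:\s+(.*?))?\s*$}) {
        return ($1, defined $2 ? $2 : '');
    }
    return;
}

my ($code, $message) = parse_status_line("HTTP/1.1 404 Not Found\r\n");
print "status: $code, message: $message\n";
```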
get_contents
$robot->get_contents;
Returns the HTML associated with the crawled URL.
get_charset
$robot->get_charset;
Returns the charset of the HTML associated with the crawled URL (taken from the Content-Type response header).
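
As a rough illustration of how a charset can be read from a Content-Type value, consider the sketch below. The function name charset_from_content_type is hypothetical, and lower-casing the result is a choice made for this example; the manual does not say how walker normalises the value.

```perl
use strict;
use warnings;

# Pull the charset parameter out of a Content-Type header value.
# Returns an empty string when no charset parameter is present.
sub charset_from_content_type {
    my ($content_type) = @_;
    return '' unless defined $content_type;
    if ($content_type =~ /;\s*charset\s*=\s*"?([^\s";]+)"?/i) {
        return lc $1;    # lower-casing is an assumption of this sketch
    }
    return '';
}

print charset_from_content_type('text/html; charset=ISO-8859-1'), "\n";
```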
header
$robot->header;
$robot->header( $field );
This method returns a reference to a hash with the server's response header.
If $field is provided, this method returns the value of a header field. If the response did not contain the field, this method returns an empty string. If the response contains duplicate fields, the multiple values are separated by '; '.
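
The header behaviour described above (duplicate fields joined with '; ', empty string for a missing field) can be sketched as follows. The helper names parse_headers and header_field are hypothetical, and storing field names in lower case is an assumption suggested only by the 'last-modified' lookup in the synopsis.

```perl
use strict;
use warnings;

# Build a hash of response headers from raw header lines.
# Field names are lower-cased; duplicate fields are joined with '; '.
sub parse_headers {
    my (@lines) = @_;
    my %header;
    foreach my $line (@lines) {
        next unless $line =~ /^([^:\s]+):\s*(.*?)\s*$/;
        my ($field, $value) = (lc $1, $2);
        $header{$field} = exists $header{$field}
            ? "$header{$field}; $value"
            : $value;
    }
    return \%header;
}

# Look up one field; a missing field yields an empty string.
sub header_field {
    my ($headers, $field) = @_;
    my $value = $headers->{ lc $field };
    return defined $value ? $value : '';
}

my $headers = parse_headers(
    'Content-Type: text/html',
    'Set-Cookie: a=1',
    'Set-Cookie: b=2',
);
print header_field($headers, 'Set-Cookie'), "\n";
```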
links
$robot->links( \%links );
Extracts the links (A HREF) from the crawled HTML and returns a reference to a hash with the links. Each key of the hash is an absolute URL; each value is the number of times that link was found.
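
A minimal sketch of this kind of link counting is shown below, using only core Perl. It is not the walker.pm implementation: count_links and absolutize are hypothetical helpers, the regex only sees quoted href values, and the resolver is deliberately naive (it does not normalise ".." segments or handle query-only links correctly).

```perl
use strict;
use warnings;

# Naive resolver: handles absolute http(s) URLs, root-relative paths and
# plain relative paths. Fragment-only links and other schemes are skipped.
sub absolutize {
    my ($link, $base) = @_;
    return $link if $link =~ m{^https?://}i;
    return if $link =~ m{^(?:#|[a-z][a-z0-9+.-]*:)}i;   # "#top", "mailto:", ...
    my ($origin) = $base =~ m{^(https?://[^/]+)}i or return;
    return $origin . $link if $link =~ m{^/};
    (my $dir = $base) =~ s{[^/]*$}{};                    # strip last path segment
    $dir = "$origin/" if length($dir) < length($origin) + 1;
    return $dir . $link;
}

# Count how often each absolute URL occurs in the A HREF attributes of $html.
sub count_links {
    my ($html, $base) = @_;
    my %links;
    while ($html =~ m{<a\b[^>]*?\bhref\s*=\s*(["'])(.*?)\1}gis) {
        my $href = $2;
        my $url  = absolutize($href, $base);
        $links{$url}++ if defined $url;
    }
    return \%links;
}

my $html = '<a href="/about">About</a> <a href="news.html">News</a> '
         . '<A HREF="/about">Again</A> <a href="mailto:me@example.com">Mail</a>';
my $links = count_links($html, 'http://www.example.com/page/');
print "$_ => $links->{$_}\n" for sort keys %$links;
```

For production use, a real HTML parser and a proper URL resolver would be more robust than the regex and string handling used here.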
BUGS
Walker.pm is still under development. If you find a bug, please feel free to report it by leaving a comment. Include your Perl snippets with
[geshi lang=perl ln=n] ..perl code here [/geshi].
COPYRIGHT
Copyright © 2006, 2007 www.alles-over.com
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.