Web::Scraper

Lately I’ve been using the nifty Web::Scraper by the prolific Tatsuhiko Miyagawa. It exposes a compact DSL of three words, process, process_first and result, to scrape sites based on XPath expressions or CSS selectors:


use strict;
use warnings;

use Data::Dumper;
use Web::Scraper;
use URI;

# Find the <title> element and place its content
# in a hash reference with the key 'title'

my $scraper = scraper {
    process '//title', title => 'TEXT';
};

# Passing a URI object makes the scraper fetch the page itself
my $data = $scraper->scrape(
    URI->new('http://www.google.com')
);

warn Dumper $data;

This gets you:

$VAR1 = {
          'title' => 'Google'
        };

You can also have it return an arrayref for an element, pass in callbacks, or nest scrapers.
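
Here is a rough sketch of those three features together. The HTML snippet and the key names (titles, shout, entries, name, rank) are invented for illustration, and it scrapes a string rather than a live site so nothing depends on the network:

```perl
use strict;
use warnings;

use Data::Dumper;
use Web::Scraper;

# Made-up markup, just enough to exercise the DSL
my $html = <<'HTML';
<html><body>
<ul>
  <li class="entry"><a>First</a> <span class="n">1</span></li>
  <li class="entry"><a>Second</a> <span class="n">2</span></li>
</ul>
</body></html>
HTML

my $scraper = scraper {
    # A trailing [] on the key collects every match into an arrayref
    process 'li.entry a', 'titles[]' => 'TEXT';

    # A callback receives the matched HTML::Element as $_[0] (and $_)
    process_first 'li.entry a', shout => sub { uc $_[0]->as_text };

    # A nested scraper produces one hashref per matched node
    process 'li.entry', 'entries[]' => scraper {
        process 'a',      name => 'TEXT';
        process 'span.n', rank => 'TEXT';
    };
};

# scrape() also accepts a string of HTML instead of a URI object
my $data = $scraper->scrape($html);

print Dumper $data;
```

The nested scraper is the one I reach for most: each li becomes its own little hash, which saves zipping parallel arrays back together afterwards.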

I forked the code and added a hashref option, nice when paired with a callback that returns a hash.

scraper (a disguised constructor) and the keywords are exported into the caller’s namespace. If the potential for collision concerns you, wrap up scraper instantiation in a separate module.
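
One way to do that wrapping, as a minimal sketch: My::TitleScraper and new_scraper are hypothetical names, and in real code the package would live in its own .pm file rather than in a block in the same script.

```perl
use strict;
use warnings;

{
    # Hypothetical wrapper module; the DSL keywords are imported here only
    package My::TitleScraper;
    use Web::Scraper;

    # Builds and returns a fresh scraper object
    sub new_scraper {
        return scraper {
            process '//title', title => 'TEXT';
        };
    }
}

package main;

# main never imports scraper, process, process_first or result
my $scraper = My::TitleScraper->new_scraper;
my $data    = $scraper->scrape(
    '<html><head><title>Hello</title></head><body></body></html>'
);

print $data->{title}, "\n";
```

The calling code only ever sees an object with a scrape method, so whatever else it defines under those names is left alone.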

The documentation is a bit thin, so also check out Miyagawa’s presentation slides.
