Easily retrieve web content with Perl

A demonstration of the combined power of the WWW::Mechanize and HTML::TreeBuilder Perl modules: automatically inserting a subject cloud on the Koha OPAC's homepage

I am writing this note after recently discovering two Perl modules that, when combined, prove surprisingly useful for retrieving most of the content available on the World Wide Web. WWW::Mechanize (or Mech for short) is a Perl module for automating interaction with websites. It is a feature-rich module that supports, among other things, SSL and HTTP authentication. HTML::TreeBuilder, for its part, is an HTML parser. It lets you extract specific content from an HTML page.

These modules offer too many possibilities to describe in a single article, so we will demonstrate them with a simple example: a script that fills the OpacNavRight system preference with the subject cloud.

Prerequisites. Because this script changes a Koha system preference, it must be placed on the Koha server. The Perl modules WWW::Mechanize (libwww-mechanize-perl Debian package) and HTML::TreeBuilder (libhtml-tree-perl Debian package) must also be installed.
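Before going further, it can be handy to confirm that both modules are actually installed. The following is only a minimal sketch: the helper name missing_modules is ours, not part of Koha or of either module, and it uses strict and warnings rather than Modern::Perl so that it runs even on a bare Perl installation.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the names of the modules from the
# given list that cannot be loaded on this machine.
sub missing_modules {
    my @modules = @_;
    my @missing;
    for my $module (@modules) {
        # turn Some::Module into Some/Module.pm, the form "require" expects
        ( my $file = "$module.pm" ) =~ s{::}{/}g;
        # "require" dies when the file is not found, so we wrap it in eval
        push @missing, $module unless eval { require $file; 1 };
    }
    return @missing;
}

my @missing = missing_modules( 'WWW::Mechanize', 'HTML::TreeBuilder' );
if (@missing) {
    print "Missing modules: @missing\n";
}
else {
    print "All required modules are installed\n";
}
```

If a module is reported missing, installing the corresponding Debian package mentioned above should be enough.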

Technically. For our example, the script (named getopactagssubject.pl) is placed in the home directory (~). It is short and well commented, so we will let you study it:

#!/usr/bin/perl
use Modern::Perl;
use C4::Context;        # required for Koha's implementation
use WWW::Mechanize;     # required to use WWW::Mechanize
use HTML::TreeBuilder;  # required to use HTML::TreeBuilder

# we put the url to parse in $uri
my $uri = 'http://opac-koha.com/cgi-bin/koha/opac-tags_subject.pl';
# we put the number of subjects to display in $num
my $num = 40;
# $content will contain the result to display
my $content = "";
# we create a new WWW::Mechanize object, named $mech
my $mech = WWW::Mechanize->new();
# we ask $mech to get the html page at $uri; the response goes in $response
my $response = $mech->get( $uri );
# if we've got a response, we submit the page's form as if we had chosen to "show up to 40 subjects"
if ( $response->is_success() ) {
    # There are two text inputs on our page: one to do a catalog search
    # and the other to set the number of subjects to display, which is identified by its name attribute whose value is "number"
    my %fields = ( "number" => $num );
    $response = $mech->submit_form(
        with_fields => \%fields,
    );
    if ( $response->is_success() ) {
        # our response now contains the entire html page, we only want subjects
        # this is why we need HTML::TreeBuilder. So we create its object
        my $tree = HTML::TreeBuilder->new_from_content( $mech->content );
        # subjects are in the div with the id "subjectcloud"
        my $divsubjects = $tree->look_down( _tag => 'div', id => qr/subjectcloud/ );
        # in this div, each subject is an html "a" tag; we put them in @subjects
        # (the list stays empty if the div was not found)
        my @subjects = $divsubjects ? $divsubjects->look_down( _tag => 'a' ) : ();
        # at last, we add each subject, in html format, to our result
        foreach my $subject (@subjects) {
            $content .= $subject->as_HTML . " ";
        }
        # free the memory used by the parsed tree
        $tree->delete;
    }
}
# it only remains to put $content into OpacNavRight
if ($content) {
    my $dbh = C4::Context->dbh;
    # turn off the system preference cache when this Koha version provides the method
    C4::Context->disable_syspref_cache() if C4::Context->can('disable_syspref_cache');
    # placeholders let DBI quote $content, so there is no need to escape quotes by hand
    $dbh->do(
        'UPDATE systempreferences SET value = ? WHERE variable = ?',
        undef, $content, 'OpacNavRight'
    );
}
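The extraction step can be tried on its own, without a Koha server or any network access. The sketch below runs HTML::TreeBuilder on an inline HTML string. That string is a hypothetical, much simplified stand-in for the real OPAC page, and subject_links is a helper name of our own.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical, simplified stand-in for the OPAC subject cloud page
my $html = <<'HTML';
<html><body>
<form><input type="text" name="number"></form>
<div id="subjectcloud">
<a href="/cgi-bin/koha/opac-search.pl?q=su:History">History</a>
<a href="/cgi-bin/koha/opac-search.pl?q=su:Poetry">Poetry</a>
</div>
<a href="/elsewhere">Not a subject</a>
</body></html>
HTML

# Return the text of every link found inside the "subjectcloud" div
sub subject_links {
    my ($page) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($page);
    # first narrow the search down to the div ...
    my $div = $tree->look_down( _tag => 'div', id => 'subjectcloud' );
    # ... then collect the links below it (none if the div is absent)
    my @texts = $div ? map { $_->as_text } $div->look_down( _tag => 'a' ) : ();
    # release the parsed tree once we have plain strings
    $tree->delete;
    return @texts;
}

print join( ', ', subject_links($html) ), "\n";
```

Only the two links inside the div are returned; the link outside it is ignored, which is exactly why the script searches from the div rather than from the whole tree.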

Finally, in a terminal, KOHA_CONF and PERL5LIB must be set before calling our script, for example:
export KOHA_CONF=/home/koha/etc/koha-conf.xml;export PERL5LIB=/home/koha/src; perl ~/getopactagssubject.pl
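Forgetting one of these two variables is an easy mistake, so a small standalone check can help. This is only a sketch: koha_env_problems is a hypothetical helper, and it assumes, as in the command above, that both variables have to be set (an installation where the Koha modules are already in Perl's default search path may not need PERL5LIB).

```perl
use strict;
use warnings;

# Hypothetical helper: return a list of problems found in the given
# environment (passed as key/value pairs), empty if it looks usable.
sub koha_env_problems {
    my (%env) = @_;
    my @problems;
    if ( !defined $env{KOHA_CONF} || $env{KOHA_CONF} eq '' ) {
        push @problems, 'KOHA_CONF is not set';
    }
    elsif ( !-r $env{KOHA_CONF} ) {
        push @problems, "KOHA_CONF points to an unreadable file: $env{KOHA_CONF}";
    }
    push @problems, 'PERL5LIB is not set'
        unless defined $env{PERL5LIB} && $env{PERL5LIB} ne '';
    return @problems;
}

my @problems = koha_env_problems(%ENV);
print "$_\n" for @problems;
print "Environment looks usable\n" unless @problems;
```

It is kept as a separate script because "use C4::Context" is resolved at compile time, before any ordinary check written inside getopactagssubject.pl would run.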

You can put this into crontab to automate its execution, for example every night.
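As an illustration, a crontab entry could look like the fragment below. The paths are the ones used in the command above; the 2:00 a.m. schedule is an arbitrary choice, and $HOME is used in place of ~ on the assumption that cron sets it for the user owning the crontab.

```
# m h dom mon dow command
0 2 * * * KOHA_CONF=/home/koha/etc/koha-conf.xml PERL5LIB=/home/koha/src perl $HOME/getopactagssubject.pl
```

The entry is added with "crontab -e" while logged in as the user who owns the script.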

Author : Stéphane Delaune
