CodeLab
23Sep/10Off

Quick sitemaps with PERL, WWW::Mechanize and the Template Toolkit

Recently, I redesigned the http://www.ferrarihaines.com website, and re-wrote its back-end to make it Catalyst-based (I know, it seems like an overkill, but this sets a nice base for future plans). While getting ready to launch the new version I needed to update the sitemap.xml, but despite of FH being a small site, working out manually all the new links seemed a very tedious task.

There are many online CGIs that could have done it for me, but I would rather have a script that I can run whenever I need. So, there they were, WWW::Mechanize & the Template-toolkit to the rescue:

#!/usr/bin/env perl
use strict;

{
    package Sitemap;
    use strict;
    use warnings;
    use Template;
    use WWW::Mechanize;
    use Time::HiRes qw/time/;

    sub new {
        my ( $class, $host, $frequency ) = @_;

        return bless {
            mech      => WWW::Mechanize->new( autocheck => 1 ),
            host      => $host,
			frequency => $frequency || 'weekly',
            tt        => Template->new(),
            visited   => {},
          },
          ref $class || $class;
    }

    sub mech { shift->{mech} }
    sub host { shift->{host} }
    sub tt   { shift->{tt} }

    sub build {
        my $self  = shift;
        my @links = ( $self->host );
        while ( my $to_visit = shift @links ) {

			# of course skip if already visited
            next if  $self->{visited}->{$to_visit};

			# fetch url and let us know the response time
            print "fetching $to_visit";
			my $start = time();
            $self->mech->get($to_visit);
            printf " %.03f s\n",  time() - $start;

            $self->{visited}->{$to_visit} = 1;

            foreach my $found_link ( ( $self->mech->links() ) ) {

                if (   $found_link->url() !~ /^$self->{host}.*/o
                    or $found_link->url() =~ /^mailto/go
					or $self->{visited}->{$found_link}) {
				    # skip links out, mailtos or already visited
                    next;
                }

                # add to queue to visit, removing trailing "/"
                ($found_link = $found_link->url()) =~ s/\/$//o;
                push( @links, $found_link )
                  unless $self->{visited}->{$found_link};
            }
        }

        $self->save_sitemap();
    }

    sub save_sitemap {
        my $self = shift;
        my $tt = join '', <main::DATA>;
        $self->tt->process( \$tt,
                        { urls      => [ keys %{ $self->{visited} } ],
                          frequency => $self->{frequency},
                         },
                         'sitemap.xml' )
                   or die $self->tt->error();
    }

1;
}

# Run
print "Usage: $0 http://your.site.com\n" and exit(1)
   unless $ARGV[0];

my $sitemap = Sitemap->new(@ARGV);
$sitemap->build();

__DATA__
[%~ USE date(format =  '%Y-%m-%dT%H:%M:%S+00:00') ~%]
<?xml version="1.0" encoding="UTF-8"?>
<urlset
	  xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
	  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	  xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
[%- FOREACH url IN urls %]
   <url>
	  <loc>[%~ url ~%]</loc>
	  <lastmod>[%~  date.format  ~%]</lastmod>
	  <changefreq>[%~ frequency ~%]</changefreq>
	  <priority>[%~
	      # you may want to define different priorities for diff
		  # paths in your site here
		  IF url.match('spanish_courses');
			  0.40;
		  ELSIF url.match('services');
			 0.32;
		  ELSIF url.match('about');
			 0.256;
		  ELSE;
			 0.205;
		  END;
	  ~%]</priority>
   </url>
[%- END %]
</urlset>

That's it! You need to alter lines 100-108 to set priorities for your pages. Save the code as build_sitemap.pl and call it passing the main url of your site as the first argument and the 'change frequency' of your sitemap:

$ ./build_sitemap.pl http://www.yoursite.com weekly

Happy sitemapping!

Comments (0) Trackbacks (0)

Sorry, the comment form is closed at this time.

Trackbacks are disabled.