Webscraping by example

Max Maischein



  • What is Web Scraping?

  • Examples of scraping web pages

  • Considerations for implementing your own scraper

Who am I?

  • Max Maischein

  • DZ BANK Frankfurt

  • Deutsche Zentralgenossenschaftsbank

  • Information management

My Leitmotiv


  • If I can do it manually

  • ... then the computer can repeat it

  • ... correctly every time

My Environment


Intranet web automation and scraping (WWW::Mechanize::Firefox)


  • People ask about web scraping

  • People ask about HTML parsing

What Is Web Scraping?

Web Scraping is

  • the automated process of

  • extracting data

  • from resources served over HTTP

  • and encoded as HTML

  • (or Javascript)

What you need to know

  • Perl

  • A bit of HTTP

  • A bit of HTML

Concepts and Modules

  • WWW::Mechanize::Firefox::DSL

  • Shared API with others

  • WWW::Scripter, WWW::Mechanize

  • Same concepts elsewhere

  • Mojo::DOM

Aspects of Web Scraping

Not everything that is technically possible is also allowed or desired

  • Can you avoid scraping?

  • -> Database access

  • -> Database dump

  • Is scraping allowed?

  • -> Terms of Service (TOS)

  • -> TOS applicable?


  • Identify the steps a human takes

  • -> Navigation

  • -> Content Extraction

  • Automate these steps

  • Repeat

Example: Extracting news posts

Look at the page

Print the page HTML

    #!perl -w
    use WWW::Mechanize::Firefox;
    my $mech = WWW::Mechanize::Firefox->new();

    $mech->get('http://conferences.yapceurope.org/gpw2011');
    print $mech->content();

Some simplicity


With WWW::Mechanize::Firefox::DSL


    #!perl -w
    use WWW::Mechanize::Firefox::DSL;

    get('http://conferences.yapceurope.org/gpw2011');
    print content();

Web Page

Firebug



  • Add-on for Firefox

  • Visual tool to inspect elements on the web page

Our goal

Extract news posts

First CSS selector

    .newsbox

Refined CSS selector

    .newsbox h3 a
        Call for Papers online

    .newsbox h3+p+p
        Endlich ist es so weit ...

Perl script

    #!perl -w
    use strict;
    use WWW::Mechanize::Firefox::DSL;

    get_local('html/gpw-2011-news.htm');

    my @headline = selector('.newsbox h3 a');
    my @text = selector('.newsbox h3+p+p');

    for my $article (0..$#headline) {
        print $headline[$article]->{innerHTML}, "\n";
        print "---", $text[$article]->{innerHTML}, "\n";
    };


    demo/01-scrape-gpw-2011.pl
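The same two selectors can be tried without a running browser. A minimal sketch using Mojo::DOM, assuming Mojolicious is installed. The HTML snippet and the helper name extract_news are made up for illustration and are not the real conference page.

```perl
#!perl -w
use strict;
use Mojo::DOM;    # assumption: Mojolicious is installed

# Hypothetical stand-in for the saved news page
my $html = <<'HTML';
<div class="newsbox">
  <h3><a href="/news/1">First headline</a></h3>
  <p class="date">2011-01-01</p>
  <p>First teaser text</p>
</div>
<div class="newsbox">
  <h3><a href="/news/2">Second headline</a></h3>
  <p class="date">2011-01-02</p>
  <p>Second teaser text</p>
</div>
HTML

# Returns a list of [headline, text] pairs (hypothetical helper)
sub extract_news {
    my ($html) = @_;
    my $dom = Mojo::DOM->new($html);

    # Same selectors as in the slides, written with spaces around "+"
    my @headline = $dom->find('.newsbox h3 a')->map('text')->each;
    my @text     = $dom->find('.newsbox h3 + p + p')->map('text')->each;

    return map { [ $headline[$_], $text[$_] ] } 0 .. $#headline;
}

for my $article (extract_news($html)) {
    print $article->[0], "\n";
    print "---", $article->[1], "\n";
}
```

This pairs headlines and texts by position, like the script above, so it relies on every news box containing both elements.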

Refining the extraction

  • The command line might be more convenient

  • Command-line selector / XPath extractors, such as

  • Mojolicious::Command::get (mojo get)

  • App::scrape

  • WWW::Mechanize::Firefox/examples/scrape-ff.pl


The example showed

  • Data extraction

  • using CSS selectors

  • WWW::Mechanize + HTML::TreeBuilder::XPath + HTML::Selector::XPath

  • WWW::Mechanize::Firefox

  • App::scrape

  • Mojo::DOM

A more complex example (Wikipedia)

  • Extract images of (DC) Super Heroes from Wikipedia

  • Input: Name of Super Hero

  • Output: Image URL (or data) of Super Hero Image on Wikipedia and description text

Sample Session

  • Enter Super Hero name (Navigation)

  • See description on resulting page (Extraction)

  • See picture on resulting page (Extraction)

First stab

    use WWW::Mechanize::Firefox::DSL;
    use strict;

    # Navigation
    my ($hero) = @ARGV;
    my $url = 'http://en.wikipedia.org';
    get $url;
    print content; # are we on the right page?

    print title;
    # Wikipedia, the free encyclopedia ...

Check for "where we are"

This matters not now, but later, when the website changes

    use Carp qw(croak);

    my ($hero) = @ARGV;
    my $url = 'http://en.wikipedia.org';
    get $url;
    on_page("Wikipedia, the free encyclopedia");

    ...
    on_page("$hero - Wikipedia");

    # Die with a clear message if the page title is not what we expect
    sub on_page {
        my ($expected) = @_;
        if (title() !~ /\Q$expected\E/) {
            croak sprintf "Wrong page [%s], expected [%s]",
                title(), $expected;
        };
    };
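The check itself needs no browser. A minimal sketch of the same idea as a plain function that receives the title as an argument, so it can be tried on its own. The name check_title and the sample titles are assumptions for illustration.

```perl
use strict;
use warnings;
use Carp qw(croak);

# Hypothetical variant of on_page(): the title is passed in
# instead of being read from the browser.
sub check_title {
    my ($title, $expected) = @_;
    croak sprintf "Wrong page [%s], expected [%s]", $title, $expected
        if $title !~ /\Q$expected\E/;
    return 1;
}

# Passes: the expected text is part of the title
check_title('Superman - Wikipedia, the free encyclopedia', 'Superman');

# Fails: a search results page instead of the article
my $ok = eval {
    check_title('Plastix man - Search results', 'Plastix man - Wikipedia');
    1;
};
print "Caught: $@" unless $ok;
```

The \Q...\E quoting keeps characters like "(" or "+" in a hero name from being treated as regular expression syntax.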

Fill in the field

  • Find the field

        search

  • Find the field name

    submit_form(
        with_fields => {
            search => $hero,
        },
    );

Where are we now?

    print title;

    # Superman - Wikipedia, the free ...

    on_page($hero);



    Plastix man

    ...
    Wrong page [Plastix man - Search results -...]

What do we have now?

  • Enter Super Hero name (Navigation)

  • See description on resulting page (Extraction)

  • See picture on resulting page (Extraction)

Find the description

  • Find the text selector (Firebug, App::scrape, examples/scrape-ff.pl)

        .infobox +p

  • Extract it

        print $_->{innerHTML}
            for selector('.infobox +p');

Find the image

  • Find the image selector (Firebug, ...)

        .infobox a.image img

  • Extract it

        print $_->{src}
            for selector('.infobox a.image img');

        # http://upload.wikimedia.org/wikipedia/en/thumb/7/72/Superman.jpg/250px-Superman.jpg


The example showed

  • Site navigation

  • Data extraction using CSS selectors

  • Navigation validation checking

Example: Extracting Google+ profiles


  • Search with Google for the name

  • Extract the information from the profile page

  • Extract "people in circles"

Web Page

    <h2 class="a-b-D-Nd-aa d-q-p">Links</h2>
    <ul class="a-b-D-G-lg Nd">
    <li><img alt="" class="a-b-D-Mf" src="113231249772841733835_files/favicons_007.png">
      <div class="a-b-D-k k"><a class="a-b-D-k-cg url" href="http://corion.net/" 
      ...

Find the related links

    div.k a.url

    my @links = selector('div.k a.url');

    for my $link (@links) {
      print $link->{innerHTML}, "\n";
      print $link->{href}, "\n";
      print "\n";
    };

Live demo

    04-scrape-gplus-profile.pl

Find the "circled" contacts

Google relies heavily on JSON and dynamically generated Javascript

    var OZ_initData = 
    ...
    ,""]
    ,1,,1]
    ,[[41,[["113771019406524834799","/113771019406524834799",...,"Paul Boldra"]
    ,["101868469434747579306","/101868469434747579306",...,"Curtis Poe"]
    ,["114091227580471410039","/114091227580471410039",...,"James theorbtwo Mastros"]
    ,["116195765101222598270","/116195765101222598270",...,"Michael Kröll"]
    ...

We want

    OZ_initData["5"][3][0]

Extraction from "native" JS

    my ($info,$type)
      = eval_in_page('OZ_initData["5"][3][0]');

    my $items = $info->[0];
    print "$items accounts in Circles\n";
    for my $i (@{ $info->[1] }) {
        print $i->[0], "\t", $i->[3],"\n";
    };

Live demo

    demo/05-scrape-gplus-circle.pl
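When the embedded data happens to be strict JSON, it can also be pulled out of the page source without a browser. A minimal sketch using the core module JSON::PP. The variable name initData, the page snippet and the helper circle_info are invented for illustration. The Google+ data shown above contains empty array slots (",1,,1]"), which strict JSON does not allow, so this approach would probably not work there and evaluating in the browser remains the safer route.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Made-up page; "initData" and its layout are assumptions
my $html = <<'HTML';
<script>
var initData = {"5":[null,null,null,[[2,[["1001","/1001","Alice"],["1002","/1002","Bob"]]]]]};
</script>
HTML

# Returns [ count, [ [id, path, name], ... ] ] or nothing if not found
sub circle_info {
    my ($html) = @_;

    # Capture the object literal assigned to the variable
    my ($json) = $html =~ /var\s+initData\s*=\s*(\{.*?\})\s*;/s
        or return;

    # Same path as in the browser: initData["5"][3][0]
    return decode_json($json)->{5}[3][0];
}

my $info = circle_info($html) or die "data not found\n";
print "$info->[0] accounts in Circles\n";
print $_->[0], "\t", $_->[2], "\n" for @{ $info->[1] };
```

The regular expression stops at the first "}" that is followed by ";", which is enough for this flat sample but not for nested objects in general.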


This example showed

  • Site/application navigation

  • Data extraction

  • from JSON / Javascript structures

  • WWW::Mechanize::Firefox

  • Maybe also WWW::Scripter

Building your own scraper

  • You know you want to

  • Navigate pages

  • Formulate extraction queries

  • Queries / Navigation as data or as (Perl) code?

Writing your own scraper

Look at and use other scrapers to find out what you need and what you don't like:

  • Configuration

  • Tradeoff - creation speed vs. environment

  • Navigation

  • Extraction

  • Speed

  • External support (Flash, Java, Javascript, ...)

Improving scraping stability and performance

  • Your bottleneck is likely network latency and network I/O

  • Avoid navigation by going directly to the page with the data

        http://en.wikipedia.org/wiki/Superman
        http://en.wikipedia.org/wiki/$super_hero

  • Does not work nicely for Google+

        https://plus.google.com/v/xXxXxXx
  • Avoid overloading the remote server

  • Reduce latency by running multiple scrapers at the same time (AnyEvent::HTTP)

  • Avoid overloading the remote server
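Two of the points above, sketched with core Perl only: building the direct URL from the hero name, and pausing between requests so the remote server is not overloaded. The helper names hero_url and fetch_politely, the escaping rules and the 0.2 second delay are assumptions for illustration. The fetch callback only prints, so nothing is sent over the network.

```perl
use strict;
use warnings;
use Time::HiRes ();

# Build the direct article URL (hypothetical helper).
# Assumes $hero is a decoded character string.
sub hero_url {
    my ($hero) = @_;
    (my $page = $hero) =~ tr/ /_/;    # Wikipedia uses "_" for spaces
    utf8::encode($page);              # escape bytes, not characters
    $page =~ s/([^A-Za-z0-9_.~()-])/sprintf '%%%02X', ord $1/ge;
    return "http://en.wikipedia.org/wiki/$page";
}

# Call $fetch->($url) for every URL, waiting at least $delay
# seconds between two calls (hypothetical helper).
sub fetch_politely {
    my ($fetch, $delay, @urls) = @_;
    my (@results, $last);
    for my $url (@urls) {
        if (defined $last) {
            my $wait = $delay - (Time::HiRes::time() - $last);
            Time::HiRes::sleep($wait) if $wait > 0;
        }
        $last = Time::HiRes::time();
        push @results, $fetch->($url);
    }
    return @results;
}

my @urls = map { hero_url($_) } 'Superman', 'Plastic Man';
my @seen = fetch_politely(
    sub { print "would fetch $_[0]\n"; $_[0] },
    0.2,
    @urls,
);
```

A fixed delay is the simplest policy. When several scrapers run in parallel, the delay should apply per remote host rather than per process.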

Scraping Sites that don't want to be scraped

On the internet, nobody knows that you're a dog, or rather, a Perl script

  • Automate the browser


  • Send the same data as the browser

Automate the browser

  • WWW::Mechanize::Firefox

  • Visual

  • Requires a terminal or VNC display

  • Portable Firefox to isolate the automated browser

Wireshark / Live HTTP Headers

  • Replicate what the browser sends

        GET /-8RekFwZY8ug/AAAAAAAAAAI/AAAAAAAAAAA/7gmV6WFSVG8/photo.jpg?sz=80 HTTP/1.1
        Host: lh5.googleusercontent.com
        User-Agent: Mozilla/5.0 (Windows; U; ...

  • Differences

        ->user_agent()
        ->cookies()

        Accept-Encoding
        ...
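A minimal sketch of sending browser-like headers with the core module HTTP::Tiny. The header values are placeholders, not what any particular browser sends, and should be copied from your own capture. The helper name fetch is made up. Accept-Encoding is left out on purpose, because HTTP::Tiny does not decompress responses itself. Cookies would need a separate cookie jar object passed as cookie_jar.

```perl
use strict;
use warnings;
use HTTP::Tiny;

my $http = HTTP::Tiny->new(
    # Placeholder: paste the full User-Agent string of your browser
    agent           => 'Mozilla/5.0 (placeholder; copy from your browser)',
    default_headers => {
        # Assumed values; take the real ones from Wireshark
        # or Live HTTP Headers
        'Accept'          => 'text/html,application/xhtml+xml',
        'Accept-Language' => 'en',
    },
);

# Fetch a page with the headers configured above (hypothetical helper)
sub fetch {
    my ($url) = @_;
    my $res = $http->get($url);
    die "GET $url failed: $res->{status} $res->{reason}\n"
        unless $res->{success};
    return $res->{content};
}

print "User-Agent: ", $http->agent, "\n";
```

The script only prints the configured User-Agent. Call fetch($url) with a URL you are allowed to scrape to send an actual request.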


  • Many frameworks with different tradeoffs

  • Same tradeoffs for your framework

  • The network hides your program

  • Automate the browser

  • Mimic the browser

Sample code

All samples will go online at


Thank you

Max Maischein (corion@cpan.org)

"Cognitive Hazard" by Anders Sandberg

Super hero images from Wikipedia

Bonus Section

Formulating extraction queries

HTML::Selector::XPath - convert from CSS selectors to XPath

pQuery - jQuery-like extraction

    $dom->q('p')->q('div')->...

Mojo::DOM - AUTOLOAD jQuery-like extraction

    $dom->p->div->(...)

Web::Scraper - specific DSL

    scrape { title => ...,
             'url[]' => scrape { ... }
           }

Writing your own scraper

Reuse the work of others

Tatsuhiko Miyagawa

  • HTML::Selector::XPath - use CSS3 selectors

  • HTML::AutoPagerize - walk through paged results