WWW::Scraper::JustTechJobs - Scrapes Just*Jobs.com


Bundle-WWW-Scraper-Job documentation  | view source Contained in the Bundle-WWW-Scraper-Job distribution.


NAME

WWW::Scraper::JustTechJobs - Scrapes Just*Jobs.com

SYNOPSIS

    require WWW::Scraper;
    $search = new WWW::Scraper('JustTechJobs');

DESCRIPTION

This class is a JustTechJobs specialization of WWW::Search. It handles making and interpreting Just*Jobs searches at http://www.Just*Jobs.com (where * is 'Perl', 'Java', etc.).

OPTIONS

search_debug, search_parse_debug, search_ref: as specified in WWW::Search.

AUTHOR

WWW::Scraper::JustTechJobs is written and maintained by Glenn Wood, http://search.cpan.org/search?mode=author&query=GLENNWOOD.

COPYRIGHT

XML Scaffolding

Look at the idea from the perspective of the XML "scaffold" I'm suggesting for parsing the response HTML.

(This is XML, but looks superficially like HTML)

    <HTML>
      <BODY>
        <TABLE NAME="name" or NUMBER="number">
          <TR TYPE="header"/>
          <TR TYPE="detail*">
            <TD BIND="title" />
            <TD BIND="description" />
            <TD BIND="location" />
            <TD BIND="url" PARSE="anchor" />
          </TR>
        </TABLE>
      </BODY>
    </HTML>

This scaffold describes the relevant skeleton of an HTML document. There are HTML and BODY elements, of course. The <TABLE> entry then tells our parser to skip to the TABLE in the HTML named "name", or to skip "number" TABLE entries (default=0, which picks up the first TABLE element).

Then the TABLE is described. The first <TR> is a "header" row, which the parser throws away. The second <TR> is a "detail" row (the "*" means multiple detail rows, of course). The parser picks up each <TD> element, extracts its content, and places that in the hash entry corresponding to its BIND= attribute. Thus, the first TD goes into $result->_elem('title') (I needed to learn to use LWP::MemberMixin. Thanks, another lesson learned!), the second TD goes into $result->_elem('description'), and so on. (Of course, some of these are _elem_array, but these details will be resolved later.) The PARSE= in the url TD suggests a way for our parser to do special handling of a data element.

The generic scaffold parser would take this XML and convert it to a hash/array to be processed at run time; we wouldn't actually use XML at run time. A backend author would use that hash/array in his native_setup_search() code, calling the "scaffolder" scanner with that hash as a parameter.

As I said, this works great if the response is TABLE structured, but I haven't seen any responses that aren't that way already.
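
To make the BIND= and PARSE= idea concrete, here is a minimal sketch of binding the cells of one detail row to named fields. It is not the module's code: the bind_row() helper, the sample row, and the example.com URL are all hypothetical, and a plain hash stands in for $result->_elem(...). It also uses simple regular expressions rather than a real HTML parser, which is only reasonable for an illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of WWW::Scraper): bind the <TD> cells of one
# detail row to field names. $bindings holds one [ field, parse_mode ] pair
# per cell; the 'anchor' mode stands in for PARSE="anchor" in the scaffold.
sub bind_row {
    my ($row_html, $bindings) = @_;

    # Collect the raw content of every cell in the row, in order.
    my @cells = $row_html =~ m{<td[^>]*>(.*?)</td>}gis;

    my %result;
    for my $i (0 .. $#$bindings) {
        my ($field, $mode) = @{ $bindings->[$i] };
        my $cell = $cells[$i];
        next unless defined $cell;

        if (defined $mode && $mode eq 'anchor') {
            # Special handling: keep the href of the first anchor in the cell.
            ($result{$field}) =
                $cell =~ m{<a\s[^>]*href\s*=\s*["']([^"']*)["']}is;
        }
        else {
            # Default handling: strip tags and surrounding whitespace.
            $cell =~ s/<[^>]+>//g;
            $cell =~ s/^\s+|\s+$//g;
            $result{$field} = $cell;
        }
    }
    return \%result;
}

# Made-up sample row with the four cells the scaffold describes.
my $row = '<TR><TD>Perl Developer</TD><TD>Maintain CGI apps</TD>'
        . '<TD>Seattle, WA</TD>'
        . '<TD><A HREF="http://example.com/job/1">view</A></TD></TR>';

my $job = bind_row($row, [
    [ 'title' ], [ 'description' ], [ 'location' ], [ 'url', 'anchor' ],
]);

print "$_: $job->{$_}\n" for qw(title description location url);
```

The order of the pairs passed to bind_row() mirrors the order of the <TD> entries in the scaffold, which is what lets the parser match cells to fields by position alone.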

The XML scaffold converts to an array tree that looks like this:

    my $scaffold = [ 'HTML', 
                     [ [ 'BODY', 
                       [ [ 'TABLE', 'name' ,                  # or 'name' = undef; multiple <TABLE number=n> mean n 'TABLE's here ,
                         [ [ 'NEXT', 1, 'NEXT &gt;' ] ,       # meaning how to find the NEXT button.
                           [ 'TR', 1 ] ,                      # meaning "header".
                           [ 'TR', 2 ,                        # meaning "detail*"
                             [ [ 'TD', 1, 'title' ] ,         # meaning clear text binding to _elem('title').
                               [ 'TD', 1, 'description' ] ,
                               [ 'TD', 1, 'location' ] ,
                               [ 'TD', 2, 'url' ]             # meaning anchor parsed text binding to _elem('url').
                             ]
                         ] ]
                       ] ]
                     ] ]
                  ];
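
As a rough illustration of how a "scaffolder" might consume this structure, the sketch below walks the array tree and lists the field bindings it finds. The collect_bindings() function is hypothetical and not part of the distribution. It assumes only what the tree itself shows: each node is [ TAG, ... ], child nodes (if any) are an array reference in the last position, and a TD type of 1 means clear text while 2 means anchor-parsed.

```perl
use strict;
use warnings;

# The array tree from the documentation, with the same nesting.
my $scaffold = [ 'HTML',
    [ [ 'BODY',
        [ [ 'TABLE', 'name',
            [ [ 'NEXT', 1, 'NEXT &gt;' ],
              [ 'TR', 1 ],
              [ 'TR', 2,
                [ [ 'TD', 1, 'title' ],
                  [ 'TD', 1, 'description' ],
                  [ 'TD', 1, 'location' ],
                  [ 'TD', 2, 'url' ],
                ] ],
            ] ],
        ] ],
    ],
];

# Hypothetical walker: collect [ field, type ] for every TD node, in order.
sub collect_bindings {
    my ($node, $found) = @_;
    $found ||= [];

    my ($tag, @rest) = @$node;
    push @$found, [ $rest[1], $rest[0] ] if $tag eq 'TD';

    # Assumed convention: children, when present, are the last element.
    my $children = $rest[-1];
    if (ref($children) eq 'ARRAY') {
        collect_bindings($_, $found) for @$children;
    }
    return $found;
}

my $bindings = collect_bindings($scaffold);
printf "%-12s %s\n", $_->[0], ($_->[1] == 2 ? 'anchor' : 'text')
    for @$bindings;
```

A real scanner would also have to act on the TABLE, NEXT, and TR nodes while reading the response; this sketch only shows that the nesting is regular enough to traverse with one small recursive routine.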




