1
votes

The task is simple: fetch a URL and parse the response. If the request fails (404, 500, etc.), take appropriate action. That last part is the one I am having trouble with.
I have listed both pieces of code that I currently use. The longer one (LWP + TreeBuilder) handles both cases; the shorter one (TreeBuilder only) handles the success case but not the error case. If I use TreeBuilder alone and the site returns a 404 or some other error, the script simply exits! Any ideas?

Longer code that works

use strict;
use warnings;
use LWP::UserAgent;
use HTML::TreeBuilder;

my $url = "http://some_url.com/blahblah";
my $response = LWP::UserAgent->new->get($url);
if ($response->is_success) {
    # Parse the body only when the request succeeded.
    my $p = HTML::TreeBuilder->new;
    $p->parse($response->content);
    $p->eof;    # tell the parser that the document is complete
} else {
    warn "Couldn't get $url: ", $response->status_line, "\n";
}

Shorter one that does not

use HTML::TreeBuilder;

$url = "http://some_url.com/blahblah";

$tree = HTML::TreeBuilder->new_from_url($url);

2 Answers

3
votes

"the script simply exits"

No, it throws an exception (new_from_url dies). You can catch the exception with eval BLOCK if you want to handle the failure yourself.

my $tree = eval { HTML::TreeBuilder->new_from_url($url) }
   or warn($@);
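
To see why the script no longer exits once the call is wrapped in eval, here is a minimal sketch that needs no network access. It assumes a hypothetical stand-in, fetch_or_die, that dies on failure the way new_from_url does; fetch_or_die, try_fetch and the example.com URLs are illustrative names only, not part of HTML::TreeBuilder.

```perl
use strict;
use warnings;

# Hypothetical stand-in for HTML::TreeBuilder->new_from_url: it dies on
# failure, as the real method does, but never touches the network.
sub fetch_or_die {
    my ($url) = @_;
    die "GET failed on $url: 404 Not Found\n" if $url =~ /missing/;
    return { url => $url };    # placeholder for a parsed tree
}

# Returns (result, undef) on success or (undef, error message) on failure.
sub try_fetch {
    my ($url) = @_;
    my $tree = eval { fetch_or_die($url) };
    return (undef, $@) unless defined $tree;
    return ($tree, undef);
}

for my $url ('http://example.com/ok', 'http://example.com/missing') {
    my ($tree, $error) = try_fetch($url);
    if ($tree) {
        print "Parsed $url\n";
    } else {
        # The exception was caught, so execution continues here.
        print "Recovered from error: $error";
    }
}
print "Script is still running\n";
```

The same pattern should apply to the real call: replace fetch_or_die($url) with HTML::TreeBuilder->new_from_url($url) and the error text ends up in $@ instead of terminating the script.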
3
votes

To quote the docs:

If LWP is unable to fetch the URL, or the response is not HTML (as determined by content_is_html in HTTP::Headers), then new_from_url dies, and the HTTP::Response object is found in $HTML::TreeBuilder::lwp_response.

Try this:

use strict;
use warnings;
use HTML::TreeBuilder 5; # need new_from_url
use Try::Tiny;

my $url="http://some_url.com/blahblah" ;
my $p = try { HTML::TreeBuilder->new_from_url($url) };
unless ($p) {
    my $response = $HTML::TreeBuilder::lwp_response;
    if ($response->is_success) {
        warn "Content of $url is not HTML, it's " . $response->content_type . "\n";
    } else {
        warn "Couldn't get $url: ", $response->status_line, "\n";
    }
}