3
votes

I was running

sed /http.*.torrent/s/.*(http.*.torrent).*/\1/;/http.*.torrent/p 1.html

to extract links. However, since sed lacks a non-greedy quantifier (which is needed because 'torrent' appears again further along the line), I tried to convert it to Perl, and I need help with the Perl version. (Or if you know how to do it with sed, say so.)

perl -ne s/.*(http.*?.torrent).*/\1/ 1.html

Now I need to add this part, after converting it from sed:

/http.*.torrent/p

This was a part of

sed /http.*.torrent/s/.*(http.*.torrent).*/\1/;/http.*.torrent/p 1.html

but this didn't work either: sed started but didn't quit, and the keys I pressed were echoed back and nothing else happened.

If you gave an example of the input, explained the rule for transforming it, and showed the desired output, then someone could help you do it in Perl without having to figure out what the sed code is trying and failing to do. – d5e5
This is an example of a matching line; the other lines can be anything: <a href="https://toPB.torrent" title="Download this torrent"> The goal is to extract https://toPB.torrent from each such line. – ccvn
Are you trying to parse the complete HTML page to extract .torrent links? In that case you might want to dig into HTML::TreeBuilder. – ssapkota

2 Answers

4
votes

I recommend letting a well-proven module such as HTML::LinkExtor do the heavy lifting for you, and using a regexp simply to validate the links that it finds. The example below shows how easy it can be.

use Modern::Perl;
use HTML::LinkExtor;
use Data::Dumper;

my @links;


# A callback for LinkExtor. Disqualifies non-conforming links, and pushes
# into @links any conforming links.

sub callback {
    my ( $tag, %attr ) = @_;
    return if $tag ne 'a';
    # Skip anchors that have no href, then require 'torrent' in the host part.
    return unless defined $attr{href} && $attr{href} =~ m{https?://[^/]*torrent}i;
    push @links, \%attr;
}


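# The work is done here: Read the html file, parse it, and move on.
local $/;    # slurp mode: read all of DATA in one go
my $html = <DATA>;
my $p = HTML::LinkExtor->new(\&callback);
$p->parse( $html );
$p->eof;     # signal end of document so the parser flushes its buffer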

print Dumper \@links;

__DATA__
<a href="https://toPB.torrent" title="Download this torrent">The goal</a>
<a href="http://this.is.my.torrent.com" title="testlink">Testing2</a> <a href="http://another.torrent.org" title="bwahaha">Two links on one line</a>
<a href="https://toPBJ.torrent.biz" title="Last test">Final Test</a>
A line of nothingness...
That's all folks.

HTML::LinkExtor lets you set up a callback function. The module itself parses your HTML document to find any links. You are looking for the 'a' links (as opposed to 'img', etc.). So in your callback function you just exit as soon as possible unless you have an 'a' link. Then test that 'a' link to see if there's a 'torrent' name in it, in an appropriate position. If that particular regexp isn't what you need, you'll have to be more specific, but I think it's what you were after. As links are found they're pushed onto a data structure. At the end of my test script I print the structure so you can see what you have.

The __DATA__ section contains some sample HTML snippets, along with junk text to verify that it's only finding links.

Using a well-tested module to parse your HTML is far more durable than constructing fragile regular expressions to do the whole job. Many well-made parsing solutions include regular expressions under the hood, but only to do little bits and pieces of the work here and there. When you start relying on a regexp to do the parsing (as opposed to identifying small building blocks), you run out of gas quickly.

Have fun.

3
votes

sed doesn't have non-greedy matching, so your best bet is just to use perl:

perl -ne '/(http.*?\.torrent)/ && print "$1\n"' 1.html

The -n switch tells perl to read the input line by line (from 1.html in this case, or from stdin if no files are given on the command line) and run the code against each line. The -e switch supplies that code, the "something to execute", on the command line.

The first part of the expression matches each line against the pattern, with the parentheses capturing the interesting bit (the URL) into $1. If the match succeeds it evaluates to true, so the print after && is executed, giving you your match followed by a newline.
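If you would rather stay with sed, a negated bracket expression can often stand in for the missing non-greedy quantifier. The following is only a sketch under two assumptions: every href value is wrapped in double quotes, and each line holds at most one link (with several links on a line, the greedy leading .* keeps only the last one). The temporary file and its contents are made up for illustration.

```shell
# Hypothetical sample input standing in for 1.html.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<a href="https://toPB.torrent" title="Download this torrent">
A line of nothingness...
EOF

# [^"]* cannot run past the closing quote of the href value, so the match
# stops at the first .torrent" instead of the later word 'torrent'.
# -n suppresses normal output; the p flag prints only lines that were changed.
links=$(sed -n 's/.*href="\(http[^"]*\.torrent\)".*/\1/p' "$tmp")

echo "$links"
rm -f "$tmp"
```

Note that the whole sed script is inside single quotes. Without them the shell would treat the ; and the parentheses specially before sed ever saw them.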