1
votes

In the string

 <td class="useragent"><a href="/useragents/parse/627832-chrome-windows-blink">Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36</a></td>

I am trying to extract and copy to clipboard

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36

Using the regex tester at https://regexr.com/, I found that this regex accomplishes what I am seeking:

(?<=<td class="useragent"><a href=".*">).*(?=</a>)

When I try the same regex in Sublime Text, it does not work. I'm guessing this has to do with different 'flavors' of RegEx, so how can I change this RegEx to work with Sublime Text?

2
Try this one: <td\s+class="useragent".*><a\s+.*>(.*)</a></td> - accdias
What's this (?<=.*) ? Show a permalink to where you tested this specific regex. - user557597
@AntonioDias I receive the message 'Unable to find <td\s+.*><a\s+.*>(.*)</a></td> in selection' when searching through an HTML file full of strings similar to what I posted. - JohnWick
Maybe it is related to multiline matching. Unfortunately I don't have Sublime here, and my guess was based purely on regex. - accdias
@AntonioDias Thanks anyway, luckily someone else was able to help me figure it out. I appreciate the fast response even if it didn't work in my case. - JohnWick

2 Answers

1
votes

Sublime Text 3 Regex Solution

You cannot use a lookbehind of unknown length in the Perl-compatible regex engine from the Boost library that Sublime Text 3 uses for search: the .* inside your (?<=...) makes its length variable. However, since you are using a positive lookbehind, you may use the \K match reset operator instead (it discards all text matched so far from the match memory buffer, so only what follows \K is reported as the match).

Also, you might consider some enhancements:

  • ".*" might overflow across tags; use "[^"]*" instead
  • .*</a> may run to the last </a> on a line; use .*?</a> to stop at the first one
  • If there are line breaks in the <a> node, use the (?s) DOTALL inline modifier to make .*? match across lines
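
The effect of these three tweaks can be checked outside Sublime Text. Below is a minimal sketch in Python's re module (not the engine Sublime Text uses, so a capture group stands in for \K); the sample strings and the /a, /b, /c hrefs are made up for illustration.

```python
import re

# Two hypothetical cells on one line, to show where greedy matching overshoots.
line = ('<td class="useragent"><a href="/a">First UA</a></td>'
        '<td class="useragent"><a href="/b">Second UA</a></td>')

# Greedy ".*" runs across both cells, so only the last link text is captured.
greedy = re.findall(r'<td class="useragent"><a href=".*">(.*)</a>', line)

# [^"]* stays inside the attribute value and .*? stops at the first </a>.
lazy = re.findall(r'<td class="useragent"><a href="[^"]*">(.*?)</a>', line)

# A hypothetical cell whose link text contains a line break.
multiline = '<td class="useragent"><a href="/c">Line one\nLine two</a></td>'

# Without (?s) the dot does not match "\n", so nothing is found.
no_dotall = re.findall(r'<td class="useragent"><a href="[^"]*">(.*?)</a>', multiline)

# With (?s) the dot also matches line breaks.
dotall = re.findall(r'(?s)<td class="useragent"><a href="[^"]*">(.*?)</a>', multiline)

print(greedy)     # ['Second UA']
print(lazy)       # ['First UA', 'Second UA']
print(no_dotall)  # []
print(dotall)     # ['Line one\nLine two']
```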

Use

(?s)<td class="useragent"><a href="[^"]*">\K.*?(?=</a>)
                                          ^^ 

See the regex demo.

See Keep The Text Matched So Far out of The Overall Regex Match at regular-expressions.info.
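
For comparison, the same restriction shows up in other engines. The sketch below uses Python's standard re module, which also rejects a variable-width lookbehind and has no \K operator; the usual workaround there is to match the prefix normally and read a capture group. This only illustrates the idea and is not a pattern to paste into Sublime Text.

```python
import re

html = ('<td class="useragent"><a href="/useragents/parse/627832-chrome-windows-blink">'
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
        '(KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36</a></td>')

# The .* inside the lookbehind has no fixed width, so compiling the pattern fails.
try:
    re.compile(r'(?<=<td class="useragent"><a href=".*">).*(?=</a>)')
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False

# Workaround without \K: consume the prefix and capture only the wanted text.
pattern = re.compile(r'<td class="useragent"><a href="[^"]*">(.*?)</a>', re.DOTALL)
match = pattern.search(html)
user_agent = match.group(1) if match else None

print(lookbehind_ok)  # False
print(user_agent)
```

In Sublime Text itself, \K keeps the selection limited to the user agent text, which is what makes copying the match to the clipboard straightforward; a capture group would select the whole cell.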

PHP Fallback Using DOM Parsing

You should be cautious about extracting data from arbitrary HTML with a regex. If you need to collect all such texts from a large HTML document, consider using a real HTML DOM parser instead. Here is an example using PHP (see an online PHP demo):

// Sample HTML; in practice this would be the full document
$text = <<<EOD
<td class="useragent"><a href="/useragents/parse/627832-chrome-windows-blink">Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36</a></td>
EOD;
$domDocument = new DOMDocument;
// Parse the fragment without adding implied <html>/<body> wrappers or a doctype
$domDocument->loadHTML($text, LIBXML_HTML_NOIMPLIED|LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($domDocument);
// Select every <a> that is a direct child of <td class="useragent">
$nodes = $xpath->query('//td[@class="useragent"]/a');
$res = [];
foreach ($nodes as $txt) {
    $res[] = $txt->textContent; // text content of the link
}
print_r($res);

Result:

Array
(
    [0] => Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36
)

Here, $text is your HTML text, and //td[@class="useragent"]/a is an XPath expression that finds all td nodes whose class attribute value is equal to useragent and then selects the a child nodes inside them. The actual text of each link is returned with $txt->textContent.
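
If PHP is not available, the same parser-based idea can be sketched with Python's standard html.parser module. This is only an approximation of the XPath above: the UserAgentCollector class is a hypothetical helper written for this example, and it collects the text of any <a> nested inside a td with class="useragent", not strictly direct children.

```python
from html.parser import HTMLParser


class UserAgentCollector(HTMLParser):
    """Collects the text of <a> elements nested in <td class="useragent">."""

    def __init__(self):
        super().__init__()
        self.in_cell = False   # currently inside <td class="useragent">
        self.in_link = False   # currently inside an <a> within that cell
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = dict(attrs).get("class") == "useragent"
        elif tag == "a" and self.in_cell:
            self.in_link = True
            self.results.append("")  # start a new text entry for this link

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False
        elif tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current entry.
        if self.in_link:
            self.results[-1] += data


html = ('<td class="useragent"><a href="/useragents/parse/627832-chrome-windows-blink">'
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
        '(KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36</a></td>')

collector = UserAgentCollector()
collector.feed(html)
collector.close()
print(collector.results)
```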

0
votes

All I have at hand is Python, so I tested <td class="useragent"><a .*>(.*)</a></td> against the string you posted, and it works. Look:

>>> import re
>>> agent=re.compile(r'<td class="useragent"><a .*>(.*)</a></td>')
>>> s='<td class="useragent"><a href="/useragents/parse/627832-chrome-windows-blink">Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36</a></td>'
>>> agent.findall(s)
['Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36']
>>> 

I hope that helps.