0
votes

I am trying to replace every link on an HTML page so that people can't click on it, but I can't find the regex for this pattern:

href="Any_URL", except URLs containing ".js" or ".css" (in the middle or at the end of the URL)

I tried many patterns, such as href=".+(.css|.js){0}.*"

The idea is to get the content of a website and replace every URL (except those containing .js and .css) with href="#" so people can't click on it.

$subject = file_get_contents($url, FILE_USE_INCLUDE_PATH); // get the content of the website

$pattern = '#href=".+(.css|.js){0}.*"#i'; // doesn't work

$page=preg_replace($pattern, 'href=#', $subject); // replace all the links by something not clickable

return $page;

2
{0} means "match zero times"; it doesn't ensure that the text is not there. You would instead need to use an assertion, like \.(?!css|js)\w+ - Niet the Dark Absol
Don't use regular expressions to parse HTML. Use an XPath query, and then you have a simple substring search to do on the returned href attribute values. - miken32
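
The DOM approach suggested in the comment above could look roughly like the following. This is only a sketch: the function name disableLinks() is made up for illustration, and it assumes every element with an href attribute should be handled, not just anchors.

```php
<?php
// Hypothetical helper: disableLinks() is a name invented for this sketch.
function disableLinks(string $html): string
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally, since real-world HTML is rarely strict.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Select every element that carries an href attribute (a, link, area, ...).
    foreach ($xpath->query('//*[@href]') as $node) {
        $href = $node->getAttribute('href');
        // Plain substring search: leave stylesheet and script URLs untouched.
        if (stripos($href, '.css') !== false || stripos($href, '.js') !== false) {
            continue;
        }
        $node->setAttribute('href', '#');
    }
    return $doc->saveHTML();
}
```

Because the href values are read from the parsed tree, quoting style and attribute order in the source markup no longer matter. Note that saveHTML() returns a full serialized document, so the output is not byte-for-byte the input with only the links changed.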

2 Answers

0
votes

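{0} makes the group match zero times, so the group contributes nothing to the match, but the text can still be present in the string. Try a negative lookbehind instead. Note that it only excludes URLs that end in .css or .js, not ones that contain them in the middle:

href="([^"]+)(?<!\.css|\.js)"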

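
To see the difference in practice, here is a small comparison of the asker's pattern with a lookbehind pattern in the style of the one above. It uses [^"]+ so that a match cannot run past the closing quote; the variable names and sample strings are made up for this sketch.

```php
<?php
// The asker's pattern: the {0} group matches nothing, so ".+" and ".*"
// are still free to swallow ".css" or ".js".
$original = '#href=".+(.css|.js){0}.*"#i';

// Lookbehind variant: the text right before the closing quote must not be
// ".css" or ".js". Only the END of the URL is checked.
$lookbehind = '#href="([^"]+)(?<!\.css|\.js)"#i';

// Illustrative inputs, not taken from a real page.
$samples = ['href="style.css"', 'href="app.js?v=2"', 'href="page.html"'];

foreach ($samples as $s) {
    printf(
        "%-20s original=%d lookbehind=%d\n",
        $s,
        preg_match($original, $s),
        preg_match($lookbehind, $s)
    );
}
```

The original pattern matches all three samples, which is why nothing was excluded. The lookbehind version skips style.css but still matches app.js?v=2, so it does not cover the "in the middle" case from the question.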

0
votes

This should reject anything with .css or .js in it, using a negative lookahead:

/href="(?![^"]*\.(?:css|js))[^"]*"/i
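
A lookahead has to be anchored right after the opening quote and scan up to the closing quote; otherwise an earlier dot in the URL (for example the one in ".com") is enough to satisfy it. A minimal sketch of the full replacement, assuming double-quoted href attributes only (the function name neutralizeLinks() and the sample markup are invented for illustration):

```php
<?php
// Hypothetical helper: neutralizeLinks() is a name invented for this sketch.
function neutralizeLinks(string $html): string
{
    // (?![^"]*\.(?:css|js)) : starting at the opening quote, ".css" or ".js"
    // must not appear anywhere before the closing quote.
    $pattern = '#href="(?![^"]*\.(?:css|js))[^"]*"#i';
    // preg_replace() returns null on a regex error; fall back to the input.
    return preg_replace($pattern, 'href="#"', $html) ?? $html;
}

// Illustrative markup, not from a real page.
$html = '<link href="style.css"> <a href="/static/app.js?v=3">script</a> <a href="http://example.com/">home</a>';
echo neutralizeLinks($html), "\n";
```

Only the last link is rewritten to href="#". Since the check is a plain "contains" test, URLs such as data.json or page.jsp are skipped as well, which matches the question as asked but may be broader than intended.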