I have a PHP variable ($output) that holds the entire HTML page that will be rendered in the browser, and I need to replace every image src with data:image to get a lazy-load script to work.

The requirements are:

  • The img tags don't all have the same structure; we have:

    <img src="img.jpg" alt='' />

    <img alt="text" src='img.gif'>

    <img class="myclass" src="img.png" alt='' />

    ... etc

  • I only want to replace images that are between <body {can have optional text}> and </body>

  • Don't replace img tag between <script {optional text here}> and </script>

Thanks

Could you show what you've tried so far, please? - Jerry
Don't use regular expressions to parse HTML. You cannot reliably parse HTML with regular expressions, and you will face sorrow and frustration down the road. As soon as the HTML changes from your expectations, your code will be broken. See htmlparsing.com for examples of how to properly parse HTML with modules that have already been written, tested and debugged. - Andy Lester

1 Answer

The mistake a lot of people make with regular expressions is trying to write one gigantic regular expression that does everything. This way lies madness. Not only may it be impossible (depending on the problem), but it will be complicated, ugly, and fragile. Far better to break things up into manageable steps.

You say you only want to replace <img> tags within <body>, but the only place <img> tags are valid is within <body>, so I'm going to ignore this. If you really need to ignore <img> tags outside of <body>, you can wrap the whole thing in yet another preg_replace_callback to pluck the <body> out of your input.
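If the <body> restriction does matter, the extra wrapping step could look roughly like the sketch below. The helper name rewrite_body_only and the variable $renameSrc are hypothetical, and the <body> pattern assumes a single, well-formed <body>...</body> pair; the inner callback simply reuses the two regexes given further down in this answer.

```php
<?php
// Hypothetical helper: apply $rewrite only to the <body>...</body> part of $html.
function rewrite_body_only(string $html, callable $rewrite)
{
    return preg_replace_callback(
        // "<body" plus optional attributes, then everything up to "</body>".
        // i = case-insensitive, s = "." also matches newlines.
        '/<body\b[^>]*>.*?<\/body>/is',
        function (array $m) use ($rewrite) {
            return $rewrite($m[0]);
        },
        $html
    );
}

// The <img> rewrite from this answer, passed in as the callback.
$renameSrc = function (string $html) {
    return preg_replace_callback('/<img .*?>/', function (array $m) {
        return preg_replace('/\bsrc\s*=\s*[\'"](.*?)[\'"]/', 'data-image="$1"', $m[0]);
    }, $html);
};

$input = '<head><!-- <img src="skip.png"> --></head>'
       . '<body class="x"><img src="img.jpg"></body>';

echo rewrite_body_only($input, $renameSrc), "\n";
```

Only the tag inside <body> is rewritten; the one in <head> is left alone. This does nothing about <script> blocks, which would need a similar extra step.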

So, the approach I've taken is to use two regular expressions: one to match all instances of the <img> tag in your input, and another to replace the src attribute within each match. To accomplish this, I use preg_replace_callback:

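// $input is the original page HTML; $output receives the rewritten copy.
$output = preg_replace_callback( '/<img .*?>/', function($matches) {
    // Within each matched <img> tag, rename src="..." to data-image="...".
    return preg_replace( '/\bsrc\s*=\s*[\'"](.*?)[\'"]/',
        'data-image="$1"', $matches[0] );
}, $input );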

Note the lazy quantifier ? after each *: without it, .* is greedy, so two <img> tags on the same line would be matched as one big tag, which is not what we want. In the replacement callback, I look for the src attribute and rename it to data-image.
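To see the effect on the three tag shapes from the question, here is a small runnable sketch. The function name rename_img_src is hypothetical; its body is the same two-regex approach as above.

```php
<?php
// Hypothetical wrapper around the two-regex approach from this answer.
function rename_img_src(string $html)
{
    return preg_replace_callback('/<img .*?>/', function (array $matches) {
        // Rename the src attribute inside this one <img> tag.
        return preg_replace('/\bsrc\s*=\s*[\'"](.*?)[\'"]/', 'data-image="$1"', $matches[0]);
    }, $html);
}

// The sample tags from the question (nowdoc, so quotes are taken literally).
$input = <<<'HTML'
<img src="img.jpg" alt='' />
<img alt="text" src='img.gif'>
<img class="myclass" src="img.png" alt='' />
HTML;

echo rename_img_src($input), "\n";
// Prints:
// <img data-image="img.jpg" alt='' />
// <img alt="text" data-image="img.gif">
// <img class="myclass" data-image="img.png" alt='' />
```

Note that the replacement always writes double quotes around the value, whichever quote style the original tag used.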

Here's where this solution will fail:

  • If you have apostrophes in quote-delimited src attributes (<img src="what's_up_doc.jpg">) or vice-versa. If you need to solve this, you'll have to have two different replacement regexes, one to handle double-quoted attributes, and one to handle single-quoted attributes.
  • If your <img> tags span multiple lines. If this is a problem, add the s (DOTALL) modifier to the outer regexp, i.e. '/<img .*?>/s', so that . matches everything, including newlines. (The [^] idiom works in JavaScript but is not valid in PHP's PCRE.)
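A minimal sketch of one way to cover both cases, under a couple of assumptions: instead of two separate replacement regexes it uses a backreference (\1) so the closing quote must match the opening one, and it swaps the space after <img for \s so a newline right after the tag name is accepted too. The function name rename_img_src_robust is hypothetical.

```php
<?php
// Hypothetical variant that tolerates mixed quotes and multi-line tags.
function rename_img_src_robust(string $html)
{
    // s (DOTALL) lets "." match newlines, so a tag may span several lines.
    return preg_replace_callback('/<img\s.*?>/s', function (array $matches) {
        // (["\']) captures the opening quote; \1 demands the same quote to close,
        // so an apostrophe inside a double-quoted value no longer ends the match.
        return preg_replace(
            '/\bsrc\s*=\s*(["\'])(.*?)\1/s',
            'data-image=$1$2$1', // keep the original quote style
            $matches[0]
        );
    }, $html);
}

echo rename_img_src_robust('<img src="what\'s_up_doc.jpg">'), "\n";
// Prints: <img data-image="what's_up_doc.jpg">

echo rename_img_src_robust("<img\n  src='a.png'\n  alt=''>"), "\n";
```

This still assumes that attribute values contain no > character, so it remains a pattern-matching shortcut rather than real HTML parsing.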