The mistake a lot of people make with regular expressions is trying to write one gigantic regular expression that does everything. This way lies madness. Not only may it be impossible (depending on the problem), but it will be complicated, ugly, and fragile. Far better to break things up into manageable steps.
You say you only want to replace <img> tags within <body>, but the only place <img> tags are valid is within <body>, so I'm going to ignore this. If you really need to ignore <img> tags outside of <body>, you can wrap the whole thing in yet another preg_replace_callback to pluck the <body> out of your input.
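For illustration, here is a rough sketch of what that extra wrapping layer might look like. The sample $input and the /<body\b.*?<\/body>/is pattern are my own assumptions, not something taken from your markup, so adjust them to fit your real documents:

```php
<?php
// Hypothetical sample input; an <img> inside <head> is invalid HTML and is
// only here to show that it is left untouched.
$input = '<html><head><img src="a.png"></head>'
       . '<body><p><img src="b.png"></p></body></html>';

// Outer pass: isolate the <body> element. The s modifier lets . match
// newlines, and the i modifier tolerates <BODY>.
$output = preg_replace_callback('/<body\b.*?<\/body>/is', function ($bodyMatch) {
    // Middle pass: find each <img> tag inside the body.
    return preg_replace_callback('/<img .*?>/', function ($imgMatch) {
        // Inner pass: rename src to data-image within a single tag.
        return preg_replace('/\bsrc\s*=\s*[\'"](.*?)[\'"]/',
            'data-image="$1"', $imgMatch[0]);
    }, $bodyMatch[0]);
}, $input);

echo $output, "\n";
```

Only the <img> inside <body> should be rewritten; the one in <head> keeps its src attribute.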
So, the approach I've taken is to use two regular expressions: one to match all instances of the <img> tag in your input, and another to replace the src attribute. To accomplish this, I use preg_replace_callback:
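// Outer regex: match each <img> tag; the callback rewrites one tag at a time.
$output = preg_replace_callback( '/<img .*?>/', function($matches) {
    // Inner regex: rename the src attribute to data-image, keeping its value.
    return preg_replace( '/\bsrc\s*=\s*[\'"](.*?)[\'"]/',
        'data-image="$1"', $matches[0] );
}, $input );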
Note the use of the lazy modifier ? on the * quantifier: without it, two consecutive <img> tags will be treated as one big one, which is not what we want. In the replacement function, I look for the src attribute and replace it with the data-image attribute.
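To see the difference the lazy modifier makes, here is a small sketch that counts matches with and without it. The two-tag $input is a made-up example:

```php
<?php
// Hypothetical input with two consecutive tags.
$input = '<img src="a.png"><img src="b.png">';

// Greedy: .* runs to the last > in the string, so both tags come back as one match.
preg_match_all('/<img .*>/', $input, $greedy);

// Lazy: .*? stops at the first >, so each tag is matched on its own.
preg_match_all('/<img .*?>/', $input, $lazy);

var_dump(count($greedy[0])); // int(1)
var_dump(count($lazy[0]));   // int(2)
```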
Here's where this solution will fail:
- If you have apostrophes in double-quoted src attributes (<img src="what's_up_doc.jpg">) or vice versa. If you need to solve this, you'll have to have two different replacement regexes, one to handle double-quoted attributes, and one to handle single-quoted attributes.
- If your <img> tags span multiple lines. If this is a problem, add the s modifier to the outer regexp (/<img .*?>/s) so that . matches everything, including newlines.
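If you do run into either of these cases, the following is one possible sketch that covers both. Note that it swaps in a backreference (\1) rather than the two separate replacement regexes suggested above, and it relies on the s modifier, which in PHP's PCRE makes . match newlines. The sample $input is invented for the demonstration, and I'm assuming you want to keep whichever quote style each attribute originally used:

```php
<?php
// Hypothetical input: an apostrophe inside a double-quoted src, and a tag
// split across several lines.
$input = "<img src=\"what's_up_doc.jpg\">\n<img\n  alt='x'\n  src='b.png'>";

// <img\s allows a newline right after the tag name; the s modifier lets .
// cross newlines.
$output = preg_replace_callback('/<img\s.*?>/s', function ($matches) {
    // (["\']) captures the opening quote and \1 requires the same closing
    // quote, so the other kind of quote may appear inside the value.
    return preg_replace('/\bsrc\s*=\s*(["\'])(.*?)\1/s',
        'data-image=$1$2$1', $matches[0]);
}, $input);

echo $output, "\n";
```

This will still trip over a literal > inside an attribute value, so treat it as a starting point rather than a general HTML parser.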