4
votes

Given this DOM

$html=<<<'EOD'
<div class='container clickable' data-param='{"footer":"<div>Bye</div>","info":"We win"}'>
 <img src='a.jpg' />
</div>
<a href='a.html'>The A</a>
<span></span>
<span data-span-param='{"detailTag":"<span class=\"link\">Anything here</span>"}'>
 <a></a>
</span>  
EOD;  

I am trying to match all HTML tags with preg_match_all, using this expression:

$tags = array();
if(preg_match_all('~<\s*[\w]+[^>]*>|<\s*/\s*[\w]+\s*>~im',$html,$matchall,PREG_SET_ORDER)){
   foreach($matchall as $m){
       $tags[] = $m[0];
   }
}  
print_r($tags);

The output of this expression is:

Array
(
[0] => <div class='container clickable' data-param='{"footer":"<div>
[1] => </div>
[2] => <img src='a.jpg' />
[3] => </div>
[4] => <a href='a.html'>
[5] => </a>
[6] => <span>
[7] => </span>
[8] => <span data-span-param='{"detailTag":"<span class=\"link\">
[9] => </span>
[10] => <a>
[11] => </a>
[12] => </span>
)

My expected output is this:

Array
(
[0] => <div class='container clickable' data-param='{"footer":"<div>Bye</div>","info":"We win"}'>
[1] => <img src='a.jpg' />
[2] => </div>
[3] => <a href='a.html'>
[4] => </a>
[5] => <span>
[6] => </span>
[7] => <span data-span-param='{"detailTag":"<span class=\"link\">Anything here</span>"}'>
[8] => <a>
[9] => </a>
[10] => </span>
)

I need help with an expression that treats tags inside quoted attribute values as part of the attribute, not as separate tags.

3
Have you considered not using a regex and instead using a DOM parser? See: stackoverflow.com/a/1732454 - Rocket Hazmat
I've solved the problem and it's working. See my answer below. - amachree tamunoemi
@user5634507. My fiddle too. - user4227915

3 Answers

1
votes

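This matches all HTML tags without treating tags inside single- or double-quoted attribute values as separate tags. It first encodes <, > and " inside quoted values as entities, then runs the original expression: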

<?php
$html=<<<'EOD'
<div class='container clickable' data-param='{"footer":"<div>Bye</div>","info":"We win"}'>
<img src='a.jpg' />
</div>
<a href='a.html'>The A</a>
<span></span>
<span data-span-param='{"detailTag":"<span class=\"link\">Anything here</span>"}'>
<a></a>
</span>
EOD;

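// Decode any existing &lt; / &gt; entities so all angle brackets are handled the same way.
$html = preg_replace('~\&lt\;~is','<',$html);
$html = preg_replace('~\&gt\;~is','>',$html);
//$html = preg_replace('~\&quot\;~is','"',$html);
// Give empty attribute values a placeholder ('.'), because the pattern below
// needs at least one character between the quotes. Note that this alters $html.
$html = preg_replace('~=\s*\'\s*\'~is','=\'.\'',$html);
$html = preg_replace('~=\s*"\s*"~is','="."',$html);

// Find every quoted attribute value, including its closing quote.
if(preg_match_all('~((?<==\')(?:.(?!\'))*.)\'|((?<==")(?:.(?!"))*.)"~im',$html,$matchall,PREG_SET_ORDER)){
  foreach($matchall as $m){
    // Only values that contain < or > need to be encoded.
    if(preg_match('~\<~is',$m[0],$mtch1)||preg_match('~\>~is',$m[0],$mtch2)){
        $end = $m[0][(strlen($m[0])-1)];               // the closing quote character
        $replace1 = substr($m[0],0,(strlen($m[0])-1)); // the raw value without that quote
        // Encode ", < and > as entities.
        $replace = preg_replace('~"~is','&quot;',$replace1);
        $replace = preg_replace('~<~is','&lt;',$replace);
        $replace = preg_replace('~>~is','&gt;',$replace);
        // Swap the raw quoted value in $html for its encoded version.
        $html = preg_replace("~".preg_quote(($end.$replace1.$end),'~')."~is",$end.$replace.$end,$html);
    }
  }
}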

$tags = array();
if(preg_match_all('~<\s*[\w]+[^>]*>|<\s*/\s*[\w]+\s*>~im',$html,$matchall,PREG_SET_ORDER)){
  foreach($matchall as $m){ 
    $tags[] = $m[0];
  }
}

print_r($tags);
?> 

Outputs:

Array  
(  
[0] => <div class='container clickable' data-param='{&quot;footer&quot;:&quot;&lt;div&gt;Bye&lt;/div&gt;&quot;,&quot;info&quot;:&quot;We win&quot;}'>  
[1] => <img src='a.jpg' />  
[2] => </div>  
[3] => <a href='a.html'>  
[4] => </a>  
[5] => <span>  
[6] => </span>  
[7] => <span data-span-param='{&quot;detailTag&quot;:&quot;&lt;span class=\&quot;link\&quot;&gt;Anything here&lt;/span&gt;&quot;}'>  
[8] => <a>
[9] => </a>  
[10] => </span>  
)
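
The same encode-then-match idea can be sketched more compactly with preg_replace_callback. This is only an illustration, not the code from the answer above: the variable names and the attribute-value pattern are assumptions, and it does not need the '.' placeholder for empty values.

```php
<?php
// Input copied from the question (nowdoc, so nothing is interpolated).
$html = <<<'EOD'
<div class='container clickable' data-param='{"footer":"<div>Bye</div>","info":"We win"}'>
 <img src='a.jpg' />
</div>
<a href='a.html'>The A</a>
<span></span>
<span data-span-param='{"detailTag":"<span class=\"link\">Anything here</span>"}'>
 <a></a>
</span>
EOD;

// Step 1: encode ", < and > inside every quoted attribute value.
// Group 1 is "=" plus optional whitespace, group 2 the quote, group 3 the value.
$encoded = preg_replace_callback(
    '~(=\s*)([\'"])(.*?)\2~s',
    function (array $m): string {
        $value = strtr($m[3], ['"' => '&quot;', '<' => '&lt;', '>' => '&gt;']);
        return $m[1] . $m[2] . $value . $m[2];
    },
    $html
);

// Step 2: run the expression from the question on the encoded markup.
preg_match_all('~<\s*[\w]+[^>]*>|<\s*/\s*[\w]+\s*>~', $encoded, $matches);
$tags = $matches[0];

print_r($tags);
```

Like the code above, this assumes every `=` followed by a quote starts an attribute value, so text content that happens to contain `='...'` would be encoded as well.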
1
votes

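This regex works as a drop-in replacement in your code; no additional code is required. It consumes each quoted attribute value as a single unit, so a > inside quotes does not end the tag: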

<\s*(?:/\s*)?\w++(?>[^>'"]++|'[^']+'|"[^"]+")*>
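
As a rough check, the pattern can be dropped into preg_match_all with the markup from the question. The variable names here are only illustrative.

```php
<?php
// Input copied from the question (nowdoc, so nothing is interpolated).
$html = <<<'EOD'
<div class='container clickable' data-param='{"footer":"<div>Bye</div>","info":"We win"}'>
 <img src='a.jpg' />
</div>
<a href='a.html'>The A</a>
<span></span>
<span data-span-param='{"detailTag":"<span class=\"link\">Anything here</span>"}'>
 <a></a>
</span>
EOD;

// The pattern from this answer; only the single quotes are escaped for the PHP string.
// Inside a tag it repeats three alternatives: unquoted text, a '...' value, a "..." value.
$pattern = '~<\s*(?:/\s*)?\w++(?>[^>\'"]++|\'[^\']+\'|"[^"]+")*>~';

preg_match_all($pattern, $html, $matches);
$tags = $matches[0];

print_r($tags); // 11 tags, with both data attributes kept intact
```

One caveat: `'[^']+'` and `"[^"]+"` require at least one character between the quotes, so a tag with an empty value such as `alt=''` is not matched. Changing those two `+` quantifiers to `*` is one possible adjustment.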


0
votes

I think the best way to solve this problem is to use a recursive regular expression, which matches balanced pairs of angle brackets, including any pairs nested inside a tag:

(?!<\s*>)\<(?:(?>[^<>]+)|(?R))*\>
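
A minimal sketch of how this pattern might be used from PHP with the markup from the question. The variable names are assumptions made for the example.

```php
<?php
// Input copied from the question (nowdoc, so nothing is interpolated).
$html = <<<'EOD'
<div class='container clickable' data-param='{"footer":"<div>Bye</div>","info":"We win"}'>
 <img src='a.jpg' />
</div>
<a href='a.html'>The A</a>
<span></span>
<span data-span-param='{"detailTag":"<span class=\"link\">Anything here</span>"}'>
 <a></a>
</span>
EOD;

// The pattern from this answer. (?R) recurses into the whole pattern, so a
// nested <...> pair inside a tag is swallowed as part of the outer match.
$pattern = '~(?!<\s*>)\<(?:(?>[^<>]+)|(?R))*\>~';

preg_match_all($pattern, $html, $matches);
$tags = $matches[0];

print_r($tags); // 11 tags, with both data attributes kept intact
```

This relies on the angle brackets inside attribute values being balanced. A lone bracket, as in `<a title='x > y'>`, ends the match early at the first `>`.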
