0
votes

I've tried to strip HTML tags using a regex replace with the pattern "<[^>]*>" from Word-generated HTML that looks like this:

<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:st1="urn:schemas-microsoft-com:office:smarttags" xmlns="http://www.w3.org/TR/REC-html40">

<head> <meta http-equiv=Content-Type content="text/html; charset=iso-8859-2"> <meta name=Generator content="Microsoft Word 11 (filtered medium)"> <!--[if !mso]> <style>

v:* {behavior:url(#default#VML);}

o:* {behavior:url(#default#VML);}

w:* {behavior:url(#default#VML);}

.shape {behavior:url(#default#VML);}

</style> <![endif]--><o:SmartTagType namespaceuri="urn:schemas-microsoft-com:office:smarttags" name="place" downloadurl="http://www.5iantlavalamp.com/"/> <!--[if !mso]> <style>

st1:*{behavior:url(#default#ieooui) }

</style> <![endif]--> <style> <!-- /* Font Definitions */ @font-face {font-family:Tahoma; panose-1:2 11 6 4 3 5 4 4 2 4;} /* Style Definitions */ p.MsoNormal, li.MsoNormal, div.MsoNormal {margin:0in; margin-bottom:.0001pt; font-size:12.0pt; font-family:"Times New Roman";} a:link, span.MsoHyperlink {color:blue; text-decoration:underline;} a:visited, span.MsoHyperlinkFollowed {color:purple; text-decoration:underline;} span.EmailStyle17 {mso-style-type:personal; font-family:Arial; color:windowtext;} span.EmailStyle18 {mso-style-type:personal-reply; font-family:Arial; color:navy;} @page Section1 {size:8.5in 11.0in; margin:1.0in 1.25in 1.0in 1.25in;} div.Section1 {page:Section1;} --> </style>

</head>

Everything works fine except for the bolded lines above. Does anybody have an idea how to match them too?

Thanks,

Aleksandar

3
You should put your HTML code into the CODE block (101/010 button). It makes reading it much easier - Mitch Dempsey

3 Answers

3
votes

Your regex does not take into account that comments can contain > characters that do not terminate the comment. Try this regex:

<!--.*?-->|<[^>]*>

You'll have to turn on the option that makes . match line breaks. How to do that depends on the application or programming language you're using the regex with. E.g. in Perl you'd use the /s flag; in .NET you'd set RegexOptions.Singleline.
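
As an illustration in one more language, here is a minimal Python sketch of the same regex. In Python, the "dot matches line breaks" option is re.DOTALL. The function name strip_tags and the sample string are made up for this example and are not from the question.

```python
import re

# The comment alternative comes first, so a '>' inside a comment
# cannot end the match early. re.DOTALL lets '.' match line breaks.
TAG_OR_COMMENT = re.compile(r"<!--.*?-->|<[^>]*>", re.DOTALL)


def strip_tags(html: str) -> str:
    """Remove HTML comments and tags, keeping the text between them."""
    return TAG_OR_COMMENT.sub("", html)


# Hypothetical input: a comment that spans two lines and contains '>'.
sample = "<p>Hello <!-- a > inside\na comment --> world</p>"
print(strip_tags(sample))
```

In the Word sample, each <!--[if !mso]> ... <![endif]--> run is a single comment, so the first alternative should remove it together with the style rules inside it.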

0
votes

People generally advise the use of a parser instead of regex when dealing with HTML.
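
For comparison, a parser-based version might look like the sketch below. It uses Python's standard html.parser module; the class name TextExtractor and the helper extract_text are invented for this example, and skipping the contents of style and script elements is an assumption about what "strip tags" should mean here.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, skipping comments, <style> and <script>."""

    SKIPPED = ("style", "script")

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Comments never reach this method; they go to handle_comment.
        if not self._skip_depth:
            self.parts.append(data)


def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

The parser treats a whole comment as one unit, so a > inside a comment is not a problem, and the style rules are dropped without a separate pattern.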

In case you have to use a regex :) you could use:

<style>.*?</style>
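
A rough Python sketch of how that pattern could be combined with the tag pattern from the question: remove the style blocks first, then the remaining tags. The function name is hypothetical, and re.DOTALL is needed because the style rules span several lines. Note that the pattern as written only matches a bare <style> tag without attributes, which is what the Word sample contains.

```python
import re

# Style blocks, contents included; DOTALL so '.' crosses line breaks.
STYLE_BLOCK = re.compile(r"<style>.*?</style>", re.DOTALL | re.IGNORECASE)
# The original tag pattern from the question.
TAG = re.compile(r"<[^>]*>")


def strip_styles_then_tags(html: str) -> str:
    """Drop <style>...</style> blocks, then strip the remaining tags."""
    without_styles = STYLE_BLOCK.sub("", html)
    return TAG.sub("", without_styles)


# Hypothetical input with a multi-line style block.
sample = "<head><style>\np {color:red;}\n</style></head><p>Hi</p>"
print(strip_styles_then_tags(sample))
```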