2
votes

I would like a regex to remove HTML tags and entities such as &nbsp;, &quot;, etc. from a string. The regex I have removes the HTML tags but not the entities. I'm using .NET 4.

Thanks

CODE:

     String result = Regex.Replace(blogText, @"<[^>]*>", String.Empty);
2
Before you proceed, take a look here: stackoverflow.com/questions/1732348/… - Zruty
Regex and HTML are never a good mix. Have a look @ stackoverflow.com/questions/5496704/strip-html-and-css-in-c - Michael Paulukonis
This could easily be done with HtmlAgilityPack; see "Stripping all html tags with Html Agility Pack" - Oleks

2 Answers

1
votes

Don't use regular expressions; use the HTML Agility Pack:

http://www.codeplex.com/htmlagilitypack
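
For illustration, here is a minimal sketch of how that might look, assuming the HtmlAgilityPack assembly is referenced in the project. The class name AgilityPackDemo, the helper ToPlainText and the sample string are made up for this example. Note that this decodes entities such as &nbsp; and &quot; into the characters they stand for rather than deleting them, which may or may not be what you want.

```csharp
using System;
using HtmlAgilityPack; // third-party library, must be referenced separately

public static class AgilityPackDemo
{
    // Hypothetical helper: returns the text content of an HTML fragment.
    public static string ToPlainText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // InnerText concatenates the text nodes, which drops the tags.
        // DeEntitize then converts entities (&nbsp;, &quot;, ...) to characters.
        return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
    }

    public static void Main()
    {
        // Sample input invented for the example.
        string blogText = "<p>Hello&nbsp;<b>world</b> &quot;quoted&quot;</p>";
        Console.WriteLine(ToPlainText(blogText));
    }
}
```

If the entities really have to be removed instead of decoded, one option is to post-process the InnerText value yourself before or instead of calling DeEntitize.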

0
votes

If you want to build on what you already created, you can change it to the following:

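String result = Regex.Replace(blogText, @"<[^>]*>|&#?\w+;?", String.Empty);

It means...

  1. Either match tags as you defined...
  2. ...or match a & followed by an optional # (for numeric references such as &#160;), then at least one word character \w -- as many as possible -- and an optional trailing ; so that no stray semicolon is left behind.

Neither of the two alternatives handles every nasty case (text such as "AT&T" or a bare < will trip it up), but it usually works.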
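
To make the trade-offs concrete, here is a small console sketch. It uses the pattern `<[^>]*>|&#?\w+;?`, which consumes numeric references and the trailing semicolon as well. The class name HtmlStripDemo, the helper StripHtml and the sample strings are invented for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlStripDemo
{
    // Hypothetical helper: removes anything that looks like a tag or an entity.
    public static string StripHtml(string input)
    {
        return Regex.Replace(input, @"<[^>]*>|&#?\w+;?", String.Empty);
    }

    public static void Main()
    {
        // Typical case: tags and entities are removed.
        // Removing &nbsp; also removes the only separator between the words.
        Console.WriteLine(StripHtml("<p>Hello&nbsp;<b>world</b> &quot;quoted&quot;</p>"));
        // prints "Helloworld quoted"

        // Nasty case 1: a literal ampersand in ordinary text.
        Console.WriteLine(StripHtml("AT&T rocks"));
        // prints "AT rocks"

        // Nasty case 2: bare < and > that are not tags.
        Console.WriteLine(StripHtml("1 < 2 and 3 > 2"));
        // prints "1  2" (everything between < and > is gone)
    }
}
```

If inputs like the last two matter, that is a sign the regex approach has reached its limit and an HTML parser is the safer choice.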