Regex to Clean Out HTML from Text

Ahh, the power (and sometimes slight confusion) of regular expressions.  This seems to work well to remove HTML from text:

string htmlstring = "<p>some chunk of <b>HTML-infested</b> text</p>";

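// Assumed reconstruction: the start of this pattern was garbled in the source.
// This is the commonly circulated tag-matching pattern with the same ending.
Regex cleanOutHtml =
    new Regex(@"</?\w+((\s+\w+(\s*=\s*(?:"".*?""|'.*?'|[^'"">\s]+))?)+\s*|\s*)/?>");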


string onlytext = cleanOutHtml.Replace(htmlstring, "");

Weee there you have it.
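
Stripping tags alone still leaves entities such as &amp; and stray whitespace behind. Here is a minimal sketch of a fuller cleanup, not a definitive implementation. The class name HtmlCleaner and the method name ToPlainText are made up for illustration, and the tag pattern is assumed to be the common tag-matching one. It removes tags, decodes entities with WebUtility.HtmlDecode, and collapses runs of whitespace.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, named for illustration only.
public static class HtmlCleaner
{
    // Assumption: the common pattern for opening, closing, and self-closing
    // tags with optional quoted or unquoted attributes.
    private static readonly Regex TagPattern = new Regex(
        @"</?\w+((\s+\w+(\s*=\s*(?:"".*?""|'.*?'|[^'"">\s]+))?)+\s*|\s*)/?>",
        RegexOptions.Compiled);

    // Any run of whitespace, including the gaps left where tags were removed.
    private static readonly Regex WhitespaceRun =
        new Regex(@"\s+", RegexOptions.Compiled);

    public static string ToPlainText(string html)
    {
        if (string.IsNullOrEmpty(html))
        {
            return string.Empty;
        }

        // Replace each tag with a space so adjacent words do not run together.
        string withoutTags = TagPattern.Replace(html, " ");

        // Turn entities such as &amp; back into literal characters.
        string decoded = WebUtility.HtmlDecode(withoutTags);

        // Collapse whitespace runs and trim the ends.
        return WhitespaceRun.Replace(decoded, " ").Trim();
    }
}
```

Calling HtmlCleaner.ToPlainText("<p>Fish &amp; <b class=\"x\">chips</b></p>") returns "Fish & chips". Two caveats: attribute names are matched with \w+, so a tag carrying a hyphenated attribute such as data-id is left in place, and a pattern like this tidies text for display rather than sanitizing untrusted input.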

This entry was posted in C#, Regular Expressions.

One Response to Regex to Clean Out HTML from Text

  1. Greg says:

    Here’s another way to do it that lets you have a “white-list” of acceptable html:

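    // For a whitelisted name the conditional tries the literal "notag", which never matches, so the tag survives.
    // The \b stops short names such as "b" and "i" from also whitelisting "br" or "img".
    const string pattern = @"</?(?(?=(?:b|span|i|ul|li)\b)notag|[a-zA-Z0-9]+)(?:\s[a-zA-Z0-9\-]+=?(?:(["",']?).*?\1?)?)*\s*/?>";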
    string filtered = Regex.Replace(html, pattern, "", RegexOptions.IgnoreCase | RegexOptions.Multiline);
    return filtered;
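
    The same white-list idea can be kept outside the pattern. The sketch below swaps in a different technique from the one above: it matches every tag with a deliberately simple pattern and lets a MatchEvaluator decide, from a HashSet, whether to keep it. The names TagWhitelistFilter, Filter, and Allowed are hypothetical, and the allowed tags are just the ones listed in the pattern above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, named for illustration only.
public static class TagWhitelistFilter
{
    // Captures the tag name in group 1. Assumption: no attribute value
    // contains a literal '>' character.
    private static readonly Regex AnyTag = new Regex(
        @"</?([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>", RegexOptions.Compiled);

    // Tag names that are allowed to stay, compared case-insensitively.
    private static readonly HashSet<string> Allowed = new HashSet<string>(
        new[] { "b", "span", "i", "ul", "li" },
        StringComparer.OrdinalIgnoreCase);

    public static string Filter(string html)
    {
        if (string.IsNullOrEmpty(html))
        {
            return string.Empty;
        }

        // Keep the whole tag when its name is allowed; otherwise drop it.
        return AnyTag.Replace(
            html,
            m => Allowed.Contains(m.Groups[1].Value) ? m.Value : string.Empty);
    }
}
```

    With this, TagWhitelistFilter.Filter("<div><b>bold</b> and <br/>break</div>") returns "<b>bold</b> and break". Changing the white-list then means editing the set rather than the pattern. Kept tags keep all of their attributes, so this is a formatting filter, not a defense against hostile markup.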
