Regex to Clean Out HTML from Text

Ahh, the power (and occasional confusion) of regular expressions. This pattern seems to work well for removing HTML tags from text:

string htmlstring = "<p>Some chunk of <b>HTML-infested</b> text</p>"; // any HTML-laden input

Regex cleanOutHtml =
    new Regex(@"</?\w+((\s+\w+(\s*=\s*(?:"".*?""|'.*?'|[^'"">\s]+))?)+\s*|\s*)/?>");

string onlytext = cleanOutHtml.Replace(htmlstring, "");

Weee, there you have it.
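For anyone who wants to try the idea end to end, here is a minimal, self-contained sketch. It makes a few assumptions that are not in the post: the class name `HtmlCleaner` and method `StripTags` are made up for illustration, the tag pattern is a deliberately simplified stand-in rather than the exact expression above, and it adds an entity-decoding step with `WebUtility.HtmlDecode`, since removing tags alone leaves things like `&amp;` behind.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlCleaner
{
    // Simplified stand-in pattern (an assumption, not the post's exact regex):
    // an opening or closing tag name followed by anything up to the next '>'.
    private static readonly Regex Tag =
        new Regex(@"</?\w+[^>]*>", RegexOptions.Compiled);

    // Runs of whitespace left behind once the tags are gone.
    private static readonly Regex Spaces =
        new Regex(@"\s+", RegexOptions.Compiled);

    public static string StripTags(string html)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;

        string noTags = Tag.Replace(html, "");             // drop the tags, as in the post
        string decoded = WebUtility.HtmlDecode(noTags);    // turn &amp; back into &
        return Spaces.Replace(decoded, " ").Trim();        // tidy leftover whitespace
    }
}
```

Calling `HtmlCleaner.StripTags("<p>Fish &amp; <b class=\"x\">chips</b></p>")` should give `Fish & chips`. Like any regex-based approach, this is only a rough cleaner: it does not handle HTML comments, the contents of `script` or `style` elements, or a `>` inside an attribute value, so a real HTML parser is the safer choice for untrusted input.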

This entry was posted in C#, Regular Expressions.

1 Response to Regex to Clean Out HTML from Text

  1. Greg says:

    Here’s another way to do it that lets you keep a “white-list” of acceptable HTML tags:

    [cc lang="csharp"]
    const string pattern = @")notag|[a-zA-Z0-9]+)(?:\s[a-zA-Z0-9\-]+=?(?:(["",']?).*?\1?)?)*\s*/?>";
    string filtered = Regex.Replace(html, pattern, "", RegexOptions.IgnoreCase | RegexOptions.Multiline);
    return filtered;
    [/cc]
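
The white-list idea can also be sketched without packing the list into the pattern itself. The version below is not Greg's expression; it swaps in a different technique, a `MatchEvaluator` callback that looks each tag name up in a set. The class name `HtmlWhiteList`, the method `Filter`, and the particular tags in the set are all hypothetical choices for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HtmlWhiteList
{
    // Hypothetical white-list of tag names; adjust to taste.
    private static readonly HashSet<string> Allowed =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "b", "i", "u", "p", "br" };

    // Captures the tag name; the rest of the tag up to '>' is matched loosely.
    private static readonly Regex Tag =
        new Regex(@"</?(?<name>[a-zA-Z0-9]+)[^>]*>", RegexOptions.Compiled);

    public static string Filter(string html)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;

        // Keep the whole tag when its name is white-listed, otherwise remove it.
        return Tag.Replace(html, m =>
            Allowed.Contains(m.Groups["name"].Value) ? m.Value : string.Empty);
    }
}
```

With this sketch, `HtmlWhiteList.Filter("<div><b>bold</b> and <a href=\"#\">link</a><br/></div>")` should return `<b>bold</b> and link<br/>`. Keeping the list in a `HashSet` makes it easy to change without touching the regex. Note that an allowed tag is kept as written, attributes included, so this is a formatting filter rather than a security sanitizer.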
