Ahh, the power (and sometimes slight confusion) of regular expressions. This seems to work well to remove HTML from text:
// requires: using System.Text.RegularExpressions;
string htmlstring = "<p>some chunk o' <b>HTML-infested</b> text</p>"; // example input
Regex cleanOutHtml =
new Regex(@"</?\w+((\s+\w+(\s*=\s*(?:""(.|\n)*?""|'(.|\n)*?'|[^'"">\s]+))?)+\s*|\s*)/?>");
string onlytext = cleanOutHtml.Replace(htmlstring, "");
Weee there you have it.
Here’s another way to do it that lets you keep a “white-list” of acceptable HTML tags:
string filtered = Regex.Replace(html, pattern, "", RegexOptions.IgnoreCase | RegexOptions.Multiline);
return filtered;
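The snippet above doesn't show what pattern looks like, so here is a minimal sketch of one way to build it. Everything in it is an assumption for illustration: the HtmlWhitelist class, the Filter method, and the list of allowed tags are hypothetical names, and the pattern is a plain negative lookahead rather than anything from the original code. It also only checks tag names, so attributes on allowed tags pass through untouched, and a ">" inside an attribute value will confuse it.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlWhitelist
{
    // Hypothetical white-list of tag names to keep; adjust to taste.
    private const string AllowedTags = "b|i|em|strong|p|br";

    // Matches an opening or closing tag whose name is NOT on the white-list.
    // The negative lookahead (?!...) rejects any tag name in AllowedTags;
    // \b stops "p" from also allowing "pre".
    private static readonly string Pattern =
        @"</?(?!(?:" + AllowedTags + @")\b)[a-z][^>]*>";

    public static string Filter(string html)
    {
        // Same call as in the snippet above: strip every match, ignoring case.
        return Regex.Replace(html, Pattern, "",
            RegexOptions.IgnoreCase | RegexOptions.Multiline);
    }
}
```

With this sketch, HtmlWhitelist.Filter("<p>Hi <script>x</script></p>") gives back "<p>Hi x</p>". Note that only the tags are removed; whatever text sat between them stays behind.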