C#: Regular Expressions to Filter HTML to a Whitelist of Allowable Tags

If you need to “sanitize” HTML down to a whitelist of allowable tags, here’s a bit of code that may help. It is a string extension method that uses a regular expression to remove tags that are not on the whitelist and to drop attributes that are not allowed for a given tag. The original code is from http://refactormycode.com/codes/333-sanitize-html, so please do visit that site for caveats and alternatives.

using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

/// <summary>
/// Filters HTML to the valid HTML tag set (with only the attributes specified).
/// Thanks to http://refactormycode.com/codes/333-sanitize-html for the original.
/// </summary>
public static class HtmlSanitizeExtension
{
    private const string HTML_TAG_PATTERN = @"(?'tag_start'</?)(?'tag'\w+)((\s+(?'attr'(?'attr_name'\w+)(\s*=\s*(?:"".*?""|'.*?'|[^'"">\s]+)))?)+\s*|\s*)(?'tag_end'/?>)";

    /// <summary>
    /// A dictionary of allowed tags and their respective allowed attributes. If no
    /// attributes are provided, all attributes will be stripped from the allowed tag.
    /// </summary>
    public static Dictionary<string, List<string>> ValidHtmlTags = new Dictionary<string, List<string>> {
            { "p", new List<string>() },
            { "strong", new List<string>() },
            { "ul", new List<string>() },
            { "li", new List<string>() },
            { "a", new List<string> { "href", "target" } }
    };

    /// <summary>
    /// Extension filters your HTML to the whitelist specified in the ValidHtmlTags dictionary.
    /// </summary>
    public static string FilterHtmlToWhitelist(this string text)
    {
        Regex htmlTagExpression = new Regex(HTML_TAG_PATTERN, RegexOptions.Singleline | RegexOptions.IgnoreCase | RegexOptions.Compiled);

        return htmlTagExpression.Replace(text, m =>
        {
            // drop any matched tag that is not on the whitelist
            if (!ValidHtmlTags.ContainsKey(m.Groups["tag"].Value))
                return String.Empty;

            StringBuilder generatedTag = new StringBuilder(m.Length);

            Group tagStart = m.Groups["tag_start"];
            Group tagEnd = m.Groups["tag_end"];
            Group tag = m.Groups["tag"];
            Group tagAttributes = m.Groups["attr"];

            // rebuild the tag from its opening bracket and name
            generatedTag.Append(tagStart.Success ? tagStart.Value : "<");
            generatedTag.Append(tag.Value);

            foreach (Capture attr in tagAttributes.Captures)
            {
                int indexOfEquals = attr.Value.IndexOf('=');

                // don't proceed any further if there is no equal sign or just an equal sign
                if (indexOfEquals < 1)
                    continue;

                // trim so that "href = ..." still yields the name "href"
                string attrName = attr.Value.Substring(0, indexOfEquals).Trim();

                // check to see if the attribute name is allowed and write attribute if it is
                if (ValidHtmlTags[tag.Value].Contains(attrName))
                {
                    generatedTag.Append(' ');
                    generatedTag.Append(attr.Value);
                }
            }

            generatedTag.Append(tagEnd.Success ? tagEnd.Value : ">");

            return generatedTag.ToString();
        });
    }
}
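
The attribute loop works because a named group inside a repeated construct keeps every capture, not just the last one. Here is a minimal sketch of that mechanism, reusing the same pattern as the listing. The class name TagPatternDemo and the method Describe are made up for illustration and are not part of the original code.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TagPatternDemo
{
    // Same pattern as HTML_TAG_PATTERN in the listing above.
    private const string Pattern = @"(?'tag_start'</?)(?'tag'\w+)((\s+(?'attr'(?'attr_name'\w+)(\s*=\s*(?:"".*?""|'.*?'|[^'"">\s]+)))?)+\s*|\s*)(?'tag_end'/?>)";

    // Returns the tag name followed by each raw name=value attribute capture.
    // Returns an empty list when the text does not match the pattern.
    public static List<string> Describe(string tagText)
    {
        var parts = new List<string>();
        Match m = Regex.Match(tagText, Pattern, RegexOptions.Singleline | RegexOptions.IgnoreCase);
        if (!m.Success)
            return parts;

        parts.Add(m.Groups["tag"].Value);

        // Group.Captures holds one entry per repetition of the 'attr' group.
        foreach (Capture attr in m.Groups["attr"].Captures)
            parts.Add(attr.Value);

        return parts;
    }
}
```

Calling TagPatternDemo.Describe("<a href=\"http://example.com\" onclick=\"evil()\">") yields the tag name a followed by the two attribute strings, which is the raw material the extension filters by attribute name.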

Here’s how to use the extension:

string rawHtml = "<p>this is some text<br/><span>this is moretext</span><br/></p>";
string filteredHtml = rawHtml.FilterHtmlToWhitelist();
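
Note that the whitelist checks attribute names only, so an allowed href can still carry a script URL such as javascript:alert(1). One possible mitigation, sketched here under assumptions, is to accept only absolute URLs with known schemes. The class name HrefSchemeCheck, the method IsSafeHref, and the list of schemes are all hypothetical and not part of the original code.

```csharp
using System;

public static class HrefSchemeCheck
{
    // Hypothetical allow-list of URI schemes; adjust it to your own needs.
    private static readonly string[] AllowedSchemes = { "http", "https", "mailto" };

    // Returns true only when the value is an absolute URI with an allowed scheme.
    // Relative links and anything that fails to parse are rejected in this sketch.
    public static bool IsSafeHref(string hrefValue)
    {
        if (string.IsNullOrWhiteSpace(hrefValue))
            return false;

        // Strip the surrounding quotes that come along with a captured attribute value.
        string candidate = hrefValue.Trim().Trim('"', '\'').Trim();

        Uri uri;
        if (!Uri.TryCreate(candidate, UriKind.Absolute, out uri))
            return false;

        // Uri.Scheme is reported in lower case, so a plain comparison is enough.
        return Array.IndexOf(AllowedSchemes, uri.Scheme) >= 0;
    }
}
```

Inside the attribute loop of the extension, one could pass the part of attr.Value after the equal sign to IsSafeHref and skip the href when it returns false. An allow-list is used rather than a block-list on purpose: a value that is unknown or cannot be parsed is dropped instead of let through.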

1 Response to C#: Regular Expressions to Filter HTML to a Whitelist of Allowable Tags

  1. Mickael says:

    You forgot a XSS weakness… For example, if I have a string with “Hello World ! ANCHORSTART href=”alert(‘NOOB’)” rel=”nofollow”>Good Bye ANCHOREND”, you’ll have a beautiful alert…
