C#: Regular Expressions to Filter HTML to a Whitelist of Allowable Tags

If you need to “sanitize” HTML down to a whitelist of allowable tags, here’s a bit of code that may help. It is a string extension method that uses a regular expression to strip every tag that is not on the list, along with every attribute that is not allowed for its tag. The original code was found here: http://refactormycode.com/codes/333-sanitize-html, so please do visit the site for caveats and alternatives.

 
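using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

/// <summary>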
/// Filters HTML to the valid html tags set (with only the attributes specified)
/// 
/// Thanks to http://refactormycode.com/codes/333-sanitize-html for the original
/// </summary>
public static class HtmlSanitizeExtension
{
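    // The "=value" part of an attribute is optional so that tags with valueless attributes
    // (e.g. <input disabled>) are still matched. A tag this pattern cannot match is left in
    // the output unchanged.
    private const string HTML_TAG_PATTERN = @"(?'tag_start'</?)(?'tag'\w+)((\s+(?'attr'(?'attr_name'\w+)(\s*=\s*(?:"".*?""|'.*?'|[^'"">\s]+))?)?)+\s*|\s*)(?'tag_end'/?>)";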
 
    /// <summary>
    /// A dictionary of allowed tags and their respective allowed attributes.  If no
    /// attributes are provided, all attributes will be stripped from the allowed tag.
    /// Keys compare case-insensitively because the tag pattern is matched with IgnoreCase.
    /// </summary>
    public static Dictionary<string, List<string>> ValidHtmlTags = new Dictionary<string, List<string>>(StringComparer.OrdinalIgnoreCase) {
            { "p", new List<string>() },
            { "strong", new List<string>() }, 
            { "ul", new List<string>() }, 
            { "li", new List<string>() }, 
            { "a", new List<string> { "href", "target" } }
    };
 
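    // Built once and reused; constructing a compiled Regex on every call is expensive.
    private static readonly Regex HtmlTagExpression = new Regex(HTML_TAG_PATTERN, RegexOptions.Singleline | RegexOptions.IgnoreCase | RegexOptions.Compiled);

    /// <summary>
    /// Extension filters your HTML to the whitelist specified in the ValidHtmlTags dictionary
    /// </summary>
    public static string FilterHtmlToWhitelist(this string text)
    {
        // Nothing to filter; also avoids passing null to Regex.Replace
        if (String.IsNullOrEmpty(text))
            return text;

        return HtmlTagExpression.Replace(text, m =>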
        {
            if (!ValidHtmlTags.ContainsKey(m.Groups["tag"].Value))
                return String.Empty;
 
            StringBuilder generatedTag = new StringBuilder(m.Length);
 
            Group tagStart = m.Groups["tag_start"];
            Group tagEnd = m.Groups["tag_end"];
            Group tag = m.Groups["tag"];
            Group tagAttributes = m.Groups["attr"];
 
            generatedTag.Append(tagStart.Success ? tagStart.Value : "<");
            generatedTag.Append(tag.Value);
 
            foreach (Capture attr in tagAttributes.Captures)
            {
                int indexOfEquals = attr.Value.IndexOf('=');
 
                // don't proceed any further if there is no equal sign or just an equal sign
                if (indexOfEquals < 1)
                    continue;

                // trim so that "href = ..." yields "href" rather than "href "
                string attrName = attr.Value.Substring(0, indexOfEquals).Trim();
 
                // check to see if the attribute name is allowed and write attribute if it is
                if (ValidHtmlTags[tag.Value].Contains(attrName))
                {
                    generatedTag.Append(' ');
                    generatedTag.Append(attr.Value);
                }
            }
 
            generatedTag.Append(tagEnd.Success ? tagEnd.Value : ">");
 
            return generatedTag.ToString();
        });
    }
}
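
The attribute loop above relies on a .NET regex feature: a named group inside a repeated construct keeps one Capture per repetition. Here is a minimal standalone sketch of that behavior. It uses a deliberately simplified pattern (an assumption for illustration, not HTML_TAG_PATTERN) and a hypothetical CaptureDemo class:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of the extension above.
public static class CaptureDemo
{
    // Simplified pattern: an opening tag whose attributes are all name="value".
    private const string SimplePattern = @"<(?'tag'\w+)(?:\s+(?'attr'\w+=""[^""]*""))*\s*/?>";

    // Returns the tag name followed by the raw text of each attribute
    // for the first tag found, or an empty list if nothing matches.
    public static List<string> Describe(string html)
    {
        var parts = new List<string>();
        Match m = Regex.Match(html, SimplePattern, RegexOptions.IgnoreCase);
        if (!m.Success)
            return parts;

        parts.Add(m.Groups["tag"].Value);

        // Group.Value would only hold the last repetition;
        // Group.Captures holds every repetition in order.
        foreach (Capture attr in m.Groups["attr"].Captures)
            parts.Add(attr.Value);

        return parts;
    }
}
```

Calling CaptureDemo.Describe on `<a href="http://example.com" onclick="evil()">` yields three entries: the tag name `a`, then `href="http://example.com"`, then `onclick="evil()"`. Each attribute entry is what the extension's loop sees as attr.Value.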

Here’s how to use the extension:

 
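string rawHtml = "<p>this is some text<br/><span>this is more text</span><br/></p>";

string filteredHtml = rawHtml.FilterHtmlToWhitelist();
// filteredHtml is "<p>this is some textthis is more text</p>": br and span are not in the whitelist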
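
One caveat worth spelling out: the whitelist checks attribute names only, so an allowed href is copied through whatever its value is, including a javascript: URL. Below is a minimal sketch of a value check. It assumes a hypothetical helper, IsSafeUrl (not part of the original code), that accepts only absolute http, https and mailto URLs. Rejecting relative links is an assumption of this sketch, not a requirement.

```csharp
using System;

// Hypothetical helper class, not part of the extension above.
public static class UrlWhitelist
{
    // Returns true only for absolute URLs whose scheme is http, https or mailto.
    public static bool IsSafeUrl(string value)
    {
        if (string.IsNullOrWhiteSpace(value))
            return false;

        Uri uri;
        if (!Uri.TryCreate(value.Trim(), UriKind.Absolute, out uri))
            return false;

        // Uri.Scheme is normalized to lower case, so "JAVASCRIPT:" is caught too.
        return uri.Scheme == Uri.UriSchemeHttp
            || uri.Scheme == Uri.UriSchemeHttps
            || uri.Scheme == Uri.UriSchemeMailto;
    }
}
```

One place to call it would be the attribute loop: strip the quotes from the value of an href and skip the attribute when IsSafeUrl returns false.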

2 Responses to C#: Regular Expressions to Filter HTML to a Whitelist of Allowable Tags

  1. Mickael says:

    You forgot an XSS weakness… For example, if I have a string with “Hello World! <a href="alert('NOOB')" rel="nofollow">Good Bye</a>”, you’ll have a beautiful alert…

  2. Caraxson says:

    Not sure if this is correct… I updated the code to address Mickael’s point. I also added the ValidHtmlTags[tag.Value] != null check so that we don’t have to create a list instance for every entry.

    if (ValidHtmlTags[tag.Value] != null &&
        ValidHtmlTags[tag.Value].Contains(attrName))
    {
        generatedTag.Append(' ');
        generatedTag.Append(attrName);
        generatedTag.Append("=\"");

        // drop surrounding whitespace and quotes, then encode and re-quote the value
        var attrValue = attr.Value.Substring(indexOfEquals + 1).Trim();
        attrValue = HttpUtility.HtmlEncode(attrValue.Trim('\'', '"'));

        generatedTag.Append(attrValue);
        generatedTag.Append("\"");
    }
