【Question title】: HTML Agility Pack strip tags NOT IN whitelist
【Posted】: 2011-03-07 15:45:29
【Question】:

I am trying to create a function that removes HTML tags and attributes that are not in a whitelist. I have the following HTML:

<b>first text </b>
<b>second text here
       <a>some text here</a>
 <a>some text here</a>

 </b>
<a>some twxt here</a>

I am using HTML Agility Pack, and the code I have so far is:

static List<string> WhiteNodeList = new List<string> { "b" };
static List<string> WhiteAttrList = new List<string> { };
static HtmlNode htmlNode;
public static void RemoveNotInWhiteList(out string _output, HtmlNode pNode, List<string> pWhiteList, List<string> attrWhiteList)
{

 // remove all attributes not on white list
 foreach (var item in pNode.ChildNodes)
 {
  item.Attributes.Where(u => attrWhiteList.Contains(u.Name) == false).ToList().ForEach(u => RemoveAttribute(u));

 }

 // remove all html and their innerText and attributes if not on whitelist.
 //pNode.ChildNodes.Where(u => pWhiteList.Contains(u.Name) == false).ToList().ForEach(u => u.Remove());
 //pNode.ChildNodes.Where(u => pWhiteList.Contains(u.Name) == false).ToList().ForEach(u => u.ParentNode.ReplaceChild(ConvertHtmlToNode(u.InnerHtml),u));
 //pNode.ChildNodes.Where(u => pWhiteList.Contains(u.Name) == false).ToList().ForEach(u => u.Remove());

 for (int i = 0; i < pNode.ChildNodes.Count; i++)
 {
  if (!pWhiteList.Contains(pNode.ChildNodes[i].Name))
  {
   HtmlNode _newNode = ConvertHtmlToNode(pNode.ChildNodes[i].InnerHtml);
   pNode.ChildNodes[i].ParentNode.ReplaceChild(_newNode, pNode.ChildNodes[i]);
   if (pNode.ChildNodes[i].HasChildNodes && !string.IsNullOrEmpty(pNode.ChildNodes[i].InnerText.Trim().Replace("\r\n", "")))
   {
    HtmlNode outputNode1 = pNode.ChildNodes[i];
    for (int j = 0; j < pNode.ChildNodes[i].ChildNodes.Count; j++)
    {
     string _childNodeOutput;
     RemoveNotInWhiteList(out _childNodeOutput,
          pNode.ChildNodes[i], WhiteNodeList, WhiteAttrList);
     pNode.ChildNodes[i].ReplaceChild(ConvertHtmlToNode(_childNodeOutput), pNode.ChildNodes[i].ChildNodes[j]);
     i++;
    }
   }
  }
 }

 // Console.WriteLine(pNode.OuterHtml);
 _output = pNode.OuterHtml;
}  

private static void RemoveAttribute(HtmlAttribute u)
{
 u.Value = u.Value.ToLower().Replace("javascript", "");
 u.Remove();

}

public static HtmlNode ConvertHtmlToNode(string html)
{
 HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
 doc.LoadHtml(html);
 if (doc.DocumentNode.ChildNodes.Count == 1)
  return doc.DocumentNode.ChildNodes[0];
 else return doc.DocumentNode;
}

The output I want to achieve is:

<b>first text </b>
<b>second text here
       some text here
 some text here

 </b>
some twxt here

This means I only want to keep the <b> tags.
The reason I am doing this is that some users copy-paste from MS Word into my WYSIWYG HTML editor.

Thanks!

【Comments】:

    Tags: c# tags html-parsing html-agility-pack sanitize


    【Solution 1】:

    Thanks for your code - great!!!!

    I made some optimizations...

    using System.Collections.Generic;
    using System.Linq;
    using HtmlAgilityPack;

    class TagSanitizer
    {
        List<HtmlNode> _deleteNodes = new List<HtmlNode>();
    
        public static void Sanitize(HtmlNode node)
        {
            new TagSanitizer().Clean(node);
        }
    
        void Clean(HtmlNode node)
        {
            CleanRecursive(node);
            for (int i = _deleteNodes.Count - 1; i >= 0; i--)
            {
                HtmlNode nodeToDelete = _deleteNodes[i];
                nodeToDelete.ParentNode.RemoveChild(nodeToDelete, true);
            }
        }
    
        void CleanRecursive(HtmlNode node)
        {
            if (node.NodeType == HtmlNodeType.Element)
            {
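                // Config.TagsWhiteList is not shown in this answer; it is used here as a
                // Dictionary<string, string[]> of allowed tag -> allowed attributes (null = none).
                if (Config.TagsWhiteList.ContainsKey(node.Name) == false)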
                {
                    _deleteNodes.Add(node);
                }
                else if (node.HasAttributes)
                {
                    for (int i = node.Attributes.Count - 1; i >= 0; i--)
                    {
                        HtmlAttribute currentAttribute = node.Attributes[i];
    
                        string[] allowedAttributes = Config.TagsWhiteList[node.Name];
                        if (allowedAttributes != null)
                        {
                            if (allowedAttributes.Contains(currentAttribute.Name) == false)
                            {
                                node.Attributes.Remove(currentAttribute);
                            }
                        }
                        else
                        {
                            node.Attributes.Remove(currentAttribute);
                        }
                    }
                }
            }
    
            if (node.HasChildNodes)
            {
                node.ChildNodes.ToList().ForEach(v => CleanRecursive(v));
            }
        }
    }
    

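    【Comments】:

    • What is Config on this line? if (Config.TagsWhiteList.ContainsKey(node.Name) == false)
    • It's just another list that you can change however you like :)
    • As a side note, when I tried this I ran into inconsistent resulting markup (partly out of order, and not all formatting stripped correctly), possibly due to multithreaded optimization of the recursion.
    • Yes, this snippet does not support multitasking.
    • So far, this answer works for me. The accepted answer kept throwing "Object reference not set to an instance of an object" in the StripHtml method on the server I deployed it to. It proved too hard to debug, because it would not throw the error in my local environment.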
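For readers wondering about `Config` in Solution 1: the answer does not show it, but from the way it is used (`ContainsKey(node.Name)` and an indexer returning `string[]`) it appears to be a dictionary from tag name to allowed attribute names. A minimal sketch under that assumption; the class layout and the entries below are illustrative, not the answerer's actual configuration:

```csharp
using System.Collections.Generic;

// Hypothetical stand-in for the Config class referenced by TagSanitizer.
static class Config
{
    // Key: allowed tag name (HTML Agility Pack reports element names in lower case).
    // Value: attribute names allowed on that tag, or null when no attributes are allowed.
    public static readonly Dictionary<string, string[]> TagsWhiteList =
        new Dictionary<string, string[]>
        {
            { "b", null },
            { "a", new[] { "href" } },
        };
}
```

With something like this in place, `TagSanitizer.Sanitize(doc.DocumentNode)` can be called on a loaded `HtmlDocument`, and the cleaned markup read back from `doc.DocumentNode.OuterHtml`.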
    【Solution 2】:

    Hey, apparently I almost found the answer in someone's blog post....

    using System.Collections.Generic;
    using System.Linq;
    using HtmlAgilityPack;
    
    namespace Wayloop.Blog.Core.Markup
    {
        public static class HtmlSanitizer
        {
            private static readonly IDictionary<string, string[]> Whitelist;
    
            static HtmlSanitizer()
            {
                Whitelist = new Dictionary<string, string[]> {
                    { "a", new[] { "href" } },
                    { "strong", null },
                    { "em", null },
                    { "blockquote", null },
                    };
            }
    
            public static string Sanitize(string input)
            {
                var htmlDocument = new HtmlDocument();
    
                htmlDocument.LoadHtml(input);
                SanitizeNode(htmlDocument.DocumentNode);
    
                return htmlDocument.DocumentNode.WriteTo().Trim();
            }
    
            private static void SanitizeChildren(HtmlNode parentNode)
            {
                for (int i = parentNode.ChildNodes.Count - 1; i >= 0; i--) {
                    SanitizeNode(parentNode.ChildNodes[i]);
                }
            }
    
            private static void SanitizeNode(HtmlNode node)
            {
                if (node.NodeType == HtmlNodeType.Element) {
                    if (!Whitelist.ContainsKey(node.Name)) {
                        node.ParentNode.RemoveChild(node);
                        return;
                    }
    
                    if (node.HasAttributes) {
                        for (int i = node.Attributes.Count - 1; i >= 0; i--) {
                            HtmlAttribute currentAttribute = node.Attributes[i];
                            string[] allowedAttributes = Whitelist[node.Name];
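                            // A null whitelist entry means no attributes are allowed on this tag.
                            if (allowedAttributes == null || !allowedAttributes.Contains(currentAttribute.Name)) {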
                                node.Attributes.Remove(currentAttribute);
                            }
                        }
                    }
                }
    
                if (node.HasChildNodes) {
                    SanitizeChildren(node);
                }
            }
        }
    }
    

    I got HtmlSanitizer from here. Apparently it does not strip the tags; it removes the elements entirely.
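The difference between removing an element and stripping only its tag comes down to which `RemoveChild` overload is called. A small sketch, assuming a console program with the HtmlAgilityPack package referenced (the class name and sample markup are made up for illustration):

```csharp
using System;
using HtmlAgilityPack;

class RemoveChildDemo
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<b>keep <a href=\"#\">link text</a></b>");

        HtmlNode anchor = doc.DocumentNode.SelectSingleNode("//a");

        // RemoveChild(node) drops the <a> element together with everything inside it.
        // RemoveChild(node, true) drops only the element itself and re-attaches
        // its children to the parent, so the inner text survives.
        anchor.ParentNode.RemoveChild(anchor, true);

        Console.WriteLine(doc.DocumentNode.OuterHtml);
    }
}
```

With the two-argument overload the link text should remain inside the `<b>` element, which is the behaviour the question asks for.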

    OK, here is the solution for anyone who needs it later.

    public static class HtmlSanitizer
        {
            private static readonly IDictionary<string, string[]> Whitelist;
            private static List<string> DeletableNodesXpath = new List<string>();
    
            static HtmlSanitizer()
            {
                Whitelist = new Dictionary<string, string[]> {
                    { "a", new[] { "href" } },
                    { "strong", null },
                    { "em", null },
                    { "blockquote", null },
                    { "b", null},
                    { "p", null},
                    { "ul", null},
                    { "ol", null},
                    { "li", null},
                    { "div", new[] { "align" } },
                    { "strike", null},
                    { "u", null},                
                    { "sub", null},
                    { "sup", null},
                    { "table", null },
                    { "tr", null },
                    { "td", null },
                    { "th", null }
                    };
            }
    
            public static string Sanitize(string input)
            {
                if (input.Trim().Length < 1)
                    return string.Empty;
                var htmlDocument = new HtmlDocument();
    
                htmlDocument.LoadHtml(input);            
                SanitizeNode(htmlDocument.DocumentNode);
                string xPath = HtmlSanitizer.CreateXPath();
    
                return StripHtml(htmlDocument.DocumentNode.WriteTo().Trim(), xPath);
            }
    
            private static void SanitizeChildren(HtmlNode parentNode)
            {
                for (int i = parentNode.ChildNodes.Count - 1; i >= 0; i--)
                {
                    SanitizeNode(parentNode.ChildNodes[i]);
                }
            }
    
            private static void SanitizeNode(HtmlNode node)
            {
                if (node.NodeType == HtmlNodeType.Element)
                {
                    if (!Whitelist.ContainsKey(node.Name))
                    {
                        if (!DeletableNodesXpath.Contains(node.Name))
                        {                       
                            //DeletableNodesXpath.Add(node.Name.Replace("?",""));
                            node.Name = "removeableNode";
                            DeletableNodesXpath.Add(node.Name);
                        }
                        if (node.HasChildNodes)
                        {
                            SanitizeChildren(node);
                        }                  
    
                        return;
                    }
    
                    if (node.HasAttributes)
                    {
                        for (int i = node.Attributes.Count - 1; i >= 0; i--)
                        {
                            HtmlAttribute currentAttribute = node.Attributes[i];
                            string[] allowedAttributes = Whitelist[node.Name];
                            if (allowedAttributes != null)
                            {
                                if (!allowedAttributes.Contains(currentAttribute.Name))
                                {
                                    node.Attributes.Remove(currentAttribute);
                                }
                            }
                            else
                            {
                                node.Attributes.Remove(currentAttribute);
                            }
                        }
                    }
                }
    
                if (node.HasChildNodes)
                {
                    SanitizeChildren(node);
                }
            }
    
            private static string StripHtml(string html, string xPath)
            {
                HtmlDocument htmlDoc = new HtmlDocument();
                htmlDoc.LoadHtml(html);
                if (xPath.Length > 0)
                {
                    HtmlNodeCollection invalidNodes = htmlDoc.DocumentNode.SelectNodes(@xPath);
                    foreach (HtmlNode node in invalidNodes)
                    {
                        node.ParentNode.RemoveChild(node, true);
                    }
                }
                return htmlDoc.DocumentNode.WriteContentTo();
            }
    
            private static string CreateXPath()
            {
                string _xPath = string.Empty;
                for (int i = 0; i < DeletableNodesXpath.Count; i++)
                {
                    if (i != DeletableNodesXpath.Count - 1)
                    {
                        _xPath += string.Format("//{0}|", DeletableNodesXpath[i].ToString());
                    }
                    else _xPath += string.Format("//{0}", DeletableNodesXpath[i].ToString());
                }
                return _xPath;
            }
        }
    

    I renamed the nodes because, if I had to parse XML namespace nodes, it would crash during XPath parsing.

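    【Comments】:

    • The link to HtmlSanitizer is broken. This may be the code Meltdown was referring to: gist.github.com/814428
    • That is definitely not the code I based my whitelist validation class on. The original author did not use RegEx. The author's original code is the first block of code I posted.
    • This code does not work: I can easily save a form with a submit button, as well as a script section containing harmful code.
    • Note that DeletableNodesXpath will keep growing forever with the code above. It always adds "removeableNode" to the list and never finds a match (because it is checking a list full of "removeableNode").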
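The comments above point at two weak spots in Solution 2: the static `DeletableNodesXpath` list is shared across calls and only ever grows, and `SelectNodes` returns null when nothing matches, which would explain the null reference reported in `StripHtml`. A possible way around both is to skip the rename-and-XPath step and unwrap non-whitelisted elements directly. This is only a sketch: the class name, the whitelist entries, and the `DropWithContent` set are my own assumptions, not part of either answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical stateless variant of the sanitizer; no static mutable state.
public static class WhitelistSanitizer
{
    // Allowed tags mapped to their allowed attributes (null = no attributes allowed).
    private static readonly Dictionary<string, string[]> Whitelist =
        new Dictionary<string, string[]>(StringComparer.OrdinalIgnoreCase)
        {
            { "b", null },
            { "a", new[] { "href" } },
        };

    // Assumption: these elements are removed together with their content
    // instead of being unwrapped, so script or style text does not leak into the output.
    private static readonly HashSet<string> DropWithContent =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "script", "style" };

    public static string Sanitize(string input)
    {
        if (string.IsNullOrWhiteSpace(input))
            return string.Empty;

        var doc = new HtmlDocument();
        doc.LoadHtml(input);

        // Snapshot all elements first so the tree can be modified while looping.
        List<HtmlNode> elements = doc.DocumentNode.Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Element)
            .ToList();

        // Walk the snapshot backwards so descendants are handled before their ancestors.
        for (int i = elements.Count - 1; i >= 0; i--)
        {
            HtmlNode node = elements[i];

            if (DropWithContent.Contains(node.Name))
            {
                node.ParentNode.RemoveChild(node);        // remove element and content
                continue;
            }

            string[] allowed;
            if (!Whitelist.TryGetValue(node.Name, out allowed))
            {
                node.ParentNode.RemoveChild(node, true);  // remove tag, keep children
                continue;
            }

            // Whitelisted tag: strip every attribute that is not explicitly allowed.
            for (int j = node.Attributes.Count - 1; j >= 0; j--)
            {
                HtmlAttribute attribute = node.Attributes[j];
                if (allowed == null || !allowed.Contains(attribute.Name))
                    node.Attributes.Remove(attribute);
            }
        }

        return doc.DocumentNode.WriteTo().Trim();
    }
}
```

Usage would be `string clean = WhitelistSanitizer.Sanitize(html);`. Note that the sketch only filters attribute names; it does not inspect attribute values, so an `href` starting with `javascript:` would still pass and would need a separate check.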