【Question Title】: How to remove all tags except whitelisted tags with jsoup
【Posted】: 2018-04-24 21:00:55
【Question Description】:
String html = "<video width='320' height='240' controls autoplay> <source src='movie.ogg' type='video/ogg'> <source src='movie.mp4' type='video/mp4'> <object data='movie.mp4' width='320' height='240'> <embed width='320' height='240' src='movie.swf'> </object></video><canvas id='myCanvas' width='200' height='100' style='border:1px solid #000000;'>Your browser does not support the HTML5 canvas tag.</canvas><article> <header> <h1>Internet Explorer 9</h1> <p><time pubdate datetime='2011-03-15'></time></p> </header> <p>Windows Internet Explorer 9 (abbreviated as IE9) was released to the public on March 14, 2011 at 21:00 PDT.....</p></article><footer> <p>Posted by: Hege Refsnes</p> <p>Contact information: <a href='mailto:someone@example.com'> someone@example.com</a>.</p></footer> <nav> <a href='/html/'>HTML</a> | <a href='/css/'>CSS</a> | <a href='/js/'>JavaScript</a> | <a href='/jquery/'>jQuery</a></nav> <section> <h1>WWF</h1> <p>The World Wide Fund for Nature (WWF) is....</p></section><datalist id='browsers'> <option value='Internet Explorer'> <option value='Firefox'> <option value='Chrome'> <option value='Opera'> <option value='Safari'></datalist> <audio controls> <source src='horse.ogg' type='audio/ogg'> <source src='horse.mp3' type='audio/mpeg'>Your browser does not support the audio element.</audio> <progress value='22' max='100'>teasdklfjashdfjkl</progress> ";
        String toDoRemoveTAG = "style,img,script,noscript,hr,input";
        String allowTagList = "p,span,b,i,u,div,br,a";
        Document doc = Jsoup.parse(html);
        Elements els = doc.select(toDoRemoveTAG);
        for (Element e : els)
        {
            e.remove();
        }

        Whitelist whitelist = new Whitelist();
        whitelist.addTags(allowTagList.split(","));
        whitelist.addAttributes("a", "href");
        Cleaner cleaner = new Cleaner(whitelist);
        doc = cleaner.clean(doc);

        System.out.println(doc.select("body").html());

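I use the program above to allow only whitelisted tags and remove everything else, including the text inside the removed tags. Is there an API or out-of-the-box (OOTB) solution that does the same thing, where I only pass the whitelisted tags and the function removes all other tags?

I don't want to do this manually, as I did above: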

Elements els = doc.select(toDoRemoveTAG);
for (Element e : els)
{
  e.remove();
}

【Question Discussion】:

    Tags: java html parsing jsoup


    【Solution 1】:

    You can use the jsoup HTML Cleaner with a configuration specified by a Whitelist.

    String unsafe =  "<p><a href='http://example.com/' onclick='stealCookies()'>Link</a></p>";
    String safe = Jsoup.clean(unsafe, Whitelist.basic());
    // now: <p><a href="http://example.com/" rel="nofollow">Link</a></p>
    

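    【Discussion】:

    • The cleaner only unwraps tags that are not listed in the whitelist: the tag is dropped but its content is kept. I need the same functionality, except that the entire enclosing tag should be removed, without keeping the text from that tag.
    【Solution 2】:

    Can't we negate toDoRemoveTAG, build a whitelist from the result, and then clean? That is, collect all tags from the document, then build the whitelist by excluding every tag and attribute listed in toDoRemoveTAG.

    I mean something like this.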

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Map.Entry;
    import java.util.Set;
    
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Attribute;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.safety.Cleaner;
    import org.jsoup.safety.Whitelist;
    import org.jsoup.select.Collector;
    import org.jsoup.select.Evaluator;
    
    public class MatrixMultiplication {
    
        public static void main(String[] args) throws Exception {
    
            String html = "<video width='320' height='240' controls autoplay> <source src='movie.ogg' type='video/ogg'> "
                    + "<source src='movie.mp4' type='video/mp4'> <object data='movie.mp4' width='320' height='240'> "
                    + "<embed width='320' height='240' src='movie.swf'> </object></video>"
                    + "<canvas id='myCanvas' width='200' height='100' style='border:1px solid #000000;'>"
                    + "Your browser does not support the HTML5 canvas tag.</canvas><article> <header> "
                    + "<h1>Internet Explorer 9</h1> <p><time pubdate datetime='2011-03-15'></time></p> "
                    + "</header> <p>Windows Internet Explorer 9 (abbreviated as IE9) was released to the public on March 14, 2011 at 21:00 PDT.....</p>"
                    + "</article><footer> <p>Posted by: Hege Refsnes</p> <p>Contact information: <a href='mailto:someone@example.com'> someone@example.com</a>.</p>"
                    + "</footer> <nav> <a href='/html/'>HTML</a> | <a href='/css/'>CSS</a> | <a href='/js/'>JavaScript</a> | "
                    + "<a href='/jquery/'>jQuery</a></nav> <section> <h1>WWF</h1> <p>The World Wide Fund for Nature (WWF) is....</p></section><datalist id='browsers'>"
                    + " <option value='Internet Explorer'> <option value='Firefox'> <option value='Chrome'> <option value='Opera'> <option value='Safari'></datalist>"
                    + " <audio controls> <source src='horse.ogg' type='audio/ogg'> <source src='horse.mp3' type='audio/mpeg'>Your browser does not support the audio element.</audio>"
                    + " <progress value='22' max='100'>teasdklfjashdfjkl</progress> ";
    
            String toDoRemoveTAG = "style,img,script,noscript,hr,input";
            String allowTagList = "p,span,b,i,u,div,br,a";
            Document doc = Jsoup.parse(html);
    
            Whitelist whitelist = buildWhiteList(doc, Arrays.asList(toDoRemoveTAG.toUpperCase().split(",")));
            Cleaner cleaner = new Cleaner(whitelist);
            doc = cleaner.clean(doc);
            System.out.println(doc.select("body").html());
        }
    
        private static Whitelist buildWhiteList(Document doc, List<String> toDoRemoveTAG) {
            Whitelist whitelist = new Whitelist();
            Set<String> allowedTags = new HashSet<String>();
            Map<String, Set<String>> allowedAttributes = new HashMap<String, Set<String>>();
    
            // Walk every element in the parsed document (same set the Collector call returned).
            for(Element e : doc.getAllElements()){
    
                // Keep any tag that is not on the removal list.
                if(!toDoRemoveTAG.contains(e.tagName().toUpperCase())){
                    allowedTags.add(e.tagName());
                    for(Attribute attr : e.attributes()){
                        // Keep any attribute whose name is not on the removal list (e.g. "style").
                        if(!toDoRemoveTAG.contains(attr.getKey().toUpperCase())){
                            allowedAttributes.computeIfAbsent(e.tagName(), k -> new HashSet<String>()).add(attr.getKey());
                        }
                    }
                }
            }
            whitelist.addTags(allowedTags.toArray(new String[allowedTags.size()]));
            for(Entry<String, Set<String>> e :  allowedAttributes.entrySet()){
                whitelist.addAttributes(e.getKey(), e.getValue().toArray(new String[e.getValue().size()]));
            }
            return whitelist;
        }
    
    }
    

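    【Discussion】:

    • @Saym Thanks for the detailed reply. You use toDoRemoveTAG to remove tags and allow the others, but I don't want to pass toDoRemoveTAG: the list would be very long, and I can't even collect all the tags I want to remove. I want the opposite behavior, where I pass only allowTagList and all other tags are removed. I can use your code and change the condition to make it work, though.
    • @HybrisHelp Following the comment above, were you able to work out the allowTagList implementation for this answer?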
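
The allow-list variant requested in the comments above could be sketched roughly as follows. This is a minimal sketch, not the approach of either answer: the class name `AllowListStripper` and the method `keepOnly` are hypothetical names introduced for illustration, and the sample HTML in `main` is made up. It assumes the goal is to drop every element whose tag is not in the allow list together with all of its text and children, and it relies only on `Jsoup.parseBodyFragment`, `Element.getAllElements`, `Element.tagName`, and `Element.remove`.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, named for illustration only.
public class AllowListStripper {

    // Removes every element whose tag is not in allowedTags,
    // together with its text and descendants, and returns the remaining body HTML.
    public static String keepOnly(String html, Set<String> allowedTags) {
        Document doc = Jsoup.parseBodyFragment(html);
        Element body = doc.body();

        // getAllElements() returns a separate list (including body itself),
        // so removing nodes from the tree while iterating over it is safe.
        for (Element e : body.getAllElements()) {
            if (e == body) {
                continue; // never remove the container we return
            }
            String tag = e.tagName().toLowerCase(Locale.ROOT);
            if (!allowedTags.contains(tag)) {
                e.remove(); // drops the element and everything inside it
            }
        }
        return body.html();
    }

    public static void main(String[] args) {
        // Same allow list as in the question.
        Set<String> allowed = new HashSet<>(Arrays.asList("p,span,b,i,u,div,br,a".split(",")));

        // Made-up sample input.
        String html = "<div><p>Keep <b>this</b></p><h1>Drop heading</h1></div>"
                + "<section><p>Gone too</p></section>";

        System.out.println(keepOnly(html, allowed));
    }
}
```

Note that an allowed tag nested inside a disallowed one (the `<p>` inside `<section>` above) is removed along with its parent, which matches the behavior asked for in the question. This sketch does not filter attributes; a `Cleaner` pass with the same tag list, as in the question's code, could be run afterwards for that.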