【Question title】: Remove all HTML with Jsoup but keep the lines
【Posted】: 2012-12-08 22:25:13
【Question】:

I have a String that contains part of the content of an email, and I want to remove all HTML markup from it.

This is my code so far:

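import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Entities.EscapeMode;
import org.jsoup.safety.Cleaner;
import org.jsoup.safety.Whitelist;

public static String html2text(String html) {
   // Parse, then keep only the tags allowed by Whitelist.basic()
   Document document = Jsoup.parse(html);
   document = new Cleaner(Whitelist.basic()).clean(document);
   document.outputSettings().escapeMode(EscapeMode.xhtml);
   document.outputSettings().charset("UTF-8");
   html = document.body().html();

   // Drop the <br /> tags that survive the cleaning step
   html = html.replaceAll("<br />", "");

   // Keep only the text that follows the greeting, then put the greeting back
   String[] splittedStr = html.split("Geachte heer/mevrouw,");
   html = "Geachte heer/mevrouw," + splittedStr[1];

   return html;
}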

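This method removes all the HTML and keeps the lines and most of the layout. However, it also returns some &amp;nbsp; entities that are not removed completely. In the output below you can see that the String still contains some of these entities, and even fragments of them. How can I get rid of them?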

  Loonheffingen       &amp;n= bsp; Naam
 nr         in administratie         &amp;nbs= p;           meldingen
  nummer

 1          &amp;n= bsp;            = ;     0            &amp;= nbsp;           &amp;nbs= p;           1
           123456789L01
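A side note on the `split(...)[1]` step in `html2text`: it throws an `ArrayIndexOutOfBoundsException` when the greeting is missing from the mail, and it drops text if the greeting appears more than once. Below is a minimal sketch of a more defensive variant that uses only the JDK. The class name `GreetingTrim` and the method `fromGreeting` are hypothetical, and returning the input unchanged when the greeting is absent is an assumption about the desired behaviour.

```java
// Hypothetical helper, not part of the original post.
final class GreetingTrim {

    // Returns the text from the first occurrence of `marker` onward.
    // Assumption: if the marker is absent, the input is returned unchanged.
    static String fromGreeting(String text, String marker) {
        int start = text.indexOf(marker);
        return start < 0 ? text : text.substring(start);
    }

    public static void main(String[] args) {
        String body = "Header noise\nGeachte heer/mevrouw,\nDe afgekeurde meldingen ...";
        System.out.println(fromGreeting(body, "Geachte heer/mevrouw,"));
    }
}
```

Because `indexOf` treats the marker as plain text, this also avoids the regular-expression interpretation that `String.split` applies to its argument.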

Edit:

<span style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">De afgekeurde meldingen zijn opgenomen in de bijlage: Afgekeurde meldingen.</span><br style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">

<span style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">Wilt u zo spoedig mogelijk zorgdragen dat deze</span><br style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">
<span style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">meldingen gecorrigeerd worden aangeleverd?</span><br style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">
<span style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">mer</span><br style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">
<span style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">Volg &nbsp; &nbsp; Aantal verwerkt &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;Aantal afgekeurde</span><br style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">
<span style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">&nbsp;Loonheffingen &nbsp; &nbsp; &nbsp; &nbsp; Naam</span><br style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">
<span style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">nr &nbsp; &nbsp; &nbsp; &nbsp; in administratie &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; meldingen</span><br style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">
<span style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">&nbsp;nummer</span><br style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">
<br style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif"><span style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">1 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;0 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;1</span><br style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">

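This is part of the HTML I am trying to parse. I want to remove all the HTML but keep the layout of the original email.

Any help is appreciated,

Thanks!

Solved: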

        // `file` is assumed to be a String holding the markup
        Document xmlDoc = Jsoup.parse(file, "", Parser.xmlParser());
        Elements spans = xmlDoc.select("span");

        // Print the collected text of every span
        for (Element span : spans) {
            String text = textPlus(span);
            System.out.println(text);
        }


public static String textPlus(Element elem) {
    List<TextNode> textNodes = elem.textNodes();
    if (textNodes.isEmpty()) {
        return "";
    }

    StringBuilder result = new StringBuilder();
    // start at the first text node
    Node currentNode = textNodes.get(0);
    while (currentNode != null) {
        // append deep text of all subsequent nodes
        if (currentNode instanceof TextNode) {
            TextNode currentText = (TextNode) currentNode;
            result.append(currentText.text());
        } else if (currentNode instanceof Element) {
            Element currentElement = (Element) currentNode;
            result.append(currentElement.text());
        }
        currentNode = currentNode.nextSibling();
    }
    return result.toString();
}

The code was provided as an answer to this question.

【Question comments】:

    Tags: java html regex jsoup


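    【Answer 1】:

    Rather than doing this, you need to traverse the HTML structure returned by JSoup and collate the text nodes. That way you let JSoup determine what the real text is, and it will handle the entity encoding for you (e.g. &amp; -> & etc.).

    See this SO question for more information.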
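The traversal described in this answer could look roughly like the sketch below. It is an illustration under assumptions, not the answerer's code: the class `PlainTextSketch` and the method `toPlainText` are hypothetical names, the Jsoup library must be on the classpath, and mapping `<br>` to a newline and non-breaking spaces to plain spaces are choices made here to preserve the email's column layout.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeVisitor;

// Hypothetical helper class; requires the Jsoup library on the classpath.
final class PlainTextSketch {

    // Walks the parsed tree, collecting text nodes and turning <br> into '\n'.
    static String toPlainText(String html) {
        Document doc = Jsoup.parse(html);
        final StringBuilder out = new StringBuilder();
        doc.body().traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                if (node instanceof TextNode) {
                    TextNode text = (TextNode) node;
                    // Skip whitespace-only nodes such as the source line breaks between tags.
                    if (!text.isBlank()) {
                        // getWholeText() keeps the spacing; entities are already decoded.
                        out.append(text.getWholeText());
                    }
                } else if ("br".equals(node.nodeName())) {
                    out.append('\n');
                }
            }

            @Override
            public void tail(Node node, int depth) {
                // Nothing to do when leaving a node.
            }
        });
        // &nbsp; is decoded to U+00A0; map it to a plain space to keep the columns aligned.
        return out.toString().replace('\u00a0', ' ');
    }

    public static void main(String[] args) {
        String html = "<span>nr &nbsp; &nbsp; in administratie</span><br>\n"
                    + "<span>&nbsp;nummer</span><br>";
        System.out.print(toPlainText(html));
    }
}
```

`getWholeText()` is used instead of `text()` because `text()` normalises whitespace, which would collapse the runs of non-breaking spaces that form the table columns in the email.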

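    【Comments】:

    • Thanks for your answer! One small problem: I don't know which elements I should be searching for. I tried to get all the span elements, but nothing was returned. Take a look at my post; I have edited it to include part of the HTML I am trying to parse.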