【Question Title】: Extracting comments from HTML using Jsoup
【Posted】: 2014-12-12 18:20:36
【Question Description】:

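Given the HTML source page below, I am trying to extract its comments. For example, the first comment on this page is "Generated by the JDiff Javadoc doclet". I want to extract that comment along with every other comment in the document.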

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<HTML style="overflow:auto;">
<HEAD>
<meta name="generator" content="JDiff v1.1.0">
<!-- Generated by the JDiff Javadoc doclet -->
<!-- (http://www.jdiff.org) -->
<meta name="description" content="JDiff is a Javadoc doclet which generates an HTML report of all the packages, classes, constructors, methods, and fields which have been removed, added or changed in any way, including their documentation, when two APIs are compared.">
<meta name="keywords" content="diff, jdiff, javadiff, java diff, java difference, API difference, difference between two APIs, API diff, Javadoc, doclet">
<TITLE>
All Removals Index
</TITLE>
<link href="../../../../assets/android-developer-docs.css" rel="stylesheet" type="text/css" />
<link href="../stylesheet-jdiff.css" rel="stylesheet" type="text/css" />
<noscript>
<style type="text/css">
body{overflow:auto;}
#body-content{position:relative; top:0;}
#doc-content{overflow:visible;border-left:3px solid #666;}
#side-nav{padding:0;}
#side-nav .toggle-list ul {display:block;}
#resize-packages-nav{border-bottom:3px solid #666;}
</style>
</noscript>
<style type="text/css">
</style>
</HEAD>
<BODY class="gc-documentation" style="padding:12px;">
<a NAME="topheader"></a>
<table summary="Index for All Differences" width="100%" class="jdiffIndex" border="0" cellspacing="0" cellpadding="0" style="padding-bottom:0;margin-bottom:0;">
  <tr>
  <th class="indexHeader">
    Filter the Index:
  </th>
  </tr>
  <tr>
  <td class="indexText" style="line-height:1.3em;padding-left:2em;">
<a href="alldiffs_index_all.html" xclass="hiddenlink">All Differences</a>
  <br>
<b>Removals</b>
  <br>
<A HREF="alldiffs_index_additions.html"xclass="hiddenlink">Additions</A>
  <br>
<A HREF="alldiffs_index_changes.html"xclass="hiddenlink">Changes</A>
  </td>
  </tr>
</table>
<div id="indexTableCaption" style="background-color:#eee;padding:0 4px 0 4px;font-size:11px;margin-bottom:.5em;">
Listed as: <span style="color:#069"><strong>Added</strong></span>,  <span style="color:#069"><strike>Removed</strike></span>,  <span style="color:#069">Changed</span></font>
</div>
<!-- Field CATEGORY_GADGET -->
<A NAME="C"></A>
<br><font size="+2">C</font>&nbsp;
<a href="#D"><font size="-2">D</font></a> 
<a href="#F"><font size="-2">F</font></a> 
<a href="#N"><font size="-2">N</font></a> 
<a href="#S"><font size="-2">S</font></a> 
 <a href="#topheader"><font size="-2">TOP</font></a>
<p><div style="line-height:1.5em;color:black">
<nobr><A HREF="android.content.Intent.html#android.content.Intent.CATEGORY_GADGET" class="hiddenlink" target="rightframe"><strike>CATEGORY_GADGET</strike></A>
</nobr><br>
<!-- Method dragViewToBottom -->
<A NAME="D"></A>
<br><font size="+2">D</font>&nbsp;
<a href="#C"><font size="-2">C</font></a> 
<a href="#F"><font size="-2">F</font></a> 
<a href="#N"><font size="-2">N</font></a> 
<a href="#S"><font size="-2">S</font></a> 
 <a href="#topheader"><font size="-2">TOP</font></a>
<p><div style="line-height:1.5em;color:black">
<nobr><A HREF="android.test.TouchUtils.html#android.test.TouchUtils.dragViewToBottom_removed(android.test.ActivityInstrumentationTestCase, android.view.View, int)" class="hiddenlink" target="rightframe"><strike>dragViewToBottom</strike>
(<code>ActivityInstrumentationTestCase, View, int</code>)</A></nobr><br>
<!-- Method forkAndSpecialize -->
<A NAME="F"></A>
<br><font size="+2">F</font>&nbsp;
<a href="#C"><font size="-2">C</font></a> 
<a href="#D"><font size="-2">D</font></a> 
<a href="#N"><font size="-2">N</font></a> 
<a href="#S"><font size="-2">S</font></a> 
 <a href="#topheader"><font size="-2">TOP</font></a>
<p><div style="line-height:1.5em;color:black">
<nobr><A HREF="dalvik.system.Zygote.html#dalvik.system.Zygote.forkAndSpecialize_removed(int, int, int[], boolean, int[][])" class="hiddenlink" target="rightframe"><strike>forkAndSpecialize</strike>
(<code>int, int, int[], boolean, int[][]</code>)</A></nobr><br>
<!-- Method forkSystemServer -->
<nobr><A HREF="dalvik.system.Zygote.html#dalvik.system.Zygote.forkSystemServer_removed(int, int, int[], boolean, int[][])" class="hiddenlink" target="rightframe"><strike>forkSystemServer</strike>
(<code>int, int, int[], boolean, int[][]</code>)</A></nobr><br>
<!-- Constructor NetworkInfo -->
<A NAME="N"></A>
<br><font size="+2">N</font>&nbsp;
<a href="#C"><font size="-2">C</font></a> 
<a href="#D"><font size="-2">D</font></a> 
<a href="#F"><font size="-2">F</font></a> 
<a href="#S"><font size="-2">S</font></a> 
 <a href="#topheader"><font size="-2">TOP</font></a>
<p><div style="line-height:1.5em;color:black">
<nobr><A HREF="android.net.NetworkInfo.html#android.net.NetworkInfo.ctor_removed(int)" class="hiddenlink" target="rightframe"><strike>NetworkInfo</strike>
(<code>int</code>)</A></nobr>&nbsp;constructor<br>
<!-- Method setButton -->
<A NAME="S"></A>
<br><font size="+2">S</font>&nbsp;
<a href="#C"><font size="-2">C</font></a> 
<a href="#D"><font size="-2">D</font></a> 
<a href="#F"><font size="-2">F</font></a> 
<a href="#N"><font size="-2">N</font></a> 
 <a href="#topheader"><font size="-2">TOP</font></a>
<p><div style="line-height:1.5em;color:black">
<i>setButton</i><br>
&nbsp;&nbsp;<nobr><A HREF="android.app.AlertDialog.html#android.app.AlertDialog.setButton_removed(java.lang.CharSequence, android.content.DialogInterface.OnClickListener)" class="hiddenlink" target="rightframe">type&nbsp;<strike>
(<code>CharSequence, OnClickListener</code>)</strike>&nbsp;in&nbsp;android.app.AlertDialog
</A></nobr><br>
<!-- Method setButton -->
&nbsp;&nbsp;<nobr><A HREF="android.app.AlertDialog.html#android.app.AlertDialog.setButton_removed(java.lang.CharSequence, android.os.Message)" class="hiddenlink" target="rightframe">type&nbsp;<strike>
(<code>CharSequence, Message</code>)</strike>&nbsp;in&nbsp;android.app.AlertDialog
</A></nobr><br>
<script src="//www.google-analytics.com/ga.js" type="text/javascript">
</script>
<script type="text/javascript">
  try {
    var pageTracker = _gat._getTracker("UA-5831155-1");
    pageTracker._setAllowAnchor(true);
    pageTracker._initData();
    pageTracker._trackPageview();
  } catch(e) {}
</script>
</BODY>
</HTML>

【Question Discussion】:

    Tags: java html jsoup


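    【Solution 1】:

    I found a way to remove comments using Jsoup: https://gist.github.com/jhy/491407

    If you look at that code, you can adapt it into a method that extracts comments instead of removing them. I tried implementing this and came up with the following: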

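    import java.util.ArrayList;
    import java.util.List;
    import org.jsoup.nodes.Comment;
    import org.jsoup.nodes.Node;

    // Recursively collects every comment node beneath the given node, in document order.
    private List<Comment> getComments(Node node) {
        List<Comment> comments = new ArrayList<Comment>();
        for (Node child : node.childNodes()) {
            if (child instanceof Comment) {
                // Comment nodes report "#comment" as their node name; instanceof makes the cast safe.
                comments.add((Comment) child);
            } else {
                // Any other node may contain comments further down the tree.
                comments.addAll(getComments(child));
            }
        }
        return comments;
    }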
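
The recursion above walks the tree by hand. As an alternative sketch (not the approach from the gist or the answer), Jsoup's own `Node.traverse` with a `NodeVisitor` can do the walking. The class name `CommentVisitorExample` and the method `collectComments` are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Comment;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.select.NodeVisitor;

public class CommentVisitorExample {

    // Hypothetical helper: gathers all comment nodes using Jsoup's depth-first traversal.
    static List<Comment> collectComments(Node root) {
        final List<Comment> found = new ArrayList<Comment>();
        root.traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                // head() is called when a node is first reached, so results stay in document order.
                if (node instanceof Comment) {
                    found.add((Comment) node);
                }
            }

            @Override
            public void tail(Node node, int depth) {
                // Nothing to do when leaving a node.
            }
        });
        return found;
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(
                "<html><head><!-- first --></head><body><p>text</p><!-- second --></body></html>");
        for (Comment comment : collectComments(doc)) {
            System.out.println(comment.getData().trim());
        }
    }
}
```

Both versions visit the same nodes; the visitor form simply avoids writing the recursion yourself.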
    

    Example usage:

    String page = "...."; // the HTML source of your page
    Document doc = Jsoup.parse(page);
    List<Comment> comments = getComments(doc);
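
To get at the text of each comment rather than the node itself, `Comment.getData()` returns the content between `<!--` and `-->`. The following is a minimal, self-contained sketch of the same recursive idea that returns trimmed comment text. The class name `CommentExtractor`, the method `commentTexts`, and the shortened HTML string are assumptions made for illustration; the HTML is a cut-down stand-in for the page in the question.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Comment;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;

public class CommentExtractor {

    // Recursively collects the trimmed text of every comment beneath the node.
    static List<String> commentTexts(Node node) {
        List<String> texts = new ArrayList<String>();
        for (Node child : node.childNodes()) {
            if (child instanceof Comment) {
                texts.add(((Comment) child).getData().trim());
            } else {
                texts.addAll(commentTexts(child));
            }
        }
        return texts;
    }

    public static void main(String[] args) {
        // Shortened, hypothetical stand-in for the page shown in the question.
        String page = "<html><head>"
                + "<!-- Generated by the JDiff Javadoc doclet -->"
                + "<!-- (http://www.jdiff.org) -->"
                + "</head><body>"
                + "<!-- Field CATEGORY_GADGET -->"
                + "<a name=\"C\"></a>"
                + "</body></html>";
        Document doc = Jsoup.parse(page);
        for (String text : commentTexts(doc)) {
            System.out.println(text);
        }
    }
}
```

Running this requires the Jsoup jar on the classpath.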
    

    【Discussion】:
