【Question title】: Java Android XPath HTML parsing
【Posted】: 2011-11-04 09:46:06
【Question】:

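I have an application that needs to fetch an HTML page and extract some tags from it.

I need to get every tr and every td, and read their inner text.

Could you give me some code for this?

I have been working on this for hours...

The site's content is: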

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">    
<!-- Updated: 03/11/2011 15:17:29-->    
<html xmlns="http://www.w3.org/1999/xhtml" >    
<head><title>    
    Untitled Page    
</title><meta http-equiv="Page-Exit" content="progid:DXImageTransform.Microsoft.GradientWipe(duration=1)" /><meta HTTP-EQUIV="CACHE-CONTROL" content="NO-CACHE" /><meta HTTP-EQUIV="PRAGMA" content="NO-CACHE" /><meta http-equiv="refresh" content="60" />    
    <style type="text/css">                
    .DisplayTable { width: 97%; }    
    .DisplayHeader { font-family: Arial; font-weight: bold; font-size: 25px; color: Black; text-align: center; }    
    .DisplayCell { font-family: Arial; font-weight: bold; font-size: 16px; color: Black; }                
    .MessageTable { width: 97%; }    
    .MessageHeader { font-family: Arial; font-size: 20px; color: SteelBlue; border-bottom: solid 3px SteelBlue; }    
    .MessageText { font-family: Arial; font-size: 20px; color: SteelBlue; text-align: right; }                
    .DisplayFillChange { font-family: Arial; font-weight: bold; font-size: 16px; color: MediumBlue; background-color: LightCyan; border-bottom: solid 1px LightCyan; }    
    .DisplayFreeChange { font-family: Arial; font-weight: bold; font-size: 16px; color: OrangeRed; background-color: LightCyan; border-bottom: solid 1px LightCyan; }    
    .DisplayEventChange { font-family: Arial; font-weight: bold; font-size: 16px; color: DarkGreen; background-color: LightCyan; border-bottom: solid 1px LightCyan; }    
    .DisplayExamChange { font-family: Arial; font-weight: bold; font-size: 16px; color: IndianRed; background-color: LightCyan; border-bottom: solid 1px LightCyan; }                
    </style>    
</head>    
<body dir="rtl" style="margin: 0px; background-color: LightCyan; overflow: hidden;" scroll="no" onload="resize()">    
    <form name="form1" method="post" action="MainScreen.aspx?pid=17&amp;mid=6264&amp;page=5&amp;msgof=0&amp;static=1" id="form1">    
<div>    
<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPDwUJLTQwMjA0MzQzZGSqqj0xDnBRKxIgowwhNZzzyzQHVg==" />    
</div>            
        <table width="100%" cellspacing="0" cellpadding="0" border="0" style="background-image: url(fill.gif);">    
            <tr height="59" style="font-family: Arial; font-size: 34px; color: Yellow; vertical-align: middle;">    
                <td width="15">&nbsp;</td>    
                <td width="45%" align="right" id="clock">00:00</td>    
                <td align="center" nowrap><b>שינוי מערכת שעות לתאריך                        </b></td>    
                <td width="45%" align="left">04.11.2011</td>    
                <td width="15">&nbsp;</td>    
            </tr>    
        </table>    
        <br />    
        <div id="header" align="center"><table width='100%' class='DisplayTable' cellspacing='0' border='1'><tr class='DisplayHeader'><td width='1%' style='color: LightCyan;'>0</td><td width='14%'>יא - 1</td><td width='14%'>יא - 2</td><td width='14%'>יא - 3</td><td width='14%'>יא - 4</td><td width='14%'>יא - 5</td><td width='14%'>יא - 6</td><td width='14%'>יא - 7</td><td width='1%' style='color: LightCyan;'>0</td></tr></table></div>    
        <div id="scrollPanel" align="center" style="overflow: hidden;">    
            <div id="panel" align="center" style=""><table width='100%' class='DisplayTable' cellspacing='0' border='1'><tr><td width='1%' class='DisplayCell'>0</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='1%' class='DisplayCell'>0</td></tr><tr><td width='1%' class='DisplayCell'>1</td><td width='14%' class='DisplayCell'><table width='100%'></table></td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='1%' class='DisplayCell'>1</td></tr><tr><td width='1%' class='DisplayCell'>2</td><td width='14%' class='DisplayCell'><table width='100%'></table></td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='1%' class='DisplayCell'>2</td></tr><tr><td width='1%' class='DisplayCell'>3</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='1%' class='DisplayCell'>3</td></tr><tr><td width='1%' class='DisplayCell'>4</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' 
class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='1%' class='DisplayCell'>4</td></tr><tr><td width='1%' class='DisplayCell'>5</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='1%' class='DisplayCell'>5</td></tr><tr><td width='1%' class='DisplayCell'>6</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='1%' class='DisplayCell'>6</td></tr><tr><td width='1%' class='DisplayCell'>7</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='1%' class='DisplayCell'>7</td></tr><tr><td width='1%' class='DisplayCell'>8</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='1%' class='DisplayCell'>8</td></tr><tr><td width='1%' class='DisplayCell'>9</td><td 
width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='1%' class='DisplayCell'>9</td></tr></table></div>    
            <div id="messages" align="center"><table width='100%' class='MessageTable' cellspacing='0' cellpadding='7' border='0'><tr><td class='MessageHeader'>הודעות</td></tr></tr></table></div>    
        </div>    
    </form>    
    <script>                
    var sp;    
    var delay = 0;                
    function resize(){    
        sp = document.getElementById('scrollPanel');    
        sp.style.height = document.documentElement.clientHeight - sp.offsetTop;            
        delay = document.getElementById('panel').clientHeight - document.getElementById('scrollPanel').clientHeight;    
        if (delay > 0)    
            delay = delay / 5 * 120;    
        else    
            delay = 0;                    
        setTimeout("doScroll()", 3000);    
        setTimeout("doNextPage()", 500);    
    }                
    function doScroll()    
    {    
        sp.scrollTop += 5;    
        setTimeout("doScroll()", 100);    
    }                
    updateClock();    
    function nextUrl()    
    {    
        return 'MainScreen.aspx?pid=17&mid=6264&page=6&msgof=0&nd=0';    
    }                
    function doNextPage()    
    {                    
    }                
    function updateClock()    
    {    
        document.getElementById('clock').innerHTML = getClock();    
        setTimeout("updateClock()", 55000)    
    }
    function getClock()    
    {    
        var date = new Date();    
        var hours = date.getHours();    
        var minutes = date.getMinutes();                    
        if (hours < 10)    
            hours = '0' + hours;                        
        if (minutes < 10)    
            minutes = '0' + minutes;            
        return hours + ':' + minutes;    
    }    
    </script>    
</body>    
</html>

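【Discussion】:

    Tags: java android html xpath html-table


    【Solution 1】:

    The easiest way is to use an HTML parsing library such as HTMLCleaner, TagSoup, HTML Parser, etc. With one of these you can simply fetch all the elements you need from the document, or iterate over it manually with a "node visitor" (or whatever the library calls it).

    A quick look at the documentation of a library picked at random from the list above suggests that something like the following should work with HTMLCleaner: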

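    import org.htmlcleaner.HtmlCleaner;
    import org.htmlcleaner.TagNode;

    HtmlCleaner cleaner = new HtmlCleaner();
    // clean(...) takes the HTML source; the argument is left elided as in the original answer
    TagNode root = cleaner.clean(...);
    // the second argument asks for a recursive search through the whole tree
    TagNode[] trNodes = root.getElementsByName("tr", true);
    for (TagNode trNode : trNodes) {
        System.out.println("All text inside this <tr> tag (including children): " + trNode.getText());
    }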
    

    An example using the same library, this time with a TagNodeVisitor and filtering on <td>:

    // root is the TagNode returned by cleaner.clean(...)
    root.traverse(new TagNodeVisitor() {
        public boolean visit(TagNode parentNode, HtmlNode htmlNode) {
            // the visitor also sees text and comment nodes, so keep only tags
            if (htmlNode instanceof TagNode) {
                TagNode tag = (TagNode) htmlNode;
                String tagName = tag.getName();
                if ("td".equals(tagName)) {
                    System.out.println("All text inside this <td> tag (including children): " + tag.getText());
                }
            }
            // tells visitor to continue traversing the DOM tree
            return true;
        }
    });
    

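    【Comments】:

    • Well, that is nice, but the problem is that I only need one column of the table. I mean I need to go into each tr and take, for example, the 2nd td. A friend told me XPath is great for this because I can select a td by index. That is why I asked for an XPath solution. So if you can give me XPath code, or tell me that your approach can also work with an index, I would be happy. Thanks!!!
    • I suppose you can simply call getChildTags() inside the loop to get the second <td> within each <tr>: for (TagNode trNode : trNodes) { TagNode tdNode = trNode.getChildTags()[1]; } Alternatively, XPath evaluation also appears to be (partially) supported via evaluateXPath(String). See the TagNode documentation. Note that this assumes the second child is a <td>; you may want to check that explicitly. //Edit: sorry, the formatting seems to be off.
    • Well, I looked into it, and now I have another problem... I did manage to get all the td elements by index, but the particular td I need actually contains a TABLE (very silly, but my app relies on an existing website). This is what it looks like in that case: <td width='14%' class='DisplayCell'><table width='100%'><tr><td class='DisplayFillChange'>Some Text Here...</td></tr></table></td> I could not write code that gets the text. In debug mode I tried stepping into these tags, but in the end I could not get the text, as if it did not exist...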
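
On the index question and the nested-table case from the comments above, here is a minimal sketch that uses only the standard javax.xml.xpath API rather than HtmlCleaner's own XPath support. It assumes the page has already been turned into well-formed XML, for example by a cleaner library; the raw page in the question would not parse as-is because of things like the valueless nowrap attribute and the &nbsp; entities. The class name, the helper cellTexts and the sample markup are made up for illustration. If your cleaner inserts <tbody> elements, the path would need that extra step.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class SecondColumnXPath {

    // Hypothetical helper: evaluates an XPath expression against well-formed XML
    // and returns the trimmed text of every matching node.
    static List<String> cellTexts(String wellFormedXml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wellFormedXml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                // getTextContent() concatenates the text of all descendants,
                // so text inside a nested <table> is included as well.
                result.add(nodes.item(i).getTextContent().trim());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse or query the document", e);
        }
    }

    public static void main(String[] args) {
        // Simplified stand-in for the "panel" table from the question (an assumption).
        String xml = "<div id='panel'><table>"
                + "<tr><td>0</td><td>plain</td><td>x</td></tr>"
                + "<tr><td>1</td><td><table><tr>"
                + "<td class='DisplayFillChange'>Some Text Here...</td>"
                + "</tr></table></td><td>y</td></tr>"
                + "</table></div>";
        // table/tr uses child steps, so only rows of the outer table match;
        // td[2] is the second cell of each row (XPath positions start at 1).
        String secondColumn = "//div[@id='panel']/table/tr/td[2]";
        System.out.println(cellTexts(xml, secondColumn));
        // prints [plain, Some Text Here...]
    }
}
```

Using child steps (table/tr/td[2]) instead of //tr/td[2] keeps rows of the inner table out of the result, while getTextContent() still returns the text nested inside the selected cell.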