【Question Title】: Exact XPATH location within Java
【Posted】: 2013-05-02 06:57:23
【Question】:

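I am trying to work out the exact XPath query expressions so that I can use RapidMiner to data-mine a site. I need a separate query that isolates each of the following lines: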

Wed 7/11/2012

TROLL

9999999999999

07.11.12

CONNOTE FILE LODGED

Tue 20/11/2012 1:12 PM

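So far I only have //td[@class='select']/text()

Note: the values will change, so the queries need to be position-specific.

What are the six separate queries, one for each value?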

        <tr>
          <td class="select" onClick="javascript:window.location.href = 'consignmentDetails.do;jsessionid=7e6a45cbddf07ecba7741e5020b4bfe76e53b8f5df9ea83eaf2040b991792d25.e3iMc3eQax8Re34Qb3aKbNmOch90?consignment=1388730000024&recordCreatedBy=FIMS&groupId=';" onMouseOver="backColorChange(this,'FFFFCC')" onMouseOut="backColorChange(this,'ffffff')">
            Wed 7/11/2012<br>
            TROLL&nbsp;
            
          </td>
          <td class="select" align="center" onClick="javascript:window.location.href = 'consignmentDetails.do;jsessionid=7e6a45cbddf07ecba7741e5020b4bfe76e53b8f5df9ea83eaf2040b991792d25.e3iMc3eQax8Re34Qb3aKbNmOch90?consignment=1388730000024&recordCreatedBy=FIMS&groupId=';" onMouseOver="backColorChange(this,'FFFFCC')" onMouseOut="backColorChange(this,'ffffff')">
            9999999999999
            <br>07.11.12
            
            &nbsp;
          </td>
          <td class="select" onClick="javascript:window.location.href = 'consignmentDetails.do;jsessionid=7e6a45cbddf07ecba7741e5020b4bfe76e53b8f5df9ea83eaf2040b991792d25.e3iMc3eQax8Re34Qb3aKbNmOch90?consignment=1388730000024&recordCreatedBy=FIMS&groupId=';" onMouseOver="backColorChange(this,'FFFFCC')" onMouseOut="backColorChange(this,'ffffff')">
             
              
              
                      
                CONNOTE FILE LODGED <br>
                Tue 20/11/2012 1:12 PM
              &nbsp;
            
            
            
&nbsp;
          </td>
          
        </tr>
      
    </table>

【Question Discussion】:

    Tags: xpath rapidminer


    【Solution 1】:

    Tested with the Ruby library Nokogiri (which sits on top of libxml2 and implements XPath 1.0):

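    require 'nokogiri'
    
    # `html` is assumed to hold the markup from the question as a String.
    d = Nokogiri.HTML(html)
    
    # One expression per value: the Nth text node of each <td> in the row.
    XPATHS = %w{
      //tr/td[1]/text()[1]
      //tr/td[1]/text()[2]
      //tr/td[2]/text()[1]
      //tr/td[2]/text()[2]
      //tr/td[3]/text()[1]
      //tr/td[3]/text()[2]
    }
    
    # Print the raw content of the first node matched by each expression.
    XPATHS.each{ |expression| p d.at_xpath(expression).content }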
    #=> "\n            Wed 7/11/2012"
    #=> "\n            TROLL\u00A0\n\n          "
    #=> "\n            9999999999999\n            "
    #=> "07.11.12\n\n            \u00A0\n          "
    #=> "\n\n\n\n\n                CONNOTE FILE LODGED "
    #=> "\n                Tue 20/11/2012 1:12 PM\n              \u00A0\n\n\n\n\u00A0\n          "
    

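    As you can see, the text nodes contain a lot of extra leading and trailing whitespace that you will probably want to remove. We can use normalize-space to strip it: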

    XPATHS = %w{
      normalize-space(//tr/td[1]/text()[1])
      normalize-space(//tr/td[1]/text()[2])
      normalize-space(//tr/td[2]/text()[1])
      normalize-space(//tr/td[2]/text()[2])
      normalize-space(//tr/td[3]/text()[1])
      normalize-space(//tr/td[3]/text()[2])
    }
    
    XPATHS.each{ |expression| p d.xpath(expression) }
    #=> "Wed 7/11/2012"
    #=> "TROLL\u00A0"
    #=> "9999999999999"
    #=> "07.11.12 \u00A0"
    #=> "CONNOTE FILE LODGED"
    #=> "Tue 20/11/2012 1:12 PM \u00A0 \u00A0"
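
Note that `normalize-space` only collapses XML whitespace (space, tab, carriage return, line feed), which is why the non-breaking spaces (`\u00A0`, produced by `&nbsp;`) survive in the output above. The following is a minimal sketch of one way to clean them up after extraction, in plain Ruby. The helper name `squish` and the `NORMALIZED` array are illustrative assumptions, not part of the original answer; the array simply repeats the strings shown above.

```ruby
# The normalized strings reported above, leftover non-breaking spaces included.
NORMALIZED = [
  "Wed 7/11/2012",
  "TROLL\u00A0",
  "9999999999999",
  "07.11.12 \u00A0",
  "CONNOTE FILE LODGED",
  "Tue 20/11/2012 1:12 PM \u00A0 \u00A0"
]

# Hypothetical helper: collapse every run of Unicode whitespace
# (the POSIX bracket [[:space:]] also matches U+00A0) into a single
# space, then trim both ends.
def squish(text)
  text.gsub(/[[:space:]]+/, ' ').strip
end

CLEANED = NORMALIZED.map { |text| squish(text) }
CLEANED.each { |text| p text }
#=> "Wed 7/11/2012"
#=> "TROLL"
#=> "9999999999999"
#=> "07.11.12"
#=> "CONNOTE FILE LODGED"
#=> "Tue 20/11/2012 1:12 PM"
```

If the cleanup has to happen inside the query itself, XPath 1.0's `translate()` can in principle map the non-breaking space character to an ordinary space before `normalize-space` is applied; that variant was not tested here.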
    

    【Discussion】:
