Crawl and extract data from a table with XPath

【Question title】: Crawl and extract data from a table with XPath
【Posted】: 2017-05-27 10:33:50
【Question description】:

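I am crawling city Wiki pages and need to extract the country each city belongs to. My approach is to find the <th> that contains the word "Country", go up to its <tr>, and then read the value from the <td> in that row. The problem is that the markup comes in several variants.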

(My code, which handles the first case)

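# doc is assumed to be an already-parsed page (for example an lxml HTML tree)
a = doc.xpath("//table[contains(@class, 'infobox')]")
# ".//" keeps the search inside the infobox; a leading "//" would scan the whole document
b = a[0].xpath(".//th[contains(text(),'Country') or contains(text(),'country')]")
# take the first link text in the row's <td>, e.g. "Poland"
country = b[0].xpath("./../td//a//text()")[0].replace(" ", "_")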

I know why it does not work for the other cases, but I do not know how to fix it.
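For readers following along, the failure can be reproduced with a small sketch. It assumes the lxml library (the question does not name its parser), and the fragments and the `count_headers` helper are hypothetical stand-ins for the real pages. In XPath 1.0, `contains(text(), ...)` only inspects the first text node that is a direct child of the `<th>`, whereas `contains(., ...)` inspects the element's whole string value, including text nested in `<span>` or `<a>`.

```python
from lxml import html

# Hypothetical fragments mirroring two header shapes from the question.
PLAIN = '<table><tr><th scope="row">Country</th><td>Poland</td></tr></table>'
NESTED = (
    '<table><tr><th scope="row">'
    '<a href="/wiki/Countries_of_the_United_Kingdom">Country</a>'
    '</th><td>England</td></tr></table>'
)

# Looks only at the first text node that is a direct child of <th>.
DIRECT_TEXT = "contains(text(), 'Country')"
# Looks at the string value of <th>, i.e. all descendant text joined together.
STRING_VALUE = "contains(., 'Country')"


def count_headers(fragment, predicate):
    """Count the <th> elements in `fragment` that satisfy the XPath predicate."""
    doc = html.fromstring(fragment)
    return len(doc.xpath("//th[%s]" % predicate))


for name, fragment in (("plain", PLAIN), ("nested", NESTED)):
    print(name,
          count_headers(fragment, DIRECT_TEXT),
          count_headers(fragment, STRING_VALUE))
```

The nested header is missed by the `text()` predicate because the word sits inside the `<a>` element rather than directly inside the `<th>`.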

Case 1: the keyword "Country" is directly inside <th>

<tr class="mergedtoprow">
      <th scope="row">Country</th>
      <td>
        <a href="/wiki/Poland" title="Poland">Poland</a>
      </td>
</tr>

Case 2: the keyword "country" is inside <a>, inside <span>, inside <th>


    <tr class="mergedrow">
      <th scope="row">
       <span class="nowrap">
        <a href="/wiki/Countries_of_the_United_Kingdom" title="Countries of the United Kingdom">Constituent country
        </a>
       </span>
      </th>
      <td>
       <span class="flagicon"><img alt="" src="SRC (never mind)" width="23" 
       height="14" class="thumbborder" srcset="SRC (never mind)" />&#160;
       </span>
       <a href="/wiki/England" title="England">England</a>
      </td>
    </tr>

Case 3: the keyword "Country" is inside <a>, inside <th>

 

       <tr class="mergedrow">
          <th scope="row">
            <a href="/wiki/Countries_of_the_United_Kingdom" title="Countries of the United Kingdom">Country
            </a>
          </th>
          <td>England</td>
        </tr>

【Comments】:

"Wiki pages"? If you mean Wikipedia, why don't you use Wikidata?
This is a university assignment.
Sure; I think that is a bad question :)

Tags: html xpath web-crawler

【Solution 1】:

In all of the cases mentioned, you can use the XPath below to match the required th element:

//th[matches(normalize-space(), "country", "i")]

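Note that the "i" flag makes the search case-insensitive, so both "Country" and "country" match. matches() is an XPath 2.0 function, so it is not available in XPath 1.0 engines.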
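If the tool in use is Python's lxml (an assumption; the question's `doc.xpath(...)` calls merely look like it), the expression above will not run, because lxml evaluates XPath 1.0. A rough equivalent there is the EXSLT regular-expression function `re:test()`, which lxml supports and which also accepts an "i" flag. The sketch below uses a made-up sample table and a hypothetical helper name, `find_country_headers`.

```python
from lxml import html

# Namespace that lxml binds to its EXSLT regular-expression functions.
EXSLT_RE = {"re": "http://exslt.org/regular-expressions"}

# Hypothetical sample: one header mentions "COUNTRY" in nested markup, one does not.
SAMPLE = (
    '<table class="infobox">'
    '<tr><th scope="row"><span class="nowrap">'
    '<a href="/wiki/Countries_of_the_United_Kingdom">Constituent COUNTRY</a>'
    '</span></th><td>England</td></tr>'
    '<tr><th scope="row">Area</th><td>n/a</td></tr>'
    '</table>'
)


def find_country_headers(page_html):
    """Return the <th> elements whose text contains 'country' in any letter case."""
    doc = html.fromstring(page_html)
    # re:test(., pattern, flags) runs the regex against the string value of each <th>.
    return doc.xpath("//th[re:test(., 'country', 'i')]", namespaces=EXSLT_RE)


headers = find_country_headers(SAMPLE)
print(len(headers))
```

Because `.` is the string value of the `<th>`, the match succeeds even when the keyword is wrapped in `<span>` or `<a>`.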

If your tool supports only XPath 1.0, you can use:

//th[contains(.,'Country') or contains(.,'country')]
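To show how the XPath 1.0 expression could slot into the asker's flow, here is a minimal sketch, assuming lxml and using simplified copies of the three rows from the question. The names `CASES`, `TH_XPATH` and `extract_country` are illustrative, not from the original post. Instead of requiring an `<a>` inside the `<td>` (which case 3 lacks), it reads the whole text of the cell and collapses whitespace in Python, which also discards the non-breaking space left by the flag icon in case 2.

```python
from lxml import html

# Simplified copies of the three header variants described in the question.
CASES = [
    '<table class="infobox"><tr class="mergedtoprow"><th scope="row">Country</th>'
    '<td><a href="/wiki/Poland" title="Poland">Poland</a></td></tr></table>',

    '<table class="infobox"><tr class="mergedrow"><th scope="row">'
    '<span class="nowrap"><a href="/wiki/Countries_of_the_United_Kingdom">'
    'Constituent country</a></span></th>'
    '<td><span class="flagicon">&#160;</span>'
    '<a href="/wiki/England" title="England">England</a></td></tr></table>',

    '<table class="infobox"><tr class="mergedrow"><th scope="row">'
    '<a href="/wiki/Countries_of_the_United_Kingdom">Country</a></th>'
    '<td>England</td></tr></table>',
]

# XPath 1.0 expression from the answer, scoped to the infobox table.
TH_XPATH = ("//table[contains(@class, 'infobox')]"
            "//th[contains(., 'Country') or contains(., 'country')]")


def extract_country(page_html):
    """Return the country name with spaces replaced by underscores, or None."""
    doc = html.fromstring(page_html)
    for th in doc.xpath(TH_XPATH):
        cells = th.xpath("following-sibling::td[1]")
        if not cells:
            continue
        # text_content() joins all nested text; split() also drops non-breaking spaces.
        text = " ".join(cells[0].text_content().split())
        if text:
            return text.replace(" ", "_")
    return None


for case in CASES:
    print(extract_country(case))
```

Returning `None` when no header or cell is found avoids the `IndexError` that indexing with `[0]` would raise on pages without a country row.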

【Comments】: