Parsing xpath with python: answers

【Question title】: Parsing xpath with python
【Posted】: 2023-04-10 06:51:02
【Question description】:

I am trying to parse a web page that contains the following:

<table style="width: 100%; border-top: 1px solid black; border-bottom: 1px solid black;">
<tr>
 <td colspan="2"
     style="border-top: 1px solid black; border-bottom: 1px solid black; background-color: #f0ffd3;">February 20, 2015</td>
</tr>
<tr>
 <td style="border-top: 1px solid gray; font-weight: bold;">9:00 PM</td>
 <td style="border-top: 1px solid gray; font-weight: bold">14°F</td>
</tr>
<tr>
 <td style="border-bottom: 1px solid gray;">Clear<br />
  Precip:
  0 %<br />
                                Wind:
                    from the WSW at 6 mph
 </td>
 <td style="border-bottom: 1px solid gray;"><img class="wxicon" src="http://i.imwx.com/web/common/wxicons/31/31.gif"
       style="border: 0px; padding: 0px 3px" /></td>
</tr>
<tr>
 <td style="border-top: 1px solid gray; font-weight: bold;">10:00 PM</td>
 <td style="border-top: 1px solid gray; font-weight: bold">13°F</td>
</tr>
<tr>
 <td style="border-bottom: 1px solid gray;">Clear<br />
  Precip:
  0 %<br />
                                Wind:
                    from the WSW at 6 mph
 </td>
 <td style="border-bottom: 1px solid gray;"><img class="wxicon" src="http://i.imwx.com/web/common/wxicons/31/31.gif"
       style="border: 0px; padding: 0px 3px" /></td>
</tr>

(It continues with more rows and ends with </table>.)

from lxml import html

# page holds the HTML source of the web page
tree = html.fromstring(page)
table = tree.xpath('//table/tr')
for item in table:
    for elem in item.xpath('*'):
        if 'colspan' in html.tostring(elem, encoding='unicode'):
            print('*', elem.text)
        elif elem.text is not None:
            print(elem.text, end=' ')
        else:
            print()

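This sort of works. It does not get the text after the <br />, and it is far from elegant. How can I get the missing text? Also, any suggestions for improving the code would be much appreciated.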

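【Question comments】:

Tags: python xpath lxml lxml.html

【Solution 1】:

How about using .text_content()?

.text_content(): Returns the text content of the element, including the text content of its children, with no markup.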

table = tree.xpath('//table/tr')
for item in table:
    # split() + join() collapses each run of whitespace into one space
    print(' '.join(item.text_content().split()))

join() + split() helps here by replacing each run of whitespace (including line breaks) with a single space.

This prints:

February 20, 2015
9:00 PM 14°F
Clear Precip: 0 % Wind: from the WSW at 6 mph
10:00 PM 13°F
Clear Precip: 0 % Wind: from the WSW at 6 mph
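
To see why elem.text in the question stops at "Clear", it may help to look at how lxml stores mixed content: an element's .text holds only the text before its first child, and whatever follows a child such as <br /> is kept on that child as .tail. The sketch below is an illustration, not part of the original answer; SNIPPET and squash() are made-up names, and the markup is a cut-down copy of one cell from the question.

```python
from lxml import html

# Hypothetical one-cell table modelled on the markup in the question.
SNIPPET = """<table><tr><td>Clear<br />
  Precip:
  0 %<br />
  Wind:
  from the WSW at 6 mph
</td></tr></table>"""


def squash(text):
    """Collapse every run of whitespace into a single space."""
    return ' '.join(text.split())


cell = html.fromstring(SNIPPET).xpath('//td')[0]

# .text only covers the text before the first child element.
first_text = cell.text

# Text that follows a child element is stored on that child as .tail.
tails = [squash(br.tail) for br in cell.findall('br')]

# text_content() walks the whole subtree, so it collects text and tails alike.
full_text = squash(cell.text_content())

print(first_text)
print(tails)
print(full_text)
```

Here first_text is just "Clear", the two tails carry the "Precip" and "Wind" parts, and full_text joins all three, which is why text_content() recovers the missing text.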

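Since you want to merge each time line with its Precip line, you can iterate over the tr tags but skip the ones that contain Precip in their text. For every time line, take the following tr sibling to get the Precip line: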

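# select only the rows that do not mention "Precip" (date and time rows)
table = tree.xpath('//table/tr[not(contains(., "Precip"))]')
for item in table:
    text = ' '.join(item.text_content().split())
    if 'AM' in text or 'PM' in text:
        # a time row: append the text of the row that immediately follows it
        text += ' ' + ' '.join(item.xpath('following-sibling::tr')[0].text_content().split())

    print(text)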

This prints:

February 20, 2015
9:00 PM 14°F Clear Precip: 0 % Wind: from the WSW at 6 mph
10:00 PM 13°F Clear Precip: 0 % Wind: from the WSW at 6 mph
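
For anyone who wants to try the merging step without fetching the real page, here is a self-contained sketch of the same idea. It is not the answerer's exact code: PAGE is a hypothetical, shortened copy of the table from the question, normalize() and merged_lines() are names invented for this example, and the [1] predicate plus the emptiness check are small additions so that a time row with no following row does not raise an IndexError.

```python
from lxml import html

# Hypothetical sample, reduced from the table in the question.
PAGE = """
<table>
<tr>
 <td colspan="2">February 20, 2015</td>
</tr>
<tr>
 <td style="font-weight: bold;">9:00 PM</td>
 <td style="font-weight: bold">14&#176;F</td>
</tr>
<tr>
 <td>Clear<br />
  Precip: 0 %<br />
  Wind: from the WSW at 6 mph
 </td>
 <td><img class="wxicon" src="31.gif" /></td>
</tr>
<tr>
 <td style="font-weight: bold;">10:00 PM</td>
 <td style="font-weight: bold">13&#176;F</td>
</tr>
<tr>
 <td>Clear<br />
  Precip: 0 %<br />
  Wind: from the WSW at 6 mph
 </td>
 <td><img class="wxicon" src="31.gif" /></td>
</tr>
</table>
"""


def normalize(text):
    """Collapse every run of whitespace into a single space."""
    return ' '.join(text.split())


def merged_lines(page):
    """Return one string per date row and one per time row plus its details."""
    tree = html.fromstring(page)
    lines = []
    # Skip the detail rows; they are pulled in through their time row instead.
    for row in tree.xpath('//table/tr[not(contains(., "Precip"))]'):
        text = normalize(row.text_content())
        if 'AM' in text or 'PM' in text:
            # [1] keeps only the nearest following row.
            detail = row.xpath('following-sibling::tr[1]')
            if detail:
                text += ' ' + normalize(detail[0].text_content())
        lines.append(text)
    return lines


for line in merged_lines(PAGE):
    print(line)
```

Returning a list instead of printing inside the loop is only a convenience for reuse; the selection logic is the same as in the snippet above.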

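【Comments】:

Much better! Is there a good way to identify whether a line is a date line, a time line, or something else (using xpath, without parsing the content)? If nothing else, I would like to merge each time line with its "Clear Precip" line.
@foosion For the date line, I would follow the EAFP principle: try to load the content with datetime.strptime() and handle ValueError. If no error is raised, it is a date line. For the time line, I think you can search the content for the words PM or AM. It looks like the other lines start with "Clear Precip"..
@foosion Let me give you a sample, give me a minute.
alecxe, I know how to do that. I was hoping there was a way to use xpath, rather than parsing the text, to see whether a line is a date, a time, or something else. For example, the date is part of a <td colspan="2">.
@foosion As for the date line, you are right: we can check whether the row has a td child with colspan="2", like this: if item.xpath('.//td[@colspan="2"]'):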
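
The two ideas from these comments can be put side by side in a short sketch. Nothing below comes from the thread verbatim except the colspan check: PAGE is a hypothetical sample, row_kind_by_xpath() and is_date_text() are invented names, the "font-weight: bold" test for time rows is an assumption read off the markup in the question, and the '%B %d, %Y' format assumes English month names in the default locale.

```python
from datetime import datetime

from lxml import html

# Hypothetical sample, reduced from the table in the question.
PAGE = """
<table>
<tr>
 <td colspan="2">February 20, 2015</td>
</tr>
<tr>
 <td style="border-top: 1px solid gray; font-weight: bold;">9:00 PM</td>
 <td style="border-top: 1px solid gray; font-weight: bold">14&#176;F</td>
</tr>
<tr>
 <td style="border-bottom: 1px solid gray;">Clear<br />
  Precip: 0 %<br />
  Wind: from the WSW at 6 mph
 </td>
 <td style="border-bottom: 1px solid gray;"><img class="wxicon" src="31.gif" /></td>
</tr>
</table>
"""


def normalize(text):
    """Collapse every run of whitespace into a single space."""
    return ' '.join(text.split())


def row_kind_by_xpath(row):
    """Classify a <tr> from its markup alone, without reading its text."""
    if row.xpath('./td[@colspan="2"]'):
        return 'date'
    # Assumption: only time rows use bold cells on this page.
    if row.xpath('./td[contains(@style, "font-weight: bold")]'):
        return 'time'
    return 'detail'


def is_date_text(text, fmt='%B %d, %Y'):
    """EAFP check: try to parse, and treat ValueError as 'not a date'."""
    try:
        datetime.strptime(text, fmt)
    except ValueError:
        return False
    return True


tree = html.fromstring(PAGE)
kinds = []
for row in tree.xpath('//table/tr'):
    text = normalize(row.text_content())
    kinds.append((row_kind_by_xpath(row), is_date_text(text)))
    print(kinds[-1], text)
```

The markup-based check depends on the page's styling staying the same, while the strptime() check depends on the date format, so which one is more robust depends on which of the two is less likely to change.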