XPath in Scrapy: is there a way to combine text from the children of the same parent into a single element before it is returned?

【Question title】: XPath in Scrapy: is there a way to combine text from the children of the same parent into a single element before it is returned?
【Posted】: 2015-01-29 05:25:44
【Question】:

I am doing some web scraping with Scrapy for Python, and I am trying to get the text in the last td of the last tr in the HTML below.

<table class="infobox" style="float: right; width: 225px; text-align: left; -moz-border-radius:10px; font-size: 85%" cellpadding="2">
    <tr style="vertical-align: top;">
        <td> <b>Name</b> </td>
        <td> Abraham Lincoln
        </td>
    </tr>
    <tr style="vertical-align: top;">
        <td> <b>Sex</b> </td>
        <td> Male
        </td>
    </tr>
    <tr style="vertical-align: top;">
        <td> <b>Occupation </b>
        </td>
        <td> Former King of <a href="/wiki/Mars" title="Mars">Mars</a>,
            <br />Former President of the United States
        </td>
    </tr>
</table>

Currently, I have written this in my spider's parse function.

def parse(self, response):
    sel = Selector(response)
    data = sel.xpath("//table[@class='infobox']")
    occupation = data.xpath("tr[td/b[contains(.,'Occupation')]]/td[position()>1]/text()").extract()
    print occupation

The printed result is:

[u' Former King of ', u',', u'Former President of the United States\n']

What I would really like to get is something like this (the most important change is that "Mars" is joined onto "Former King of"):

[u'Former King of Mars', u'Former President of the United States']

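I know about the | union operator in XPath, and I could have written more in the occupation expression to capture the "Mars" text in the a tag. However, I want to be able to join the a tag text with the td text so that "Former King of Mars" comes out as one element of the printed list. I think that with a union, "Mars" would show up as its own element in the list, which is not what I need. So I am hoping there is some way in XPath to join the text of the children of the parent td so that I get "Former King of Mars" as a single element of the output list.

Also, there may be more than one a tag in the td; for example, "King" could also be inside an a tag. Another requirement is to keep "Former President of the United States" as a separate element (by recognizing the br tag somehow?). I am not sure what the best way to handle these cases is, but I think that if there is a way to do it in XPath, it would be better than working with lists in Python, because XPath still has a reference to the DOM tree. What do you think? Thanks!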

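【Comments】:

Please refer to this link: stackoverflow.com/questions/19309960/…
"Mars" and "King" are not child and parent; they are siblings, as are "Former" and <br />. You can get them as a list with tr[td/b[contains(.,'Occupation')]]/td[position()>1]//text(), but you will have to split them yourself.
Please try td[position()>1]/descendant::text()
@AvinashRaj I haven't tried it yet, but I have already written a bunch of Scrapy code, so I am hoping I can stick with it. If XPath doesn't give me what I want, I will have to give bs4 a chance.
@JoelM.Lamsen That returns [u' Former King of ', u'Mars', u',', u'Former President of the United States\n'], but now that I have this list, it is hard to write logic that recognizes that "Mars" should go with "Former King of" (and not with something else).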
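
As a rough illustration of the "split them yourself" approach mentioned in the comments, the sketch below walks the children of the td and starts a new list element at every br, so link text stays attached to the text around it. It is only a sketch under some assumptions: it uses the standard-library xml.etree.ElementTree instead of Scrapy selectors (the sample row happens to be well-formed XML), the helper name split_on_br is made up for this example, it only looks at br tags that are direct children of the td, and it strips a trailing comma from each piece.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the Occupation row from the question.
ROW = """<tr>
    <td> <b>Occupation </b>
    </td>
    <td> Former King of <a href="/wiki/Mars" title="Mars">Mars</a>,
        <br />Former President of the United States
    </td>
</tr>"""


def split_on_br(cell):
    """Hypothetical helper: group a cell's text into pieces separated by <br>."""
    segments = []
    current = [cell.text or ""]  # text before the first child element
    for child in cell:
        if child.tag == "br":
            # A <br> closes the current piece and starts a new one.
            segments.append("".join(current))
            current = []
        else:
            # Any other child (e.g. <a>) contributes its inner text in place.
            current.append("".join(child.itertext()))
        # Text that follows the child inside the same cell.
        current.append(child.tail or "")
    segments.append("".join(current))
    # Collapse whitespace, drop a trailing comma, and skip empty pieces.
    cleaned = (" ".join(s.split()).rstrip(",").strip() for s in segments)
    return [s for s in cleaned if s]


cell = ET.fromstring(ROW).findall("td")[1]
occupation = split_on_br(cell)
print(occupation)
# ['Former King of Mars', 'Former President of the United States']
```

Because the grouping is driven by the br elements rather than by the order of a flat text list, several a tags in the same piece (for example one around "King") would still end up in the same element.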

Tags: python html xpath web-scraping scrapy

【Solution 1】:

With BeautifulSoup, I would do it as shown below.

>>> import re
>>> from bs4 import BeautifulSoup
>>> s = """<table class="infobox" style="float: right; width: 225px; text-align: left; -moz-border-radius:10px; font-size: 85%" cellpadding="2">
    <tr style="vertical-align: top;">
        <td> <b>Name</b> </td>
        <td> Abraham Lincoln
        </td>
    </tr>
    <tr style="vertical-align: top;">
        <td> <b>Sex</b> </td>
        <td> Male
        </td>
    </tr>
    <tr style="vertical-align: top;">
        <td> <b>Occupation </b>
        </td>
        <td> Former King of <a href="/wiki/Mars" title="Mars">Mars</a>,
            <br />Former President of the United States
        </td>
    </tr>
</table>"""
>>> soup = BeautifulSoup(s, 'html.parser')
>>> tr = soup.find_all('tr')[-1]
>>> td = tr.find_all('td')[-1]
>>> x = re.split(r',?\n\s*', td.text)
>>> [i for i in x if i]
[' Former King of Mars', 'Former President of the United States']

【Comments】:

【Solution 2】:

Try this:

def parse(self, response):
    sel = Selector(response)
    data = sel.xpath("//table[@class='infobox']")
    occupation = data.xpath("normalize-space(tr[td/b[contains(.,'Occupation')]]/td[position()>1])").extract()
    print occupation

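This returns the string value of the td element as a single string, with leading and trailing whitespace stripped and internal runs of whitespace (including the line breaks) collapsed into single spaces.

According to the spec:

The string-value of an element node is the concatenation of the string-values of all text node descendants of the element node in document order.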
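
To make the quoted rule concrete, here is a small sketch that rebuilds the string-value by hand and then collapses whitespace roughly the way normalize-space() does. It is an approximation under stated assumptions: it uses the standard-library xml.etree.ElementTree rather than Scrapy, the helper names string_value and normalize_space are invented for this example, and Python's str.split() treats a few more characters as whitespace than XPath does.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the Occupation row from the question.
ROW = """<tr>
    <td> <b>Occupation </b>
    </td>
    <td> Former King of <a href="/wiki/Mars" title="Mars">Mars</a>,
        <br />Former President of the United States
    </td>
</tr>"""


def string_value(element):
    """Concatenate all descendant text nodes in document order."""
    return "".join(element.itertext())


def normalize_space(text):
    """Rough stand-in for XPath normalize-space(): trim and collapse whitespace."""
    return " ".join(text.split())


cell = ET.fromstring(ROW).findall("td")[1]
joined = normalize_space(string_value(cell))
print(joined)
# Former King of Mars, Former President of the United States
```

Note that the text of the a tag is now joined to its neighbours, but the result is one string, so the br boundary between the two occupations is not preserved.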

【Comments】:

【Solution 3】:

You can try this XPath:

concat(//tr[td/b[contains(.,'Occupation')]]/td[position() > 1]/descendant::text()[following::br], //tr[td/b[contains(.,'Occupation')]]/td[position() > 1]/descendant::text()[preceding::br])

【Comments】: