How to select the next node using Scrapy

【Question title】: How to select next node using scrapy
【Posted】: 2013-11-15 01:25:45
【Question description】:

My HTML looks like this:

<h1>Text 1</h1>
<div>Some info</div>
<h1>Text 2</h1>
<div>...</div>

I know how to extract the text of the h1 with Scrapy:

content.select("//h1[contains(text(),'Text 1')]/text()").extract()

But my goal is to extract the content of <div>Some info</div>.

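My problem is that I have no specific information about the div. All I know is that it comes right after <h1>Text 1</h1>. Can I use a selector to get the NEXT element in the tree, that is, the element at the same level of the DOM tree?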

Something like:

a = content.select("//h1[contains(text(),'Text 1')]/text()")
a.next("//div/text()").extract()
Some info

【Question comments】:

Tags: python html parsing dom scrapy

【Solution 1】:

Try this XPath:

//h1[contains(text(), 'Text 1')]/following-sibling::div[1]/text()

【Discussion】:

【Solution 2】:

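Use following-sibling. From https://www.w3.org/TR/2017/REC-xpath-31-20170321/:

the following-sibling axis contains the context node's following siblings, those children of the context node's parent that occur after the context node in document order.

Example: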

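from scrapy.selector import Selector

text = '''
<h1>Text 1</h1>
<div>Some info</div>
<h1>Text 2</h1>
<div>...</div>
'''
sel = Selector(text=text)
# Collect the text node of every h1 in document order.
h1s = sel.xpath('//h1/text()')
# XPath positions are 1-based, so start counting at 1.
for counter, h1 in enumerate(h1s, 1):
    # Pick the counter-th h1, then the first div sibling that follows it.
    div = sel.xpath('(//h1)[{}]/following-sibling::div[1]/text()'.format(counter))
    print(h1.get())
    print(div.get())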

The output is:

Text 1
Some info
Text 2
...

【Discussion】: