Answers: selecting the nodes between two nodes with XPath in Scrapy 0.24.5

【Question title】: XPath select nodes between two nodes in Scrapy 0.24.5
【Posted】: 2015-09-11 00:49:10
【Question】:

<h3>Q1</h3>  
<p><p>text1</p></p><a name="1"> </a>  
<p>...</p>  
...  
<ul><li>...</li></ul>
<h3>Q2</h3>  
<p>text2</p><a name="2"> </a>  
<p>...</p>  
...  
<ul><li>...</li></ul>
<h3>Q3</h3>  
<p>text3</p>
<p>...</p>  
...  
<ul><li>...</li></ul>

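Above is my HTML. I want to get the text of each h3 together with the text of the nodes that follow it, up to the next h3. In other words, if I put them into a dictionary, the result would look like this: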

{Q1:text1, Q2:text2, Q3:text3}

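I tried selecting all the h3 tags first and then iterating over that list. For each h3 tag, I tried to select all the nodes that come before the next h3 tag. Here is my code: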

>>> h3_tags = response.xpath(".//h3")
>>> for h3_tag in h3_tags:
...     texts = h3_tag.xpath("./following-sibling::node()[count(preceding-sibling::h3)=1]/descendant-or-self::text()").extract()

But this only extracts the p text after the first h3 tag (and it also includes the text of the second h3 tag). For the remaining h3 tags I get nothing.

If I use:

>>> h3_tags = response.xpath(".//h3")
>>> for h3_tag in h3_tags:
...     texts = h3_tag.xpath("./following-sibling::node()[preceding-sibling::h3]/descendant-or-self::text()").extract()

I get redundant text from the p elements of the second and third h3.

I'm using this in Scrapy 0.24.5, and this is my first day with it. Any help is appreciated!

【Question comments】:

stackoverflow.com/questions/30629183/…
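Hi splash58, thanks for the link. I edited my code accordingly: texts = h3_tag.xpath("./following-sibling::*[preceding-sibling::h3[1][contains(.,title)] and not (name()='h3')]/descendant-or-self::text()").extract() But for the first and second h3 I get redundant content that belongs to the following h3. If I add (count(preceding-sibling::h3)=1) to the condition, I only get the content of the first h3 tag and nothing for the later ones.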

Tags: python html xpath web-scraping scrapy

【Answer 1】:

With the help of enumerate(), you can still use the count(preceding-sibling...) technique:

>>> for cnt, h3 in enumerate(selector.xpath('.//h3'), start=1):
...     print h3.xpath('./following-sibling::node()[count(preceding-sibling::h3)=%d]' % cnt).extract()
... 
[u'  \n', u'<p></p>', u'<p>text1</p>', u'<a name="1"> </a>', u'  \n', u'<h3>Q2</h3>']
[u'  \n', u'<p>text2</p>', u'<a name="2"> </a>', u'  \n', u'<h3>Q3</h3>']
[u'  \n', u'<p>text3</p>']
>>> 
>>> for cnt, h3 in enumerate(selector.xpath('.//h3'), start=1):
...     print h3.xpath('./following-sibling::node()[count(preceding-sibling::h3)=%d]/descendant-or-self::text()' % cnt).extract()
... 
[u'  \n', u'text1', u' ', u'  \n', u'Q2']
[u'  \n', u'text2', u' ', u'  \n', u'Q3']
[u'  \n', u'text3']
>>>
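
To get from these lists to the dictionary asked for in the question, one option is to also drop the trailing h3 (it has the same count as the nodes before it) and discard whitespace-only text. The following is only a sketch under some assumptions: it uses lxml directly on Python 3 instead of a Scrapy 0.24.5 Selector, the SAMPLE markup and the item1/item2/item3 list contents are made up to stand in for the elided parts of the question's HTML, and sections_by_heading is a hypothetical helper name.

```python
from lxml import html

# Hypothetical sample modelled on the question's HTML; the elided "..."
# parts are replaced with made-up list items.
SAMPLE = """<div>
<h3>Q1</h3>
<p>text1</p><a name="1"> </a>
<ul><li>item1</li></ul>
<h3>Q2</h3>
<p>text2</p><a name="2"> </a>
<ul><li>item2</li></ul>
<h3>Q3</h3>
<p>text3</p>
<ul><li>item3</li></ul>
</div>"""


def sections_by_heading(root):
    """Map each h3's text to the non-empty text chunks that follow it.

    Assumes all h3 elements are siblings under the same parent, so the
    n-th h3 in document order has n-1 preceding h3 siblings.
    """
    result = {}
    for cnt, h3 in enumerate(root.xpath(".//h3"), start=1):
        texts = h3.xpath(
            "./following-sibling::node()"
            "[count(preceding-sibling::h3)=%d]"  # nodes after the cnt-th h3
            "[not(self::h3)]"                    # drop the next h3 itself
            "/descendant-or-self::text()" % cnt
        )
        # Keep only chunks that contain something other than whitespace.
        chunks = [t.strip() for t in texts if t.strip()]
        result[h3.xpath("string(.)").strip()] = chunks
    return result


result = sections_by_heading(html.fromstring(SAMPLE))
for title, chunks in result.items():
    print(title, chunks)
```

The [not(self::h3)] predicate is what removes the u'Q2' and u'Q3' entries seen at the end of the lists above. With a Scrapy selector the same XPath string should work, with .extract() called on the result.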

Note that <p><p>text1</p></p> does not play well with lxml: it produces two sibling p elements instead of a p nested inside a p.
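
A quick way to see this for yourself is to parse just that fragment and look at how the p elements end up arranged. This is a small sketch that uses lxml directly (on Python 3) rather than a Scrapy selector, and the wrapping div is an assumption added so the fragment has a single root. The exact number of p elements may depend on the parser version, so the code only checks that no p ends up inside another p.

```python
from lxml import html

# The nested-paragraph fragment from the question, wrapped in a div
# (the div is added here only to give the fragment a single root).
fragment = html.fromstring('<div><p><p>text1</p></p><a name="1"> </a></div>')

paragraphs = fragment.xpath("./p")   # p elements directly under the div
nested = fragment.xpath(".//p/p")    # p elements whose parent is a p

print("p elements under the div:", len(paragraphs))
print("p elements nested in a p:", len(nested))
print("paragraph texts:", [p.text for p in paragraphs])
```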

【Comments】:

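I thought count(preceding-sibling::h3)=1 would eliminate the following-sibling h3 elements that are not direct siblings of the current h3. Any idea why this doesn't work?
For the 2nd h3, for example, its following siblings have at least 2 preceding h3 elements. Remember that siblings are all at the same level of the tree. You just need to count the preceding elements that form the "boundary".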
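
To make the counting concrete, here is a small sketch that prints count(preceding-sibling::h3) for every sibling in a cut-down version of the question's HTML. It assumes lxml used directly on Python 3 (not a Scrapy selector), and the simplified markup is made up for illustration.

```python
from lxml import html

# Simplified, hypothetical version of the question's markup:
# three h3 headings, each followed by a single p.
root = html.fromstring(
    "<div>"
    "<h3>Q1</h3><p>text1</p>"
    "<h3>Q2</h3><p>text2</p>"
    "<h3>Q3</h3><p>text3</p>"
    "</div>"
)

# For every child of the div, count the h3 siblings that precede it.
counts = [
    (el.tag, int(el.xpath("count(preceding-sibling::h3)")))
    for el in root
]
for tag, n in counts:
    print(tag, n)

# Starting from the second h3, no following sibling has exactly one
# preceding h3, so the "=1" predicate selects nothing.
second_h3 = root.xpath("./h3")[1]
matches = second_h3.xpath(
    "./following-sibling::*[count(preceding-sibling::h3)=1]"
)
print(matches)
```

The count is taken relative to each candidate node, not relative to the h3 the expression starts from, which is why the boundary value has to grow with each heading.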