【Question title】: XPath to extract all text between two 'p' elements in Scrapy
【Posted】: 2022-01-16 17:01:37
【Question】:

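I am trying to scrape a database using Scrapy and Splash. It requires a login, so unfortunately I cannot share the full website. The database contains a list of companies, showing each company's name and a short description.

I am struggling to find an XPath expression that yields all the text between two 'p' tags, as shown below: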

<p class="pre-wrap ng-binding"
ng-bind-html="object._source.startup.general_information.project_public_description"
ng-click="listView.showDetail(object)" role="button" tabindex="0">
  <div>With the vision of providing creative sustainable solutions for global food crisis,
    AquiNovo develops innovative, non-GMO, non-hormonal, peptide-based feed additives,
    addressing the ever-growing demand for fish protein. Company’s additives improve both growth
    performance and feed utilization, enabling the <strong><em>growth of more fish with less
            feed</em></strong>. A unique peptide production system, enables large commercial
    scale production at significant lower cost and carbon footprint. Growing more fish with less
    feed also promote several SDG’s including the reduction of pressure on fish population in
    the sea, providing food security and reducing hunger and poverty, climate change and
    responsible production.&nbsp;</div>
</p>

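All company descriptions follow the same format (between two 'p' elements), but as the HTML shows, there are also <strong><em> elements. I would like help creating an XPath that returns all the text, including the text inside the <strong><em> elements, as one single block of text (it is a single description and is not split up in the body when viewed on the website).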

I tried the following, but it only returns the part before the nested <strong><em> element: //p[@class='pre-wrap ng-binding']//div//text()

I used the following code:

'the descript': ''.join(startup.xpath('//div//text()').getall()),

【Comments】:

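  • Please provide the entire HTML of the page.
  • The computed text value of //p[@class='pre-wrap ng-binding'] should be what you need. Select that element and ask for its value. You can do the same for the div.
  • Show the code you use to apply the XPath. Are you using .xpath().get() or .xpath().getall()?
  • @Prophet, I tried to include it, but when I view the page source it does not contain the HTML I added here. If you know a way, please let me know.
  • @MadsHansen I was using .xpath().get(), but after the answer below I changed it to getall().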
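
The comment about the "computed text value" refers to an element's string value: the concatenation of all of its descendant text nodes, in document order, regardless of nested tags such as <strong> or <em>. The sketch below illustrates the idea with Python's standard-library ElementTree on a shortened, well-formed version of the snippet (the shortened text and the variable names are assumptions made for illustration). In Scrapy, the analogous expression would likely be something along the lines of startup.xpath('string(.//div)').get(), which is untested against the real page.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed stand-in for the description markup (illustrative only).
snippet = (
    '<p class="pre-wrap ng-binding">'
    "<div>enabling the <strong><em>growth of more fish with less feed</em></strong>."
    " A unique peptide production system</div>"
    "</p>"
)

root = ET.fromstring(snippet)
div = root.find("div")

# itertext() walks every descendant text node in document order,
# so text inside <strong><em> is included in the joined result.
description = "".join(div.itertext())
print(description)
```

Joining the descendant text nodes gives one continuous description instead of separate fragments around the nested elements.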

Tags: web-scraping xpath scrapy scrapy-splash


【Solution 1】:
scrapy shell

In [1]: html = """<html>
   ...: <body>
   ...: <p class="pre-wrap ng-binding"
   ...: ng-bind-html="object._source.startup.general_information.project_public_description"
   ...: ng-click="listView.showDetail(object)" role="button" tabindex="0">
   ...:   <div>With the vision of providing creative sustainable solutions for global food crisis,
   ...:     AquiNovo develops innovative, non-GMO, non-hormonal, peptide-based feed additives,
   ...:     addressing the ever-growing demand for fish protein. Company’s additives improve both growth
   ...:     performance and feed utilization, enabling the <strong><em>growth of more fish with less
   ...:             feed</em></strong>. A unique peptide production system, enables large commercial
   ...:     scale production at significant lower cost and carbon footprint. Growing more fish with less
   ...:     feed also promote several SDG’s including the reduction of pressure on fish population in
   ...:     the sea, providing food security and reducing hunger and poverty, climate change and
   ...:     responsible production.&nbsp;</div>
   ...: </p>
   ...: </body>
   ...: </html>"""

In [2]: selector = scrapy.Selector(text=html)

In [3]: ''.join(selector.xpath('//div//text()').getall())
Out[3]: 'With the vision of providing creative sustainable solutions for global food crisis,\n    AquiNovo develops innovative, non-GMO, non-hormonal, peptide-based feed additives,\n    addressing the ever-growing demand for fish protein. Company’s additives improve both growth\n    performance and feed utilization, enabling the growth of more fish with less\n            feed. A unique peptide production system, enables large commercial\n    scale production at significant lower cost and carbon footprint. Growing more fish with less\n    feed also promote several SDG’s including the reduction of pressure on fish population in\n    the sea, providing food security and reducing hunger and poverty, climate change and\n    responsible production.\xa0'
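
The joined string above still contains the source indentation (newlines followed by runs of spaces) and a trailing non-breaking space (\xa0). If a single clean block of text is wanted, one possible post-processing step is to collapse all whitespace in plain Python. This is a minimal sketch, not part of the original answer; the raw value is an abbreviated excerpt of the output above. Note that XPath's normalize-space() does not treat \xa0 as whitespace, which is one reason to do this step in Python.

```python
# Abbreviated excerpt of the joined text, including the indentation and the
# trailing non-breaking space seen in the shell output.
raw = (
    "enabling the growth of more fish with less\n"
    "            feed. A unique peptide production system,\n"
    "    responsible production.\xa0"
)

# str.split() with no arguments splits on any Unicode whitespace
# (including "\n" and "\xa0") and drops empty pieces, so joining with a
# single space collapses every whitespace run and trims both ends.
description = " ".join(raw.split())
print(description)
```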

【Comments】:

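  • Thanks, this helped, since I finally got some text from the description section. The problem is that it scrapes not only the text in the HTML I posted but far too much text. Would you happen to know why //p//text() returns only an empty string ''?
  • To solve your problem, add the parent of the div tag to the XPath selector, or add the position of the div in the HTML. Regarding //p//text(): according to the HTML standard a div cannot be inside a p, so if you run selector.xpath('//p').get() you will see a </p> before the div tag. What you actually have is <p...>\n</p><div>...</div> (at least that is my theory).
  • Unfortunately, even after trying your suggestions, it still does not work.
  • @BerciVagyok If you share the URL, I can help more.
  • Unfortunately I cannot provide it, because it requires login credentials that I cannot share :/ Thanks for your help!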
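
One plausible reason for "too much text" is scope. In Scrapy, an XPath that starts with // is evaluated against the whole document even when it is called on a nested selector such as startup, so //div//text() collects the text of every div on the page, whereas .//div//text() stays inside the current element. The sketch below shows the same distinction with the standard-library ElementTree. The <section> wrappers and the sample text are hypothetical stand-ins for the real listing markup, which is not available.

```python
import xml.etree.ElementTree as ET

# Hypothetical listing page with two company cards (illustrative markup only).
page = ET.fromstring(
    "<body>"
    "<section><div>First <em>company</em> description.</div></section>"
    "<section><div>Second company description.</div></section>"
    "</body>"
)

# Document-wide search: every div on the page contributes text,
# which mirrors what an absolute //div//text() would collect.
all_text = "".join("".join(div.itertext()) for div in page.iter("div"))

# Search scoped to each card: './/div' is relative to the card element,
# which mirrors using .//div//text() on each item selector.
per_card = [
    "".join(card.find(".//div").itertext())
    for card in page.findall("section")
]

print(all_text)
print(per_card)
```

If the real page follows this pattern, changing the expression in the spider to start with a dot would be the first thing to try.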