Scraper giving blank output

【Question title】: Scraper giving blank output
【Posted】: 2018-04-04 07:25:10
【Question description】:

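I used a selector in my Python script to get the text from the HTML elements given below. I tried using .text to get the string Shop here cheap from the element, but it doesn't work at all. However, when I use .text_content(), it works as expected.

My question is:

What is wrong with .text? Why can't it extract the text from the element?

HTML elements: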

<div class="Price__container">
    <span class="ProductPrice" itemprop="price">$6.35</span>
    <span class="ProductPrice_original">$6.70</span>
    Shop here cheap
</div>

What I tried:

from lxml import html

# `element` is assumed to hold the HTML snippet above as a string.
tree = html.fromstring(element)
for data in tree.cssselect(".Price__container"):
    print(data.text)  # It doesn't work at all

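By the way, I would prefer not to use .text_content(), which is why I would appreciate any answer that gets the text using .text. Thanks in advance.

【Question discussion】:

Tags: python python-3.x web-scraping css-selectors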

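【Solution 1】:

I think the root cause of the confusion is that lxml uses the .text and .tail concept to represent a node's content, which avoids the need for a special "text" node entity. Quoting the documentation:

The two properties .text and .tail are enough to represent any text content in an XML document. This way, the ElementTree API does not require any special text nodes in addition to the Element class, that tend to get in the way fairly often (as you might know from classical DOM APIs).

In your case, Shop here cheap is the tail of the <span class="ProductPrice_original">$6.70</span> element, and is therefore not included in the parent node's .text value. The parent's .text holds only the text that precedes its first child, which here is just whitespace, hence the blank output.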
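To see where each piece of text lives, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree instead of lxml (an assumption made so the snippet runs without extra packages); lxml follows the same .text/.tail model, so the values should match for this markup.

```python
import xml.etree.ElementTree as ET

# The markup from the question (it is also well-formed XML).
markup = """<div class="Price__container">
    <span class="ProductPrice" itemprop="price">$6.35</span>
    <span class="ProductPrice_original">$6.70</span>
    Shop here cheap
</div>"""

div = ET.fromstring(markup)
first_span, second_span = list(div)

# .text holds only what comes before the first child: whitespace here.
print(repr(div.text))          # '\n    '
# .tail holds the text between an element's end tag and the next tag.
print(repr(first_span.tail))   # '\n    '
print(repr(second_span.tail))  # '\n    Shop here cheap\n'
```

The wanted string sits on the second span's .tail, not on the div's .text.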

Besides other approaches such as .text_content(), you can reach the tail by getting all top-level text nodes non-recursively:

print(''.join(data.xpath("./text()")).strip())

Or, get the last top-level text node:

print(data.xpath("./text()[last()]")[0].strip())
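If you would rather stay with .text and .tail than XPath, the same top-level text can be assembled by hand: the parent's .text plus the .tail of each direct child. The helper name direct_text below is hypothetical, and the sketch uses the standard library's xml.etree.ElementTree as a stand-in so it runs without lxml; it relies only on child iteration, .text, and .tail, which lxml elements also provide.

```python
import xml.etree.ElementTree as ET


def direct_text(element):
    """Join the element's own text with the tails of its direct children.

    Hypothetical helper: it skips the text inside the children themselves,
    so it is non-recursive, unlike .text_content().
    """
    parts = [element.text or ""]
    parts.extend(child.tail or "" for child in element)
    return "".join(parts).strip()


markup = """<div class="Price__container">
    <span class="ProductPrice" itemprop="price">$6.35</span>
    <span class="ProductPrice_original">$6.70</span>
    Shop here cheap
</div>"""

div = ET.fromstring(markup)
print(direct_text(div))  # Shop here cheap
```

The `or ""` guards matter because .text and .tail are None when no text is present.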

【Discussion】:

Thanks for the clear and effective solution.

【Solution 2】:

Another approach could be like below:

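from lxml import html

content = """
<div class="Price__container">
    <span class="ProductPrice" itemprop="price">$6.35</span>
    <span class="ProductPrice_original">$6.70</span>
    Shop here cheap
</div>
"""

tree = html.fromstring(content)
for data in tree.cssselect(".Price__container"):
    # Drop each child element. drop_tree() keeps the child's tail text by
    # joining it to the previous sibling's tail or to the parent's text.
    for item in list(data):  # iterate over a copy while modifying the tree
        item.drop_tree()
    print(data.text.strip())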

Output:

Shop here cheap

【Discussion】: