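UPDATE: Are you sure your problem is with this xpath? Have you confirmed that it isn't failing on an earlier or later xpath? I'm not sure exactly how to run a scrape with scrapy, so I ran the XML parsing by hand and then ran the code below against a test document and against the real document. Both work for me.
first.xml contains only the xpath and its parent structure: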
<websiteInformation>
  <MasterPage>
    <Containers>
      <xpath>.//div[@id='results-list']/div[@class='item paid-featured-item']/div[@class='listing-item']</xpath>
    </Containers>
  </MasterPage>
</websiteInformation>
And parsing first.xml:
from lxml import etree

# Parse the config file and collect the text of every <xpath> element.
doc = etree.parse('first.xml')
containers = []
containersFromXML = doc.findall('MasterPage/Containers/xpath')
for oneXpath in containersFromXML:
    print(oneXpath.text)
    containers.append(oneXpath.text)
Output:
.//div[@id='results-list']/div[@class='item paid-featured-item']/div[@class='listing-item']
That looks right.
test.html is:
<html>
  <body>
    <div id="results-list">
      <div class="item paid-featured-item">
        <div class="listing-item">Found A</div>
      </div>
      <div class="item paid-featured-item">
        <div class="listing-item">Found B</div>
      </div>
    </div>
  </body>
</html>
Then searching it:
from scrapy.selector import Selector

# Run each xpath collected from first.xml against the test document.
sel = Selector(text=open('test.html').read())
for container in containers:
    print("Xpath: {}".format(container))
    result = sel.xpath(container)
    print("Container: {}".format(len(result)))
    for elem in result:
        print(elem)
Output:
Xpath: .//div[@id='results-list']/div[@class='item paid-featured-item']/div[@class='listing-item']
Container: 2
<Selector xpath=".//div[@id='results-list']/div[@class='item paid-featured-item']/div[@class='listing-item']" data=u'<div class="listing-item">Found A</div>'>
<Selector xpath=".//div[@id='results-list']/div[@class='item paid-featured-item']/div[@class='listing-item']" data=u'<div class="listing-item">Found B</div>'>
Result of running the same search on the real URL, saved with wget:
Xpath: .//div[@id='results-list']/div[@class='item paid-featured-item']/div[@class='listing-item']
Container: 25
<Selector xpath=".//div[@id='results-list']/div[@class='item paid-featured-item']/div[@class='listing-item']" data=u'<div class="listing-item">\n \n '>
# omitted 23
<Selector xpath=".//div[@id='results-list']/div[@class='item paid-featured-item']/div[@class='listing-item']" data=u'<div class="listing-item">\n \n '>
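For anyone who wants to repeat the test-document check without installing lxml or scrapy, here is a minimal sketch that uses only the standard library. It swaps in xml.etree.ElementTree for both the XML parsing and the search. That is my substitution, not what I ran above, and it only works here because test.html happens to be well-formed XML and the xpath stays within the limited subset that ElementTree supports. CONFIG_XML and TEST_HTML simply inline the two files shown earlier.

```python
import xml.etree.ElementTree as ET

# Inlined copy of first.xml (assumption: same content as the file above).
CONFIG_XML = """<websiteInformation>
  <MasterPage>
    <Containers>
      <xpath>.//div[@id='results-list']/div[@class='item paid-featured-item']/div[@class='listing-item']</xpath>
    </Containers>
  </MasterPage>
</websiteInformation>"""

# Inlined copy of test.html; it must be well-formed XML for ElementTree.
TEST_HTML = """<html>
  <body>
    <div id="results-list">
      <div class="item paid-featured-item">
        <div class="listing-item">Found A</div>
      </div>
      <div class="item paid-featured-item">
        <div class="listing-item">Found B</div>
      </div>
    </div>
  </body>
</html>"""

# Collect the xpath strings from the config, relative to the root element.
config_root = ET.fromstring(CONFIG_XML)
containers = [node.text for node in config_root.findall("MasterPage/Containers/xpath")]

# Run each xpath against the test page and gather the matched text.
page_root = ET.fromstring(TEST_HTML)
found = []
for container in containers:
    matches = page_root.findall(container)
    print("Xpath: {}".format(container))
    print("Container: {}".format(len(matches)))
    for elem in matches:
        found.append(elem.text)
        print(elem.text)
```

If this reports two matches ("Found A" and "Found B"), the xpath itself is fine and the problem lies in how the string reaches the selector.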
It looks like your xpath string has extra single quotes (') around it that shouldn't be there. In the XML it looks like:
<xpath>''.//div[@id="results-list"]/div[@class="item paid-featured-item"]/div[@class="listing-item"]''</xpath>
which, when parsed, gives (as shown when you print it):
''.//div[@id="results-list"]/div[@class="item paid-featured-item"]/div[@class="listing-item"]''
You don't want the surrounding 's. It should be:
.//div[@id="results-list"]/div[@class="item paid-featured-item"]/div[@class="listing-item"]
If you can edit the XML file that holds the xpaths, remove the leading '' and trailing '' from each <xpath>. So:
<Containers>
  <xpath>''.//div[@id="results-list"]/div[@class="item paid-featured-item"]/div[@class="listing-item"]''</xpath>
</Containers>
should become:
<Containers>
  <xpath>.//div[@id="results-list"]/div[@class="item paid-featured-item"]/div[@class="listing-item"]</xpath>
</Containers>
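If the file has many <xpath> entries, the same edit could be scripted instead of done by hand. This is only a sketch: the function name strip_xpath_quotes is made up for illustration, it uses the standard library's xml.etree.ElementTree rather than lxml, and it assumes every element named xpath should be cleaned and that no valid xpath legitimately starts or ends with a single quote.

```python
import xml.etree.ElementTree as ET


def strip_xpath_quotes(xml_text):
    """Return xml_text with stray surrounding single quotes removed from every <xpath>."""
    root = ET.fromstring(xml_text)
    for node in root.iter("xpath"):
        if node.text:
            # Drop surrounding whitespace first, then any run of leading/trailing quotes.
            node.text = node.text.strip().strip("'")
    return ET.tostring(root, encoding="unicode")


broken = (
    "<Containers>"
    "<xpath>''.//div[@id=\"results-list\"]/div[@class=\"listing-item\"]''</xpath>"
    "</Containers>"
)
fixed = strip_xpath_quotes(broken)
print(fixed)
```

Write the returned string back to the file only after checking that it looks right, since serialising through ElementTree may change whitespace or drop comments.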
But if for some reason you can't edit the XML file, strip the surrounding 's from the xpath text after you read it. So:
containers.append(oneXpath.text)
should become:
containers.append(oneXpath.text.strip("'"))
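A slightly more defensive take on the same idea, in case some <xpath> elements are empty or padded with whitespace. The helper name clean_xpath is hypothetical, and the sketch assumes that none of your real xpaths is meant to begin or end with a single quote.

```python
def clean_xpath(text):
    """Strip whitespace and stray surrounding single quotes from an xpath string."""
    if text is None:
        # An empty <xpath/> element has text None; pass that through unchanged.
        return None
    return text.strip().strip("'").strip()


raw = "''.//div[@id=\"results-list\"]/div[@class=\"listing-item\"]''"
print(repr(raw))               # repr() makes the stray quotes easy to spot
print(repr(clean_xpath(raw)))
```

In the parsing loop this would replace the append line with containers.append(clean_xpath(oneXpath.text)).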