XPath not working as I'd expect it to

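【Question title】: XPath not working as I'd expect it to
【Posted】: 2016-12-03 17:05:53
【Question description】:

Hopefully you won't need the full set of code here, but I'm having an issue where I parse HTML with XPath and don't get the results I expect: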

# here is the current set of tags I'm interested in
 html = '''<div style="padding-top: 10px; clear: both; width: 100%;">
        <a href="http://www.amazon.com/review/R41M1I2K413NG/ref=cm_aya_cmt?ie=UTF8&ASIN=B013IZY7RU#wasThisHelpful" ><img src="http://g-ecx.images-amazon.com/images/G/01/x-locale/communities/discussion_boards/comment-sm._CB192250344_.gif" width="16" alt="Comment" hspace="3" align="absmiddle" height="16" border="0" /></a>&nbsp;<a href="http://www.amazon.com/review/R41M1I2K413NG/ref=cm_aya_cmt?ie=UTF8&ASIN=B013IZY7RU#wasThisHelpful" >Comment</a>&nbsp;|&nbsp;<a href="http://www.amazon.com/review/R41M1I2K413NG/ref=cm_cr_rdp_perm" >Permalink</a>'''

I'm trying to get the href value of the first a tag, which is a long URL. To do that, I'm using the following code:

from lxml import etree
import StringIO

parser = etree.HTMLParser(encoding="utf-8")
tree = etree.parse(StringIO.StringIO(html), parser)

style = 'padding-top: 10px; clear: both; width: 100%;'
xpath = "//div[@style='%s']" % style
xpath += "/a[1]/@href"

# use the XPath expression above to pull out the href value
tree.xpath(xpath)


['http://www.amazon.com/review/R41M1I2K413NG/ref=cm_aya_cmt?ie=UTF8&ASIN=B013IZY7RU#wasThisHelpful']

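This works when I pull out the section I'm working with and paste it in as a string. It does not behave the same way with the tree I build from a requests.get() call, and I don't know why. What that returns is:

['http://www.amazon.com/review/R41M1I2K413NG']

I have no idea why. I know I'm shooting in the dark here, but I'm hoping someone has run into this "XPath returns a truncated attribute value" problem.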

Edit:

Here is the full code I'm currently using, which does not work. It returns the truncated value shown above.

from lxml import etree
import requests
import StringIO
from requests.packages.urllib3.util.retry import Retry
from requests.adapters import HTTPAdapter


session = requests.Session()
retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])
session.mount('http://www.amazon.com', HTTPAdapter(max_retries=retries))
parser = etree.HTMLParser(encoding="utf-8")

url = "http://www.amazon.com/gp/cdp/member-reviews/ARPJ98Y7U8K5H?ie=UTF8&display=public&page=3&sort_by=MostRecentReview"
page = session.get(url, timeout=5)
tree = etree.parse(StringIO.StringIO(page.text), parser)

style = 'padding-top: 10px; clear: both; width: 100%;'
xpath = "//div[@style='%s']" % style
xpath += "/a[1]/@href"

# use the XPath expression above to pull out the href value
tree.xpath(xpath)

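Edit 2:

For some reason, this does work. Instead of creating a session object, using it to submit the get request, and then passing the result to the parser, just pass the url string straight to the parser: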

url = "http://www.amazon.com/gp/cdp/member-reviews/ARPJ98Y7U8K5H?ie=UTF8&display=public&page=3&sort_by=MostRecentReview"


tree = etree.parse(url, parser)



for e in tree.xpath("//div[@style='padding-top: 10px; clear: both; width: 100%;']/a[1]/@href"):

    print e

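As I understand it, when looping over many URLs, a session object keeps the connection open between requests, which speeds things up. If I use the etree.parse(url, parser) approach, I'm afraid I'll lose that efficiency.

【Comments】:

How can we reproduce this? Please show us the exact code that returns the truncated attribute value.
What is the URL you use when calling requests.get()?
amazon.com/gp/cdp/member-reviews/…
You're doing more work than you need to, but both code blocks work fine for me. The only way it wouldn't work is some encoding issue; when using requests, never call .text, always use .content and let requests handle the encoding.
@PadraicCunningham, thanks for the feedback. What do you mean by "more work"? I'd like this to be as lean as possible b/c I have thousands of similar URLs to scrape. As for the encoding, I'm not sure yet, but something is definitely cutting off everything in the URL from ref=.... onward (inclusive).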
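The suggestion in the comments to parse .content rather than .text could be sketched roughly as below. This is only an illustration, written for Python 3 rather than the Python 2 used in the question; the helper name first_link_hrefs and the trimmed sample markup are assumptions, not part of the original code. The thread does not confirm that this change alone fixes the truncation.

```python
from io import BytesIO

from lxml import etree

STYLE = "padding-top: 10px; clear: both; width: 100%;"
XPATH = "//div[@style='%s']/a[1]/@href" % STYLE


def first_link_hrefs(html_bytes):
    """Return the href of the first <a> child of each matching <div>.

    Hypothetical helper: it takes raw bytes (e.g. response.content)
    so that the HTML parser, not the caller, decides how to decode them.
    """
    parser = etree.HTMLParser()
    tree = etree.parse(BytesIO(html_bytes), parser)
    return [str(href) for href in tree.xpath(XPATH)]


# Trimmed-down version of the markup from the question (assumed sample).
sample = b'''<div style="padding-top: 10px; clear: both; width: 100%;">
<a href="http://www.amazon.com/review/R41M1I2K413NG/ref=cm_aya_cmt?ie=UTF8&ASIN=B013IZY7RU#wasThisHelpful">Comment</a>
<a href="http://www.amazon.com/review/R41M1I2K413NG/ref=cm_cr_rdp_perm">Permalink</a>
</div>'''

result = first_link_hrefs(sample)
print(result)
```

With the session from the question, the call would be first_link_hrefs(page.content) instead of wrapping page.text in StringIO, so the session's connection reuse is kept.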

Tags: python python-2.7 xpath python-requests lxml

【Solution 1】:

With the URL you provided, the following Python code:

url = "http://www.amazon.com/gp/cdp/member-reviews/ARPJ98Y7U8K5H?ie=UTF8&display=public&page=3&sort_by=MostRecentReview"

from lxml import etree
parser = etree.HTMLParser(encoding="utf-8")
tree = etree.parse(url, parser)

for e in tree.xpath("//div[@style='padding-top: 10px; clear: both; width: 100%;']/a[1]/@href"):

    print e

gives the following result:

> python ~/test.py 

http://www.amazon.com/review/RM8YYCQ57K2CL/ref=cm_aya_cmt/159-5911033-5890330?ie=UTF8&ASIN=B00J9PAZIO#wasThisHelpful
http://www.amazon.com/review/R41M1I2K413NG/ref=cm_aya_cmt/159-5911033-5890330?ie=UTF8&ASIN=B013IZY7RU#wasThisHelpful
http://www.amazon.com/review/R3DT6VUDGIT9SK/ref=cm_aya_cmt/159-5911033-5890330?ie=UTF8&ASIN=B000VYD0MA#wasThisHelpful
http://www.amazon.com/review/RGFW1JM4151MW/ref=cm_aya_cmt/159-5911033-5890330?ie=UTF8&ASIN=B00TQQN5G0#wasThisHelpful
http://www.amazon.com/review/R3I9FFX0MVF1BW/ref=cm_aya_cmt/159-5911033-5890330?ie=UTF8&ASIN=B0048A7NF8#wasThisHelpful
http://www.amazon.com/review/R24TTSQY34VME8/ref=cm_aya_cmt/159-5911033-5890330?ie=UTF8&ASIN=B0115ZHH68#wasThisHelpful
http://www.amazon.com/review/R3C49WWMNQZ007/ref=cm_aya_cmt/159-5911033-5890330?ie=UTF8&ASIN=B00ABAWHJ6#wasThisHelpful
http://www.amazon.com/review/R37724EHW829NB/ref=cm_aya_cmt/159-5911033-5890330?ie=UTF8&ASIN=B00TO5Y3FK#wasThisHelpful
http://www.amazon.com/review/RQKGM5FRXVYSX/ref=cm_aya_cmt/159-5911033-5890330?ie=UTF8&ASIN=B0051QUWKG#wasThisHelpful
http://www.amazon.com/review/R1DW61PMGUDMDJ/ref=cm_aya_cmt/159-5911033-5890330?ie=UTF8&ASIN=B000N8Q2P6#wasThisHelpful

Using the sample code you provided results in:

http://www.amazon.com/review/RM8YYCQ57K2CL
http://www.amazon.com/review/R41M1I2K413NG
http://www.amazon.com/review/R3DT6VUDGIT9SK
http://www.amazon.com/review/RGFW1JM4151MW
http://www.amazon.com/review/R3I9FFX0MVF1BW
http://www.amazon.com/review/R24TTSQY34VME8
http://www.amazon.com/review/R3C49WWMNQZ007
http://www.amazon.com/review/R37724EHW829NB
http://www.amazon.com/review/RQKGM5FRXVYSX
http://www.amazon.com/review/R1DW61PMGUDMDJ

This is because none of the URLs in the HTML page returned by session.get() carry any GET parameters: either the server does not return URLs with GET parameters in that case, or requests strips them.
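One way to tell these two cases apart is to look at the hrefs in the raw response text without lxml or XPath in the picture at all. The following is a minimal sketch of that check using only the Python 3 standard library; HrefCollector, hrefs_containing, and the two sample strings are hypothetical names and data made up for illustration.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def hrefs_containing(html_text, needle):
    """Return every <a href> in html_text whose value contains needle."""
    collector = HrefCollector()
    collector.feed(html_text)
    collector.close()
    return [href for href in collector.hrefs if needle in href]


# Assumed, shortened stand-ins for the two responses being compared.
full_markup = '<div><a href="http://www.amazon.com/review/R41M1I2K413NG/ref=cm_aya_cmt?ie=UTF8">Comment</a></div>'
short_markup = '<div><a href="http://www.amazon.com/review/R41M1I2K413NG">Comment</a></div>'

print(hrefs_containing(full_markup, "R41M1I2K413NG"))
print(hrefs_containing(short_markup, "R41M1I2K413NG"))
```

If the short form already shows up when this is run on page.text from session.get(), the XPath expression is not the one truncating anything; the markup that came back simply differs from what etree.parse(url, parser) fetched.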

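【Comments】:

Yes, that's exactly what I'm seeing... I'll have to take another look at it with fresh eyes. Thanks for your help.
So, when I use etree.parse(url, parser), it works. But if I first get the HTML from session.get(url) and pass the text attribute, as in etree.parse(page.text, parser), I get the incorrect result. I'd like to use session.get() b/c it helps keep the connection alive between requests.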