【Posted】: 2015-09-14 16:52:34
【Problem Description】:
I'm building a broken-link checker in Python, and building the logic to correctly identify links that won't resolve when visited in a browser is turning into a chore. I've found a set of links for which my scraper consistently reproduces a redirect error, even though they resolve perfectly when visited in a browser. I'm hoping someone here can offer some insight.
import gzip
import urllib.error
import urllib.request

url = 'http://forums.hostgator.com/want-see-your-sites-dns-propagating-t48838.html'  # example that triggers the error

# Browser-like headers so the server treats the request as an ordinary visit.
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux i686; G518Rco3Yp0uLV40Lcc9hAzC1BOROTJADjicLjOmlr4=) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive',
}

try:
    req = urllib.request.Request(url, None, headers)
    response = urllib.request.urlopen(req)
    raw = response.read()
    # urllib does not decompress responses, so honor Content-Encoding ourselves.
    if response.info().get('Content-Encoding') == 'gzip':
        raw = gzip.decompress(raw)
    raw_response = raw.decode('utf8', errors='ignore')
    response.close()
except urllib.error.HTTPError as inst:
    print(format(inst))
An example of a URL that reliably returns this error is "http://forums.hostgator.com/want-see-your-sites-dns-propagating-t48838.html". It resolves perfectly when visited in a browser, but the code above returns the following error:
HTTP Error 301: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Moved Permanently
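For diagnosis, something like the following sketch should print each hop in the redirect chain. It uses the requests library, which, unlike a bare urlopen, also carries cookies from one hop to the next; whether it actually resolves for this site is untested here:

import requests
from requests.exceptions import TooManyRedirects

url = 'http://forums.hostgator.com/want-see-your-sites-dns-propagating-t48838.html'
try:
    # requests follows redirects, keeps cookies between hops, and records history
    r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=10)
    for hop in r.history:
        print(hop.status_code, hop.url)  # each 30x hop along the way
    print(r.status_code, r.url)          # the final response
except TooManyRedirects as exc:
    print(exc)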
Any ideas on how to correctly identify such links as working, without blindly ignoring links from that site (which could cause genuinely broken links to be missed)?
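One possibility I'm considering, offered only as a sketch: the server may repeat the same 301 until the client sends back a cookie set on the first response, which a browser does automatically but a bare urlopen does not. The standard library's http.cookiejar can be wired into urllib like this (whether cookies are really the trigger for this particular site is an assumption):

import http.cookiejar
import urllib.error
import urllib.request

url = 'http://forums.hostgator.com/want-see-your-sites-dns-propagating-t48838.html'

# An opener that stores cookies and sends them back on later requests,
# including the follow-up requests made while chasing redirects.
cookie_jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie_jar))
opener.addheaders = [('User-Agent', 'Mozilla/5.0')]

try:
    with opener.open(url) as response:
        print(response.getcode(), response.geturl())
except urllib.error.HTTPError as inst:
    print(format(inst))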
Tags: python-3.x httprequest urllib