【Posted】: 2015-09-14 16:52:34
【Problem Description】:
I'm building a broken-link checker in Python, and building the logic to correctly identify links that won't resolve when visited in a browser is turning into a chore. I've found a set of links for which my scraper consistently reproduces a redirect error, even though they resolve perfectly when visited in a browser. I'm hoping someone here can offer some insight.
import gzip
import urllib.error
import urllib.request

url = 'http://forums.hostgator.com/want-see-your-sites-dns-propagating-t48838.html'  # example that triggers the error

# Browser-like headers so the server treats the request as an ordinary visit.
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux i686; G518Rco3Yp0uLV40Lcc9hAzC1BOROTJADjicLjOmlr4=) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive',
}

try:
    req = urllib.request.Request(url, None, headers)
    response = urllib.request.urlopen(req)
    raw = response.read()
    # urllib does not decompress responses, so honor Content-Encoding ourselves.
    if response.info().get('Content-Encoding') == 'gzip':
        raw = gzip.decompress(raw)
    raw_response = raw.decode('utf8', errors='ignore')
    response.close()
except urllib.error.HTTPError as inst:
    print(format(inst))
An example of a URL that reliably returns this error is "http://forums.hostgator.com/want-see-your-sites-dns-propagating-t48838.html". It resolves perfectly when visited in a browser, but the code above returns the following error:
HTTP Error 301: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Moved Permanently
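For diagnosis, something like the following sketch should print each hop in the redirect chain. It uses the requests library, which, unlike a bare urlopen, also carries cookies from one hop to the next; whether it actually resolves for this site is untested here:

import requests
from requests.exceptions import TooManyRedirects

url = 'http://forums.hostgator.com/want-see-your-sites-dns-propagating-t48838.html'
try:
    # requests follows redirects, keeps cookies between hops, and records history
    r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=10)
    for hop in r.history:
        print(hop.status_code, hop.url)  # each 30x hop along the way
    print(r.status_code, r.url)          # the final response
except TooManyRedirects as exc:
    print(exc)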
Any ideas on how to correctly identify such links as working, without blindly ignoring links from that site (which could cause genuinely broken links to be missed)?
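One possibility I'm considering, offered only as a sketch: the server may repeat the same 301 until the client sends back a cookie set on the first response, which a browser does automatically but a bare urlopen does not. The standard library's http.cookiejar can be wired into urllib like this (whether cookies are really the trigger for this particular site is an assumption):

import http.cookiejar
import urllib.error
import urllib.request

url = 'http://forums.hostgator.com/want-see-your-sites-dns-propagating-t48838.html'

# An opener that stores cookies and sends them back on later requests,
# including the follow-up requests made while chasing redirects.
cookie_jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie_jar))
opener.addheaders = [('User-Agent', 'Mozilla/5.0')]

try:
    with opener.open(url) as response:
        print(response.getcode(), response.geturl())
except urllib.error.HTTPError as inst:
    print(format(inst))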
Tags: python-3.x httprequest urllib