What's the best way to get an HTTP response code from a URL? (Answers)

【Question title】: What's the best way to get an HTTP response code from a URL?
【Posted】: 2009-07-16 22:27:54
【Question】:

I'm looking for a quick way to get the HTTP response code (e.g. 200, 404) from a URL. I'm not sure which library to use.

【Question comments】:

Tags: python

【Solution 1】:

Updated to use the wonderful requests library. Note that this uses a HEAD request, which should be faster than a full GET or POST request.

import requests
try:
    r = requests.head("https://stackoverflow.com")
    print(r.status_code)
    # prints the int of the status code. Find more at httpstatusrappers.com :)
except requests.ConnectionError:
    print("failed to connect")

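【Discussion】:

requests is much better than urllib2. For a link like dianping.com/promo/208721#mod=4, urllib2 gives me a 404, while requests gives me a 200, just like I get from a browser.
httpstatusrappers.com... awesome! My code is at Lil Jon status, son!
This is the best solution. Better than any of the others.
@WKPlus For the record, requests now gives a 403 for your link, even though it still works in a browser.
@Gourneau Ha! That wasn't the intent of my comment. I think that's fine; in such a case, one should try to understand why it "just works" in the browser but returns a 403 in code, when in fact the same thing is happening in both places.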

【Solution 2】:

Here's a solution that uses httplib (the Python 2 module, renamed http.client in Python 3).

import httplib

def get_status_code(host, path="/"):
    """ This function retrieves the status code of a website by requesting
        HEAD data from the host. This means that it only requests the headers.
        If the host cannot be reached or something else goes wrong, it returns
        None instead.
    """
    try:
        conn = httplib.HTTPConnection(host)
        conn.request("HEAD", path)
        return conn.getresponse().status
    except StandardError:
        return None


print get_status_code("stackoverflow.com") # prints 200
print get_status_code("stackoverflow.com", "/nonexistant") # prints 404
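
The code above is Python 2 (httplib, StandardError, the print statement). As a rough sketch of how the same idea might look on Python 3, assuming http.client as the replacement module: the `timeout` parameter and the choice of which exceptions to catch are additions for illustration, not part of the original answer.

```python
import http.client


def get_status_code(host, path="/", timeout=10):
    """Return the status code of a HEAD request, or None on failure.

    Python 3 sketch of the function above. `timeout` is an illustrative
    addition and is not part of the original answer.
    """
    conn = http.client.HTTPConnection(host, timeout=timeout)
    try:
        conn.request("HEAD", path)
        return conn.getresponse().status
    except (OSError, http.client.HTTPException):
        # OSError covers DNS failures, refused connections and timeouts.
        return None
    finally:
        conn.close()
```

As in the original, `host` is a bare host name (optionally with `:port`), not a full URL, and only plain HTTP is attempted.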

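【Discussion】:

+1 for the HEAD request: no need to retrieve the entire entity for a status check.
You really should restrict the except block to at least StandardError, though, so that you don't mistakenly catch things like KeyboardInterrupt.
I wonder whether HEAD requests are reliable. A website might not implement the HEAD method (correctly), which could result in status codes such as 404, 501, or 500. Or am I being paranoid?
How would you make this follow a 301?
@Blaise If a website doesn't allow HEAD requests, then performing a HEAD request should result in a 405 error. For example, try running curl -I http://www.amazon.com/.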

【Solution 3】:

You should use urllib2, like this:

import urllib2
for url in ["http://entrian.com/", "http://entrian.com/does-not-exist/"]:
    try:
        connection = urllib2.urlopen(url)
        print connection.getcode()
        connection.close()
    except urllib2.HTTPError, e:
        print e.getcode()

# Prints:
# 200 [from the try block]
# 404 [from the except block]

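【Discussion】:

This is not a valid solution, because urllib2 follows redirects, so you will never get any 3xx responses.
@sorin: That depends - you might well want to follow redirects. Perhaps you're asking "If I visit this URL with a browser, will it show content or give an error?" In that case, if I change http://entrian.com/ to http://entrian.com/blog in my example, the resulting 200 is correct, even though it involves a redirect to http://entrian.com/blog/ (note the trailing slash).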
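
For readers who do want to see the 3xx code itself, here is one possible sketch for Python 3's urllib.request (the successor to urllib2). It relies on a redirect handler whose `redirect_request` returns None, so the redirect is not followed and the 3xx response surfaces as an HTTPError. The names `NoRedirect` and `get_code_without_redirects` and the `timeout` parameter are made up for this illustration.

```python
import urllib.error
import urllib.request


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Redirect handler that declines to follow any redirect."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None means "do not follow"; urllib then falls back to
        # its default error handling and raises HTTPError for the 3xx.
        return None


def get_code_without_redirects(url, timeout=10):
    """Return the status code of the first response, redirects included."""
    opener = urllib.request.build_opener(NoRedirect)
    try:
        with opener.open(url, timeout=timeout) as response:
            return response.getcode()
    except urllib.error.HTTPError as e:
        # 3xx, 4xx and 5xx responses all arrive here.
        return e.code
```

Whether following redirects or reporting them is the right behavior depends on the question being asked, as the comments above point out.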

【Solution 4】:

For future readers using Python 3 or later, here is another way to find the response code.

import urllib.request

def getResponseCode(url):
    conn = urllib.request.urlopen(url)
    return conn.getcode()

【Discussion】:

This will raise an HTTPError for status codes such as 404, 500, etc.

【Solution 5】:

The urllib2.HTTPError exception does not have a getcode() method. Use the code attribute instead.

【Discussion】:

It works for me, using Python 2.6.

【Solution 6】:

Addressing @Niklas R's comment on @nickanor's answer:

from urllib.error import HTTPError
import urllib.request

def getResponseCode(url):
    try:
        conn = urllib.request.urlopen(url)
        return conn.getcode()
    except HTTPError as e:
        return e.code
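
Several other answers on this page recommend HEAD requests so that the response body is not downloaded. A possible variant of the function above that does this with urllib.request is sketched here; the function name `get_response_code_head` and the `timeout` parameter are assumptions made for the example, and redirects are still followed as usual.

```python
import urllib.error
import urllib.request


def get_response_code_head(url, timeout=10):
    """Return the HTTP status code for `url`, using a HEAD request."""
    # Request accepts a `method` argument, so no body is requested.
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as e:
        # 4xx and 5xx responses are raised as HTTPError; report their code.
        return e.code
```

Connection-level failures (DNS errors, refused connections) still raise urllib.error.URLError here, just as in the function above.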

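【Discussion】:

【Solution 7】:

Here's an httplib solution that behaves like urllib2. You can just give it a URL and it works. There's no need to split the URL into hostname and path yourself; this function already does that.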

import httplib
import re
import socket
def get_link_status(url):
  """
    Gets the HTTP status of the url or returns an error associated with it.  Always returns a string.
  """
  https=False
  url=re.sub(r'(.*)#.*$',r'\1',url)
  url=url.split('/',3)
  if len(url) > 3:
    path='/'+url[3]
  else:
    path='/'
  if url[0] == 'http:':
    port=80
  elif url[0] == 'https:':
    port=443
    https=True
  if ':' in url[2]:
    host=url[2].split(':')[0]
    port=int(url[2].split(':')[1])
  else:
    host=url[2]
  try:
    headers={'User-Agent':'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:26.0) Gecko/20100101 Firefox/26.0',
             'Host':host
             }
    if https:
      conn=httplib.HTTPSConnection(host=host,port=port,timeout=10)
    else:
      conn=httplib.HTTPConnection(host=host,port=port,timeout=10)
    conn.request(method="HEAD",url=path,headers=headers)
    response=str(conn.getresponse().status)
    conn.close()
  except socket.gaierror,e:
    response="Socket Error (%d): %s" % (e[0],e[1])
  except StandardError,e:
    if hasattr(e,'getcode'):
      response=str(e.getcode())
    elif hasattr(e, 'message') and len(e.message) > 0:
      response=str(e.message)
    elif hasattr(e, 'msg') and len(e.msg) > 0:
      response=str(e.msg)
    elif type('') == type(e):
      response=e
    else:
      response="Exception occurred without a good error message.  Manually check the URL to see the status.  If it is believed this URL is 100% good then file an issue for a potential bug."
  return response

【Discussion】:

Not sure why this was downvoted without any feedback. It works for both HTTP and HTTPS URLs, and it uses HTTP's HEAD method.

【Solution 8】:

It depends on several factors, but try testing these methods:

import requests

def url_code_status(url):
    try:
        response = requests.head(url, allow_redirects=False)
        return response.status_code
    except Exception as e:
        print(f'[ERROR]: {e}')

Or:

import http.client as httplib
import urllib.parse

def url_code_status(url):
    try:
        protocol, host, path, query, fragment = urllib.parse.urlsplit(url)
        if protocol == "http":
            conntype = httplib.HTTPConnection
        elif protocol == "https":
            conntype = httplib.HTTPSConnection
        else:
            raise ValueError("unsupported protocol: " + protocol)
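        conn = conntype(host)
        # Keep the query string, and fall back to "/" when the path is empty.
        target = (path or "/") + ("?" + query if query else "")
        conn.request("HEAD", target)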
        resp = conn.getresponse()
        conn.close()
        return resp.status
    except Exception as e:
        print(f'[ERROR]: {e}')

Benchmark results for 100 URLs:

First method: 20.90 seconds
Second method: 23.15 seconds
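
The answer does not show how these timings were taken. One simple way to run a comparison of this kind, assuming the URLs are checked one after another, might look like the sketch below; `time_checker` is a hypothetical helper, not something from the original answer.

```python
import time


def time_checker(check, urls):
    """Call check(url) for each URL in order.

    Returns a tuple (elapsed_seconds, results).
    """
    start = time.perf_counter()
    results = [check(url) for url in urls]
    elapsed = time.perf_counter() - start
    return elapsed, results
```

Passing each `url_code_status` implementation as `check` with the same list of URLs would give comparable numbers, although network latency varies enough between runs that small differences should not be over-interpreted.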

【Discussion】: