How do I capture the HTTP error code from a download.file request? Answers

【Question title】: How do I capture the HTTP error code from a download.file request?
【Posted】: 2018-12-17 21:01:41
【Question】:

This code attempts to download a page that does not exist:

url <- "https://en.wikipedia.org/asdfasdfasdf"
status_code <- download.file(url, destfile = "output.html", method = "libcurl")

This produces a 404 error:

trying URL 'https://en.wikipedia.org/asdfasdfasdf'
Error in download.file(url, destfile = "output.html", method = "libcurl") : 
  cannot open URL 'https://en.wikipedia.org/asdfasdfasdf'
In addition: Warning message:
In download.file(url, destfile = "output.html", method = "libcurl") :
  cannot open URL 'https://en.wikipedia.org/asdfasdfasdf': HTTP status was '404 Not Found'

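But the status_code variable still contains 0, even though the documentation for download.file states that the return value is:

An (invisible) integer code, 0 for success and non-zero for failure. For the "wget" and "curl" methods this is the status code returned by the external program. The "internal" method can return 1, but will in most cases throw an error.

The result is the same if I use curl or wget as the download method. What am I missing here? Is calling warnings() and parsing the output the only option?

I have seen other questions about using download.file, but none (that I could find) actually retrieve the HTTP status code.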

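【Comments on the question】:

I don't know R, nor the download.file wrapper, but the underlying libcurl way to get the code is long response_code; curl_easy_getinfo(ch, CURLINFO_RESPONSE_CODE, &response_code); - check whether your download.file API somehow exposes libcurl's curl_easy_getinfo()

Tags: r http curl wget

【Solution 1】:

Probably the best option is to use the cURL library directly rather than through the download.file wrapper, which does not expose cURL's full functionality. We can do this, for example, with the RCurl package (although other packages such as httr, or system calls, can achieve the same thing). Using cURL directly gives you access to the cURL info, including the response code. For example: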

library(RCurl)
curl = getCurlHandle()
x = getURL("https://en.wikipedia.org/asdfasdfasdf", curl = curl)
write(x, 'output.html')
getCurlInfo(curl)$response.code
# [1] 404

Although the first option above is cleaner, if you really want to use download.file instead, one possible approach is to capture the warning with withCallingHandlers:

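# `url` is the variable defined in the question
try(withCallingHandlers(
  download.file(url, destfile = "output.html", method = "libcurl"),
  warning = function(w) {
    # Keep only the text after "HTTP status was " in the warning message
    my.warning <<- sub(".+HTTP status was ", "", conditionMessage(w))
  }),
  silent = TRUE)

cat(my.warning)
# '404 Not Found'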
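
The warning-capturing approach above can be wrapped so that the caller gets a numeric status code back instead of a global variable. The following is only a minimal sketch: extract_http_status and download_with_status are hypothetical helper names (not part of base R), and it assumes the warning text has the form shown in the question ("HTTP status was '404 Not Found'"), which may vary with the R version or download method.

```r
# Hypothetical helper: pull the three-digit HTTP status out of a warning
# message such as "...: HTTP status was '404 Not Found'".
# Returns NA_integer_ if the message does not contain such a status.
extract_http_status <- function(msg) {
  m <- regmatches(msg, regexpr("HTTP status was '[0-9]{3}", msg))
  if (length(m) == 0) return(NA_integer_)
  as.integer(sub("HTTP status was '", "", m, fixed = TRUE))
}

# Hypothetical wrapper around download.file: records the HTTP status seen in
# any warning, muffles that warning, and turns the error into a return value.
download_with_status <- function(url, destfile, ...) {
  status <- NA_integer_
  result <- tryCatch(
    withCallingHandlers(
      download.file(url, destfile = destfile, ...),
      warning = function(w) {
        code <- extract_http_status(conditionMessage(w))
        if (!is.na(code)) status <<- code
        invokeRestart("muffleWarning")
      }
    ),
    error = function(e) NULL
  )
  list(ok = !is.null(result) && result == 0, http_status = status)
}
```

It could then be called as res <- download_with_status(url, "output.html", method = "libcurl") and inspected through res$ok and res$http_status. On a successful download no warning is raised, so http_status stays NA in this sketch.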

【Comments】:

【Solution 2】:

If you don't mind using a different approach, you can try GET from the httr package:

url_200 <- "https://en.wikipedia.org/wiki/R_(programming_language)"
url_404 <- "https://en.wikipedia.org/asdfasdfasdf"

# OK
raw_200 <- httr::GET(url_200)
raw_200$status_code
#> [1] 200

# Not found
raw_404 <- httr::GET(url_404)
raw_404$status_code
#> [1] 404

Created on 2019-01-02 by the reprex package (v0.2.1)

【Comments】:
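
Since the original goal was to save the page to a file, the same httr call can also write the response body to disk while still exposing the status code. This is a rough sketch rather than a drop-in replacement for download.file: fetch_to_file is a hypothetical helper name, and it assumes httr is installed. Note that for an error response such as a 404, the body of the error page is still written to destfile.

```r
# Hypothetical helper (not part of httr): download `url` to `destfile`
# and report the HTTP status code of the response.
fetch_to_file <- function(url, destfile) {
  # write_disk() streams the response body to the given path
  resp <- httr::GET(url, httr::write_disk(destfile, overwrite = TRUE))
  list(
    status = httr::status_code(resp),  # integer HTTP status code
    ok = !httr::http_error(resp)       # FALSE for 4xx and 5xx responses
  )
}
```

For example, res <- fetch_to_file(url_404, "output.html") followed by res$status would give the status code for the missing page.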