[Question title]: Skip to the next iteration when download.file takes too long
[Posted]: 2018-10-11 02:59:45
[Question]:

I have been trying to skip iterations where download.file takes too long, but it is not working as expected, even though I have tried several similar answers to my problem. I set up an example below using the code I have been working with. My main problem is that some of the IDs I use to pull the .csv files (from the vec object below) have no associated .csv file, so their URLs never respond. I believe download.file keeps waiting on the URL for a response that never comes, and the loop starts taking far too long. How can I skip an ID when download.file starts taking too long?

library(stringr)
library(R.utils)    

vec=c("05231992000181","00628708000191","05816554000185", "01309949000130","07098414000144", "07299568000102", "12665438000178", "63599658000181", "12755123000111", "12376766000154",
      "11890564000163", "04401095000106", "11543768000128", "10695634000160", "34931022000197", "10422225000190",
      "09478854000152", "12682106000100", "11581441000140", "10545688000149", "10875891000183", "13095498000165",
      "10809607000170", "07976466000176", "11422211000139", "41205907000174", "08326720000153", "06910908000119",
      "04196935000227", "02323120000155", "96560701000154")


for (i in seq_along(vec)) {

  url = paste0("http://compras.dados.gov.br/licitacoes/v1/licitacoes.csv?cnpj_vencedor=", vec[i])

  # withTimeout() replaces the deprecated evalWithTimeout(); after `timeout`
  # seconds it raises an error, which the tryCatch() handler then swallows.
  tryCatch(
    expr = withTimeout(download.file(url,
                                     destfile = paste0("C:/Users/Username/Desktop/example_file/", vec[i], ".csv"),
                                     mode = "wb"),
                       timeout = 3),
    error = function(ex) cat("Timeout. Skipping.\n"))

  print(i)
}

[Comments]:

  • A better idea than a timeout is to use httr, so you can check the HTTP status; you will almost certainly get a failure status for the files that don't exist.
  • Which function? I have already tried http_error, and it does not seem to be sufficient...
  • Depends on what you want to do. I would probably use GET instead of download.file, then just pull out the status part of the response and, if it is good, call content. warn_for_status may do the same thing, although I haven't used it before. I'm not quite sure why you are trying to throw errors, though; they stop your code, which is not what you want.
  • I don't think that works; from the look of it, GET still takes a long time to get the status.
  • Well, clearly something is wrong with the server. You can set a timeout on the call and still catch errors, e.g. lapply(paste0('http://compras.dados.gov.br/licitacoes/v1/licitacoes.csv?cnpj_vencedor=', c("01309949000130", "07098414000144")), function(x) tryCatch(httr::GET(x, httr::timeout(3)), error = function(e) NULL))
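The one-liner in the last comment can be unpacked into a small helper; a sketch (the function name fetch_or_null is mine, not from the thread):

```r
library(httr)

# Attempt a GET with a 3-second timeout; return NULL on timeout or any
# other connection error instead of aborting the whole loop.
fetch_or_null <- function(url) {
  tryCatch(GET(url, timeout(3)), error = function(e) NULL)
}

urls <- paste0(
  "http://compras.dados.gov.br/licitacoes/v1/licitacoes.csv?cnpj_vencedor=",
  c("01309949000130", "07098414000144")
)

# NULL entries mark the IDs that timed out or otherwise failed.
responses <- lapply(urls, fetch_or_null)
```

The NULLs can then be filtered out before any parsing or writing step.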

[Tags]: r loops time


[Solution 1]:

Checking the HTTP status is an effective way to handle this when the server responds, but when it does not respond at all you can set a timeout with httr::timeout, passed to httr::GET. Keeping everything in list columns of a tidy data frame via the tidyverse:

library(dplyr)
library(purrr)

base_url <- "http://compras.dados.gov.br/licitacoes/v1/licitacoes.csv?cnpj_vencedor="

# data_frame() is deprecated; tibble() (re-exported by dplyr) is the current name
df <- tibble(cnpj_vencedor = c(
  "05231992000181", "00628708000191", "05816554000185", "01309949000130",
  "07098414000144", "07299568000102", "12665438000178", "63599658000181",
  "12755123000111", "12376766000154", "11890564000163", "04401095000106",
  "11543768000128", "10695634000160", "34931022000197", "10422225000190",
  "09478854000152", "12682106000100", "11581441000140", "10545688000149",
  "10875891000183", "13095498000165", "10809607000170", "07976466000176",
  "11422211000139", "41205907000174", "08326720000153", "06910908000119",
  "04196935000227", "02323120000155", "96560701000154"))

df <- df %>% 
    # iterate GET over URLs, modified by `purrr::safely` to return a list of 
    # the result and the error (NULL where appropriate), with timeout set
    mutate(response = map(paste0(base_url, cnpj_vencedor), 
                          safely(httr::GET), httr::timeout(3)))

df <- df %>% 
    # extract the result element (drop the errors)
    mutate(response = map(response, "result"),
           # where there is a response, parse its body
           data = map_if(response, negate(is.null), httr::content))

df
#> # A tibble: 31 x 3
#>    cnpj_vencedor  response       data              
#>    <chr>          <list>         <list>            
#>  1 05231992000181 <S3: response> <tibble [49 × 18]>
#>  2 00628708000191 <S3: response> <NULL>            
#>  3 05816554000185 <S3: response> <tibble [1 × 18]> 
#>  4 01309949000130 <S3: response> <NULL>            
#>  5 07098414000144 <NULL>         <NULL>            
#>  6 07299568000102 <NULL>         <NULL>            
#>  7 12665438000178 <NULL>         <NULL>            
#>  8 63599658000181 <NULL>         <NULL>            
#>  9 12755123000111 <NULL>         <NULL>            
#> 10 12376766000154 <NULL>         <NULL>            
#> # ... with 21 more rows
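To return to the original goal of one .csv file per ID, the non-NULL tibbles in the data column can be written out afterwards; a sketch (the helper name write_results and the default output directory are my choices, not from the answer):

```r
library(purrr)
library(readr)

# Write each successfully parsed tibble to "<cnpj>.csv" in `dir`,
# skipping the NULLs left behind by timeouts and empty responses.
write_results <- function(df, dir = ".") {
  walk2(df$data, df$cnpj_vencedor, function(d, id) {
    if (!is.null(d)) write_csv(d, file.path(dir, paste0(id, ".csv")))
  })
}
```

Calling write_results(df, "C:/Users/Username/Desktop/example_file") on the df built above recreates the per-ID files the question's loop was after, without ever blocking on a dead URL.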

[Discussion]:
