【Question title】: R - curl - download remote file only when changed
【Posted】: 2017-01-16 05:49:13
【Question description】:
【Discussion】:
Tags:
r
curl
timestamp
handle
httr
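【Solution 1】:
You have to keep a history of each file's last-modified date (assuming the web server reports that date consistently) and check it with httr::HEAD() before downloading. In other words, you have some work to do to store the last-modified value somewhere, perhaps in a data frame alongside the URL: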
library(httr)
URL <- "http://www.pcr.uu.se/digitalAssets/124/124932_1ucdponesided2015.rdata"
#' Download a file only if it has changed since \code{last_modified}
#'
#' @param URL url of file
#' @param fil path to write file
#' @param last_modified \code{POSIXct}. Ideally, the output from the first
#' successful run of \code{get_file()}
#' @param overwrite overwrite the file if it exists?
#' @param .verbose output a message if the file was unchanged?
get_file <- function(URL, fil, last_modified=NULL, overwrite=TRUE, .verbose=TRUE) {
  if ((!file.exists(fil)) || is.null(last_modified)) {
    res <- GET(URL, write_disk(fil, overwrite))
    return(httr::parse_http_date(res$headers$`last-modified`))
  } else if (inherits(last_modified, "POSIXct")) {
    res <- HEAD(URL)
    cur_last_mod <- httr::parse_http_date(res$headers$`last-modified`)
    if (cur_last_mod != last_modified) {
      res <- GET(URL, write_disk(fil, overwrite))
      return(httr::parse_http_date(res$headers$`last-modified`))
    }
    if (.verbose) message(sprintf("'%s' unchanged since %s", URL, last_modified))
    return(last_modified)
  }
}
# first run == you don't know the last-modified date.
# you need to pair this with the URL in some data structure for later use.
last_mod <- get_file(URL, basename(URL))
class(last_mod)
## [1] "POSIXct" "POSIXt"
last_mod
## [1] "2015-11-16 17:34:06 GMT"
last_mod <- get_file(URL, basename(URL), last_mod)
#> 'http://www.pcr.uu.se/digitalAssets/124/124932_1ucdponesided2015.rdata' unchanged since 2015-11-16 17:34:06
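The answer leaves open how to pair each URL with its last-modified value. One possible sketch keeps a small data frame with one row per URL. The helper names new_registry(), lookup_last_mod() and record_last_mod() are hypothetical and not part of httr or the answer above:

```r
# Hypothetical helpers (not from httr): one row per URL with its last-modified time.
new_registry <- function() {
  data.frame(
    url = character(0),
    last_modified = as.POSIXct(numeric(0), origin = "1970-01-01", tz = "GMT"),
    stringsAsFactors = FALSE
  )
}

# Return the stored POSIXct for `url`, or NULL if the URL has not been seen yet.
lookup_last_mod <- function(reg, url) {
  i <- match(url, reg$url)
  if (is.na(i)) NULL else reg$last_modified[i]
}

# Insert a new row for `url`, or update the existing one; returns the registry.
record_last_mod <- function(reg, url, last_modified) {
  i <- match(url, reg$url)
  if (is.na(i)) {
    reg <- rbind(
      reg,
      data.frame(url = url, last_modified = last_modified, stringsAsFactors = FALSE)
    )
  } else {
    reg$last_modified[i] <- last_modified
  }
  reg
}

# Intended use with get_file() from above (needs network access, so left commented):
# reg <- new_registry()
# reg <- record_last_mod(reg, URL, get_file(URL, basename(URL), lookup_last_mod(reg, URL)))
```

Because lookup_last_mod() returns NULL for an unknown URL, the first call falls through to the unconditional download branch of get_file(). To keep the registry between R sessions, it could be written with saveRDS() and read back with readRDS().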
【Solution 2】:
An alternative to the httr package is the base function base::curlGetHeaders(url), but you still have to parse the last-modified date yourself!
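A minimal sketch of that parsing step follows. It assumes the server sends the usual HTTP date form such as "Mon, 16 Nov 2015 17:34:06 GMT". The function name parse_last_modified() is hypothetical. It takes the character vector of header lines that curlGetHeaders() returns and avoids strptime() month names so that it does not depend on the session locale:

```r
# Hypothetical helper: extract Last-Modified from curlGetHeaders()-style output.
parse_last_modified <- function(headers) {
  # Header names are case-insensitive, so match without regard to case.
  hit <- grep("^last-modified:", headers, ignore.case = TRUE, value = TRUE)
  if (length(hit) == 0) return(NULL)
  # After redirects there can be several header blocks; use the last match.
  value <- trimws(sub("^[^:]+:", "", hit[length(hit)]))
  # Pull out day, month abbreviation, year, and time components.
  pattern <- "([0-9]{1,2}) ([A-Za-z]{3}) ([0-9]{4}) ([0-9]{2}):([0-9]{2}):([0-9]{2})"
  m <- regmatches(value, regexec(pattern, value))[[1]]
  if (length(m) == 0) return(NULL)
  # month.abb holds the English abbreviations regardless of locale.
  ISOdatetime(
    year = as.integer(m[4]), month = match(m[3], month.abb), day = as.integer(m[2]),
    hour = as.integer(m[5]), min = as.integer(m[6]), sec = as.integer(m[7]),
    tz = "GMT"
  )
}

# Intended use (needs network access, so left commented):
# last_mod <- parse_last_modified(curlGetHeaders(URL))
```

The function returns NULL when no Last-Modified header is present or when the value is not in the assumed format, so the caller can decide whether to download unconditionally in that case.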