Recursively ftp download, then extract gz files: answers

【Question title】: Recursively ftp download, then extract gz files
【Posted】: 2011-03-08 01:49:10
【Question description】:

I want to run a multi-step file download process from within R. I have the middle step, but not the first and third...

# STEP 1 Recursively find all the files at an ftp site 
# ftp://prism.oregonstate.edu//pub/prism/pacisl/grids
all_paths <- #### a recursive listing of the ftp path contents??? ####

# STEP 2 Choose all the ones whose filename starts with "hi"
all_files <- sapply(sapply(strsplit(all_paths, "/"), rev), "[", 1)
hawaii_log <- substr(all_files, 1, 2) == "hi"
hi_paths <- all_paths[hawaii_log]
hi_files <- all_files[hawaii_log]

# STEP 3 Download & extract from gz format into a single directory
mapply(download.file, url = hi_paths, destfile = hi_files)
## and now how to extract from gz format?

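【Question comments】:

Does it have to be R? R is passable with HTTP at best, and it is not great with FTP. A more general-purpose language, such as python, would be better suited to this kind of problem.
Yes, I am trying to avoid adding any external tools... For now I have a workaround that calls command-line wget from R, but I would like to be able to hand this to someone as a standalone R script.
It would be easy to copy and paste the file names as text and use download.file in a loop, so it would be hard-coded for your users but still standalone (or you could ftp into the site and mget . . .)
You can use dir(pattern = "^hi.+$", ignore.case = TRUE) to get all the "hi" files.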
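
The same pattern can also be applied to the remote paths themselves, before anything is downloaded. A minimal sketch, assuming a character vector of full URLs; the example paths are made up for illustration and are not taken from the real server:

```r
# Hypothetical listing; real paths would come from the recursive FTP listing.
all_paths <- c(
  "ftp://example.org/grids/ppt/hi_ppt_01.gz",
  "ftp://example.org/grids/ppt/pr_ppt_01.gz",
  "ftp://example.org/grids/tmax/HI_tmax_01.gz"
)

# basename() keeps only the part after the last "/".
all_files <- basename(all_paths)

# Same regular expression as in the dir() call, applied to the remote names.
is_hi <- grepl("^hi.+$", all_files, ignore.case = TRUE)

hi_paths <- all_paths[is_hi]
hi_files <- all_files[is_hi]
print(hi_files)
```

This replaces the strsplit/rev/substr combination from step 2 of the question with basename() and grepl(), which should select the same files as long as the names of interest start with "hi".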

Tags: r

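【Solution 1】:

For part 1, RCurl may be helpful. The getURL function retrieves one or more URLs; dirlistonly lists the contents of a directory without retrieving the files. The rest of the function builds the next level of URLs.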

library(RCurl)
getContent <- function(dirs) {
    urls <- paste(dirs, "/", sep="")
    fls <- strsplit(getURL(urls, dirlistonly=TRUE), "\r?\n")
    ok <- sapply(fls, length) > 0
    unlist(mapply(paste, urls[ok], fls[ok], sep="", SIMPLIFY=FALSE),
           use.names=FALSE)
}

So starting with

dirs <- "ftp://prism.oregonstate.edu//pub/prism/pacisl/grids"

we can call this function and look for things that look like directories, continuing until there are none left

fls <- character()
while (length(dirs)) {
    message(length(dirs))
    urls <- getContent(dirs)
    isgz <- grepl("gz$", urls)
    fls <- append(fls, urls[isgz])
    dirs <- urls[!isgz]
}

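Then we could use getURL again, this time on fls (or on the elements of fls, in a loop), to retrieve the actual files. Or perhaps better, open a url connection and use gzcon to decompress and process each file on the fly. Along the lines of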

con <- gzcon(url(fls[1], "r"))
meta <- readLines(con, 7)
data <- scan(con, integer())
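
If the goal is to keep decompressed copies on disk, as in step 3 of the question, base R can do that without extra packages as well. A minimal sketch, assuming the .gz files have already been downloaded in binary mode; gunzip_to is a hypothetical helper name, and the demo file is generated locally rather than fetched from the server:

```r
# Hypothetical helper: decompress a local .gz file into dest_dir.
gunzip_to <- function(gz_path, dest_dir = ".") {
  out_path <- file.path(dest_dir, sub("\\.gz$", "", basename(gz_path)))
  inn <- gzfile(gz_path, "rb")          # reads and decompresses
  on.exit(close(inn), add = TRUE)
  out <- file(out_path, "wb")           # plain binary output
  on.exit(close(out), add = TRUE)
  repeat {
    chunk <- readBin(inn, what = "raw", n = 1e6)
    if (length(chunk) == 0) break       # end of compressed stream
    writeBin(chunk, out)
  }
  invisible(out_path)
}

# Self-contained demo: write a small gzip file, then extract it.
tmp <- tempfile(fileext = ".txt.gz")
con <- gzfile(tmp, "w")
writeLines(c("ncols 3", "nrows 2"), con)
close(con)

extracted <- gunzip_to(tmp, tempdir())
print(readLines(extracted))
```

Applied to the question, this could be called on each downloaded file, for example with lapply(hi_files, gunzip_to, dest_dir = tempdir()). Copying in chunks keeps memory use bounded for large grids.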

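【Comments】:

This doesn't work for me: I get a 1, then a 5, and then Error in dots[[1L]][[1L]] : subscript out of bounds. I tried stepping through it: the first fls assignment leaves a \r at the end of each URL, which then does not seem to be a valid directory. Interestingly, dirlistonly does not appear on the getURL() help page.
I guess on Windows the strsplit pattern should be "\r\n*". RCurl depends on system libraries, and the specific options available depend on the version of the library installed. See listCurlOptions(); on Linux / MacOS you can use man curl_easy_setopt; not sure about Windows.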
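
The line-ending problem described in these comments can be reproduced without a network connection. A minimal sketch, using made-up listing strings rather than real server output, showing why the "\r?\n" pattern in getContent copes with both conventions; split_listing is a hypothetical helper name:

```r
# Simulated directory listings as they might come back from a server.
unix_listing    <- "dem\nppt\ntmax\n"
windows_listing <- "dem\r\nppt\r\ntmax\r\n"

# Splitting on "\n" alone leaves a trailing "\r" on each CRLF entry,
# which then ends up inside the next-level URL.
naive <- strsplit(windows_listing, "\n")[[1]]
print(naive)

# "\r?\n" treats an optional carriage return as part of the separator.
split_listing <- function(x) strsplit(x, "\r?\n")[[1]]
print(split_listing(unix_listing))
print(split_listing(windows_listing))
```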

【Solution 2】:

If I start R with the internet2 option, I can read the contents of the ftp page, i.e.

C:\Program Files\R\R-2.12\bin\x64\Rgui.exe --internet2

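(The shortcut that starts R on Windows can be modified to add the internet2 argument: right-click /Properties /Target, or just run it from the command line; on GNU/Linux this is obvious.)

The text on that page can then be read like this: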

 download.file("ftp://prism.oregonstate.edu//pub/prism/pacisl/grids", "f.txt")
 txt <- readLines("f.txt")

More work is needed to parse the directory listing and then read the directories recursively to get at the underlying files.

## (something like)
dirlines <- txt[grep("Directory <A HREF=", txt)]

## split and extract text after "grids/"
split1 <- sapply(strsplit(dirlines, "grids/"), function(x) rev(x)[1])

## split and extract remaining text after "/"
sapply(strsplit(split1, "/"), function(x) x[1])
[1] "dem"    "ppt"    "tdmean" "tmax"   "tmin"

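It's at this point that the approach stops looking attractive and gets a bit laborious, so I would actually recommend a different option. No doubt there is a better solution, probably using RCurl, and I would recommend that you and your users learn to use an ftp client. Command-line ftp, anonymous login, and mget all work quite easily.

The internet2 option is explained here for a similar ftp site: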

https://stat.ethz.ch/pipermail/r-help/2009-January/184647.html

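【Comments】:

The first part is good to know. Besides the startup option, there is also setInternet2(TRUE). I guess a recursive function over the subdirectories is one way to go from there, but at least now I can get the text from the page.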

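【Solution 3】:

library(RCurl)  # provides getURL()

# Placeholders: fill in both roots, each ending in "/"
ftp.root <- "<where the files are>"
dropbox.root <- "<where to put the files>"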

#=====================================================================
#   Function that downloads files from URL
#=====================================================================

fdownload <- function(sourcelink) { 

  # local directory that mirrors sourcelink under dropbox.root
  targetlink <- paste(dropbox.root,
                      substr(sourcelink, nchar(ftp.root) + 1, nchar(sourcelink)),
                      sep = '')

  # list of contents
  filenames <- getURL(sourcelink, ftp.use.epsv = FALSE, dirlistonly = TRUE)
  filenames <- strsplit(filenames, "\r?\n")  # tolerate CRLF line endings
  filenames <- unlist(filenames)

  files <- filenames[grep('\\.', filenames)]  
  dirs <- setdiff(filenames, files)
  if (length(dirs) != 0) {
    dirs <- paste(sourcelink, dirs, '/', sep = '')
  }  

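  # files
  # create the local directory first; download.file does not create it
  dir.create(targetlink, recursive = TRUE, showWarnings = FALSE)
  for (filename in files) {

    sourcefile <- paste(sourcelink, filename, sep = '')
    targetfile <- paste(targetlink, filename, sep = '')

    # mode = "wb" keeps binary files such as .gz intact on Windows
    download.file(sourcefile, targetfile, mode = "wb")
  }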

  # subfolders
  for (dirname in dirs) {

    fdownload(dirname)
  }
}

【Comments】:

Call the function with: fdownload(ftp.root)
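
How fdownload maps a remote URL onto a local path can be checked in isolation, without touching the network. A minimal sketch, assuming both roots end in "/"; the two root values and the helper name target_for are hypothetical and only serve the illustration:

```r
# Hypothetical roots, for illustration only.
ftp.root     <- "ftp://example.org/pub/grids/"
dropbox.root <- "C:/data/grids/"

# Same arithmetic as the first statement of fdownload():
# drop the ftp.root prefix and prepend the local root.
target_for <- function(sourcelink) {
  paste(dropbox.root,
        substr(sourcelink, nchar(ftp.root) + 1, nchar(sourcelink)),
        sep = "")
}

print(target_for("ftp://example.org/pub/grids/"))
print(target_for("ftp://example.org/pub/grids/ppt/1971/"))
```

Because the pieces are joined with plain paste(), a missing trailing "/" on either root would silently glue directory and file names together, and the local directory returned here has to exist before download.file can write into it.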