Using R to download a zipped data file, extract it, and import the data: answers

【Question title】: Using R to download zipped data file, extract, and import data
【Posted】: 2011-03-04 11:42:36
【Question description】:

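@EZGraphs writes on Twitter: "Lots of online csvs are zipped. Is there a way to download, unzip the archive, and load the data to a data.frame using R? #Rstats"

I was also trying to do this today, but ended up just downloading the zip file manually.

I tried something like: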

fileName <- "http://www.newcl.org/data/zipfiles/a1.zip"
con1 <- unz(fileName, filename="a1.dat", open = "r")

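but I feel as though I'm a long way off. Any thoughts?

【Question comments】:

Did it work? If so, why do you feel you're a long way off?
@Frustrated... No. The code in my question doesn't work. See the answers below.

Tags: r connection zip r-faq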

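【Solution 1】:

Zip archives are actually more of a "filesystem" with content metadata and so on. See help(unzip) for details. So to do what you sketch out above you need to

Create a temporary file name (e.g. tempfile())
Use download.file() to fetch the archive into the temporary file
Use unz() to extract the target file from the temporary file
Remove the temporary file via unlink()

which in code (thanks for the basic example, but this is simpler) looks like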

temp <- tempfile()
download.file("http://www.newcl.org/data/zipfiles/a1.zip",temp)
data <- read.table(unz(temp, "a1.dat"))
unlink(temp)

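Compressed (.z), gzipped (.gz), or bzip2ed (.bz2) files are just the file itself, and those you can read directly from a connection. So get the data provider to use that instead :)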
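To illustrate the point about gzipped files, here is a minimal sketch that needs no network access: it writes a small gzip-compressed CSV to a temporary file and reads it straight back through a gzfile() connection. The file name and the data are made up for the example.

```r
# Write a tiny data frame to a gzip-compressed CSV (made-up example data).
gz_path <- tempfile(fileext = ".csv.gz")
con <- gzfile(gz_path, open = "wt")
write.csv(data.frame(id = 1:3, value = c(2.5, 3.5, 4.5)), con, row.names = FALSE)
close(con)

# read.csv() opens the gzfile connection, decompresses while reading,
# and closes the connection again when it returns.
gz_data <- read.csv(gzfile(gz_path))

# Remove the temporary file.
unlink(gz_path)
```

No separate extraction step is needed here, which is the contrast with zip archives that the answer is drawing.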

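【Comments】:

Dirk, would you mind expanding on how to extract data from a .z archive? I can read from a url connection with readBin(url(x, "rb"), 'raw', 99999999), but how would I go about extracting the contained data? The uncompress package has been removed from CRAN - is this possible in base R (and if so, is it restricted to *nix systems)? Happy to post as a new question if appropriate.
See help(gzfile) - I was thinking that the gzip protocol can now also decompress (stone-old) .z files, since the patent has expired. It may not. Who uses .z anyway? The 1980s called, they want their compression back ;-)
Thanks - I couldn't get it to work, so perhaps it isn't supported after all. Unfortunately, the Australian Bureau of Meteorology provides some of its data in .z format!
FYI, it does not work with readRDS() (at least for me). As far as I can tell, the file needs to be in a format that you can read with read.table().
You also need to close the connection. R can only have 125 open at a time. Something like con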
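The point about closing connections could be handled along these lines. This is only a sketch: read_zip_member() is a hypothetical helper name, and it assumes the archive has already been downloaded to zip_path.

```r
# Hypothetical helper: read one member of a local zip archive with
# read.table(), making sure the connection is closed afterwards.
read_zip_member <- function(zip_path, member, ...) {
  con <- unz(zip_path, member)
  # Registered before open() so the connection is released even if
  # opening fails (for example, when the member does not exist).
  on.exit(close(con), add = TRUE)
  open(con, "r")
  read.table(con, ...)
}

# Usage, with the archive from the answer above already saved to `temp`:
# data <- read_zip_member(temp, "a1.dat")
```

Registering close() with on.exit() means the connection is released whether or not the read succeeds.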

【Solution 2】:

Just for the record, I tried translating Dirk's answer into code :-P

temp <- tempfile()
download.file("http://www.newcl.org/data/zipfiles/a1.zip",temp)
con <- unz(temp, "a1.dat")
data <- matrix(scan(con),ncol=4,byrow=TRUE)
unlink(temp)

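【Comments】:

Don't use scan(); you can use read.table() and friends directly on the connection. See my edited answer.

【Solution 3】:

I used the CRAN package "downloader", found at http://cran.r-project.org/web/packages/downloader/index.html. Much easier.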

library(downloader)
# url holds the address of the zip file to fetch
download(url, dest="dataset.zip", mode="wb")
unzip("dataset.zip", exdir = "./")

【Comments】:

I just use utils::unzip; there is no need for the downloader package.
As of 2019, I had to write exdir='.'
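Following up on the comment about utils::unzip, a base-R-only version might look like the sketch below. fetch_and_unzip() is a hypothetical name, and the URL is whatever the caller supplies; nothing here comes from the downloader package.

```r
# Hypothetical helper: download a zip archive and extract it with base R only.
fetch_and_unzip <- function(url, exdir = ".") {
  zip_path <- tempfile(fileext = ".zip")
  # Remove the downloaded archive once extraction has finished (or failed).
  on.exit(unlink(zip_path), add = TRUE)
  # mode = "wb" requests a binary transfer, which matters on Windows.
  utils::download.file(url, destfile = zip_path, mode = "wb", quiet = TRUE)
  # unzip() returns the paths of the extracted files.
  utils::unzip(zip_path, exdir = exdir)
}

# Usage (the URL is the one from the question):
# files <- fetch_and_unzip("http://www.newcl.org/data/zipfiles/a1.zip", exdir = ".")
```

The default exdir = "." follows the comment above about what worked in 2019.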

【Solution 4】:

For Mac (and, I assume, Linux)...

If the zip archive contains a single file, you can use the command-line tool funzip in conjunction with fread from the data.table package:

library(data.table)
dt <- fread("curl http://www.newcl.org/data/zipfiles/a1.zip | funzip")

If the archive contains multiple files, you can use tar instead to extract a specific file to standard output:

dt <- fread("curl http://www.newcl.org/data/zipfiles/a1.zip | tar -xf- --to-stdout *a1.dat")
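Not every tar can read zip archives, so on such systems a possible alternative is Info-ZIP's unzip -p, which writes a named member to standard output. As far as I know it cannot read the archive itself from a pipe, so this sketch assumes the zip has already been downloaded to a local file. read_member_via_unzip() is a hypothetical name, and base R's pipe() is used here in place of fread.

```r
# Hypothetical helper: stream one member of a local zip archive through
# the external `unzip` command (assumed to be on the PATH).
read_member_via_unzip <- function(zip_path, member, ...) {
  cmd <- paste("unzip -p", shQuote(zip_path), shQuote(member))
  # read.table() opens the pipe and closes it again when it returns.
  read.table(pipe(cmd), ...)
}

# Usage:
# temp <- tempfile(fileext = ".zip")
# download.file("http://www.newcl.org/data/zipfiles/a1.zip", temp, mode = "wb")
# dt <- read_member_via_unzip(temp, "a1.dat")
# unlink(temp)
```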

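【Comments】:

When I try your solution for multiple files, I get the error File is empty:

【Solution 5】:

Here is an example that works for files that cannot be read with the read.table function. This example reads an .xls file.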

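library(readxl)  # provides read_xls()

url <- "https://www1.toronto.ca/City_Of_Toronto/Information_Technology/Open_Data/Data_Sets/Assets/Files/fire_stns.zip"

temp <- tempfile()   # path for the downloaded archive
temp2 <- tempfile()  # directory the archive is extracted into

download.file(url, temp, mode = "wb")
unzip(zipfile = temp, exdir = temp2)
data <- read_xls(file.path(temp2, "fire station x_y.xls"))

# temp2 is a directory, so recursive = TRUE is needed to remove it
unlink(c(temp, temp2), recursive = TRUE)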

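【Comments】:

【Solution 6】:

To do this with data.table, I found that the following works. Unfortunately, the original link no longer works, so I used a link to another data set.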

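library(data.table)
temp <- tempfile()
download.file("https://www.bls.gov/tus/special.requests/atusact_0315.zip", temp, mode = "wb")
# unzip() returns the path of the extracted file, which fread() then reads
timeUse <- fread(unzip(temp, files = "atusact_0315.dat", exdir = tempdir()))
unlink(temp)  # delete the downloaded archive (rm() would only remove the variable)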

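I know this can be done in one line, since you can pass a shell command to fread, but I am not sure how to download a .zip file, extract it, and pass a single file from it to fread.

【Comments】:

【Solution 7】:

Try this code. It works for me: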

unzip(zipfile="<directory and filename>",
      exdir="<directory where the content will be extracted>")

Example:

unzip(zipfile="./data/Data.zip",exdir="./data")
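If you do not know in advance what the archive contains, unzip() can also list the members without extracting anything, and extract only the ones you ask for. A sketch, assuming a local zip file; extract_matching() is a hypothetical name:

```r
# Hypothetical helper: extract only the archive members whose names match
# a regular expression, and return the paths of the extracted files.
extract_matching <- function(zip_path, pattern, exdir = tempfile("unzipped_")) {
  # list = TRUE returns a data frame with columns Name, Length and Date
  listing <- utils::unzip(zip_path, list = TRUE)
  wanted <- grep(pattern, listing$Name, value = TRUE)
  if (length(wanted) == 0L) {
    stop("no archive member matches: ", pattern)
  }
  # exdir is created by unzip() if it does not exist yet
  utils::unzip(zip_path, files = wanted, exdir = exdir)
}

# Usage:
# csv_files <- extract_matching("./data/Data.zip", "\\.csv$", exdir = "./data")
```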

【Comments】:

【Solution 8】:

I found that the following worked for me. These steps come from BTD's YouTube video, Managing Zipfile's in R:

zip.url <- "url_address.zip"

dir <- getwd()

zip.file <- "file_name.zip"

zip.combine <- as.character(paste(dir, zip.file, sep = "/"))

download.file(zip.url, destfile = zip.combine)

unzip(zip.file)

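【Comments】:

【Solution 9】:

The rio package is well suited to this: it uses the file name's extension to determine what kind of file it is, so it works with a large variety of file types. I also used unzip() to list the file names within the zip file, so there is no need to specify the file name(s) manually.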

library(rio)

# use the R session's temporary directory
td <- tempdir()

# create a temporary file
tf <- tempfile(tmpdir=td, fileext=".zip")

# download file from internet into temporary location
download.file("http://download.companieshouse.gov.uk/BasicCompanyData-part1.zip", tf)

# list zip archive
file_names <- unzip(tf, list=TRUE)

# extract files from zip file
unzip(tf, exdir=td, overwrite=TRUE)

# use when zip file has only one file
data <- import(file.path(td, file_names$Name[1]))

# use when zip file has multiple files
data_multiple <- lapply(file_names$Name, function(x) import(file.path(td, x)))

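# delete the downloaded archive and the extracted files
# (td is the session's temporary directory, so it is left in place)
unlink(c(tf, file.path(td, file_names$Name)))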
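A variation on the same idea, using only base R and a dedicated extraction directory that can safely be removed as a whole. This is a sketch under stated assumptions: read_all_csv_from_zip() is a hypothetical name, it expects a local zip file, and it only looks at members ending in .csv (rio is not used here).

```r
# Hypothetical helper: read every .csv member of a local zip archive into a
# named list of data frames, cleaning up the extracted files afterwards.
read_all_csv_from_zip <- function(zip_path, ...) {
  exdir <- tempfile("unzipped_")
  dir.create(exdir)
  # recursive = TRUE is required because exdir is a directory
  on.exit(unlink(exdir, recursive = TRUE), add = TRUE)
  paths <- utils::unzip(zip_path, exdir = exdir)
  csv_paths <- paths[grepl("\\.csv$", paths, ignore.case = TRUE)]
  out <- lapply(csv_paths, utils::read.csv, ...)
  names(out) <- basename(csv_paths)
  out
}

# Usage, with `tf` downloaded as in the answer above:
# tables <- read_all_csv_from_zip(tf)
```

Extracting into its own subdirectory avoids having to delete files one by one from the session's shared temporary directory.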

【Comments】: