【Question title】: R download file redirect error
【Posted】: 2017-12-02 00:07:28
【Question】:

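Hi, I'm trying to use R to download PDF files through the ProPublica Nonprofit Explorer API: https://projects.propublica.org/nonprofits/api

When I query the API, it returns links to the PDFs. However, these links redirect to AWS, for example https://projects.propublica.org/nonprofits/download-filing?path=2015_06_T%2F13-1624100_990T_201406.pdf

I have tried specifying method = "curl", extra='-L', as suggested in the discussion "R download file redirect". This returns status 127.

I have also tried the "downloader" package from CRAN. It does download a file, but the file appears to be corrupted somehow: when I try to open it, Adobe reports "Out of memory".

Does anyone have any suggestions?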

    Tags: r pdf curl httr


    【Solution 1】:

    Just use httr (you should also be using it to access the API). write_disk() is your friend here:

    library(httr)
    
    pp_doc_url <- "https://projects.propublica.org/nonprofits/download-filing?path=2015_06_T%2F13-1624100_990T_201406.pdf"
    
    GET(
      url = pp_doc_url,
      write_disk("file.pdf"),
      verbose()
    ) -> res
    

    Here is the verbose output showing the redirect:

    ## -> GET /nonprofits/download-filing?path=2015_06_T%2F13-1624100_990T_201406.pdf HTTP/1.1
    ## -> Host: projects.propublica.org
    ## -> User-Agent: libcurl/7.54.0 r-curl/3.0 httr/1.3.1
    ## -> Accept-Encoding: gzip, deflate
    ## -> Accept: application/json, text/xml, application/xml, */*
    ## -> 
    ## <- HTTP/1.1 302 Found
    ## <- Content-Type: text/html; charset=utf-8
    ## <- X-Frame-Options: SAMEORIGIN
    ## <- X-XSS-Protection: 1; mode=block
    ## <- X-Content-Type-Options: nosniff
    ## <- Location: https://pp-990.s3.amazonaws.com/2015_06_T/13-1624100_990T_201406.pdf?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAI7C6X5GT42DHYZIA%2F20171202%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20171202T002756Z&X-Amz-Expires=1800&X-Amz-SignedHeaders=host&X-Amz-Signature=f90caae6a793239be8342d0ecbd96ff6f80b1821921cfadae00f78129a38a79f
    ## <- Cache-Control: max-age=0, private, must-revalidate
    ## <- Content-Encoding: gzip
    ## <- Transfer-Encoding: chunked
    ## <- Accept-Ranges: bytes
    ## <- Date: Sat, 02 Dec 2017 00:27:57 GMT
    ## <- Via: 1.1 varnish
    ## <- Connection: keep-alive
    ## <- X-Served-By: cache-bos8228-BOS
    ## <- X-Cache: MISS
    ## <- X-Cache-Hits: 0
    ## <- X-Timer: S1512174477.810292,VS0,VE194
    ## <- Vary: Accept,Accept-Encoding,Content-Type
    ## <- 
    ## -> GET /2015_06_T/13-1624100_990T_201406.pdf?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAI7C6X5GT42DHYZIA%2F20171202%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20171202T002756Z&X-Amz-Expires=1800&X-Amz-SignedHeaders=host&X-Amz-Signature=f90caae6a793239be8342d0ecbd96ff6f80b1821921cfadae00f78129a38a79f HTTP/1.1
    ## -> Host: pp-990.s3.amazonaws.com
    ## -> User-Agent: libcurl/7.54.0 r-curl/3.0 httr/1.3.1
    ## -> Accept-Encoding: gzip, deflate
    ## -> Accept: application/json, text/xml, application/xml, */*
    ## -> 
    ## <- HTTP/1.1 200 OK
    ## <- x-amz-id-2: fycJGU5JQZ+o+aTOWFa86ZFyasv7XEH6RGsmXNo29+CtgDC8IZ438Ek61Bo/nUlRhk3fPKPXdMg=
    ## <- x-amz-request-id: AB2E8B3421A6B7BB
    ## <- Date: Sat, 02 Dec 2017 00:27:58 GMT
    ## <- Last-Modified: Thu, 13 Aug 2015 19:22:03 GMT
    ## <- ETag: "fd89377252531684bec1828db05c54e6"
    ## <- Cache-Control: no-cache, no-store
    ## <- Content-Language: en
    ## <- Accept-Ranges: bytes
    ## <- Content-Type: application/pdf
    ## <- Content-Length: 537542
    ## <- Server: AmazonS3
    ## <- 
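The same redirect trail can also be read back from the response object instead of the verbose log. The sketch below is a hypothetical helper (`redirect_chain` is not part of httr); it assumes `res` carries an `all_headers` list with one entry per hop, each holding a `status` and a `headers` list, which is how httr response objects are laid out.

```r
# Summarise each hop that was followed to produce a response.
# Assumption: `res$all_headers` is a list with one element per hop,
# each containing `status` and `headers` (as in httr responses).
redirect_chain <- function(res) {
  hops <- res$all_headers
  data.frame(
    # HTTP status code of each hop, e.g. 302 then 200
    status = vapply(hops, function(h) as.integer(h$status), integer(1)),
    # Redirect target announced by the hop, NA when there is none
    location = vapply(
      hops,
      function(h) {
        loc <- h$headers[["location"]]
        if (is.null(loc)) NA_character_ else loc
      },
      character(1)
    ),
    stringsAsFactors = FALSE
  )
}
```

Calling `redirect_chain(res)` on the response above should give one row for the 302 from projects.propublica.org and one row for the final 200 from S3.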
    

    Here is the content of the response object:

    res
    ## Response [https://pp-990.s3.amazonaws.com/2015_06_T/13-1624100_990T_201406.pdf?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAI7C6X5GT42DHYZIA%2F20171202%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20171202T002756Z&X-Amz-Expires=1800&X-Amz-SignedHeaders=host&X-Amz-Signature=f90caae6a793239be8342d0ecbd96ff6f80b1821921cfadae00f78129a38a79f]
    ##   Date: 2017-12-02 00:27
    ##   Status: 200
    ##   Content-Type: application/pdf
    ##   Size: 538 kB
    ## <ON DISK>  file.pdf
    

    And here is evidence that the file was downloaded:

    file.info("file.pdf")
    ##            size isdir mode               mtime               ctime               atime uid gid    uname grname
    ## file.pdf 537542 FALSE  644 2017-12-01 19:27:57 2017-12-01 19:27:57 2017-12-01 19:27:58 xxx  xx xxxxxxxx  xxxxx
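Since the question mentions a download that would not open, a size check alone may not be convincing. As a rough sanity check (not a full validation), one can confirm that the file starts with the `%PDF-` header that PDF files begin with. `looks_like_pdf` is a hypothetical helper name used only for this sketch.

```r
# Return TRUE when the file at `path` begins with the PDF header "%PDF-".
# This only inspects the first five bytes; it does not validate the rest.
looks_like_pdf <- function(path) {
  if (!file.exists(path)) {
    return(FALSE)
  }
  magic <- readBin(path, what = "raw", n = 5L)
  identical(magic, charToRaw("%PDF-"))
}
```

For the file saved above, `looks_like_pdf("file.pdf")` would be the call to make; a saved HTML error page or a truncated download would typically fail this check.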
    

    Leave out verbose() in "production".
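A quieter variant for scripted use might look like the sketch below. `download_filing` and its arguments are hypothetical names chosen for illustration; the httr calls it relies on (`GET()`, `write_disk()` with its `overwrite` argument, and `stop_for_status()`) are the package's own functions, and httr must be installed for the function to run.

```r
# Download `url` to `dest` without verbose logging.
# Assumption: httr is installed; redirects are followed by GET() itself,
# as the verbose trace above illustrates.
download_filing <- function(url, dest, overwrite = TRUE) {
  res <- httr::GET(url, httr::write_disk(dest, overwrite = overwrite))
  # Turn HTTP errors (4xx/5xx) into an R error instead of failing silently
  httr::stop_for_status(res)
  invisible(dest)
}
```

It would be called as `download_filing(pp_doc_url, "file.pdf")`. Setting `overwrite = TRUE` by default is a design choice of this sketch so that repeated runs do not stop on an existing file.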
