How to automatically download multiple images with broken links in R?

【Question title】: How to automatically download multiple images with broken links in R?
【Posted】: 2021-08-30 22:53:35
【Question】:

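The goal here is to download a bunch of images, but some of the image URLs are broken. What I would like to do is modify the code with a simple next statement, so that if a link returns anything other than status code 200 it skips to the next URL (or, if the link returns 404, skips to the next). I am not sure how to write that with vectorized code, and when I try to write it as a for loop, I can't figure out how to initialize a vector of type "picture" to write into inside the loop. So now I am reading through the function's source, trying to find where the error is raised and where a next statement or something similar could go, in case a next statement can't be used in some form of vectorized code: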

Simple vectorized code:

library(magick)
library(rsvg)

image_urls <- na.omit(articles$url_to_image)
image_content <- image_read(image_urls)

Opaque "function" code (where is the error raised? It is just a series of calls that read different kinds of image input):

function (path, density = NULL, depth = NULL, strip = FALSE, 
    coalesce = TRUE, defines = NULL) 
{
    if (is.numeric(density)) 
        density <- paste0(density, "x", density)
    density <- as.character(density)
    depth <- as.integer(depth)
    
    #doesn't seem relevant: https://rdrr.io/cran/magick/src/R/defines.R
    defines <- validate_defines(defines)
    
    #test whether the object is an instance of an S4 class and a function to test inheritance relationships between object and class -- seems relevant maybe?
    image <- if (isS4(path) && methods::is(path, "Image"))
      {
        #bioconductor class
        convert_EBImage(path)
    }
    else if (inherits(path, "nativeRaster") || (is.matrix(path) && 
        is.integer(path))) {
        image_read_nativeraster(path)
    }
    else if (inherits(path, "cimg")) {
        image_read_cimg((path))
    }
    else if (grDevices::is.raster(path)) {
        image_read_raster2(path)
    }
    else if (is.matrix(path) && is.character(path)) {
        image_read_raster2(grDevices::as.raster(path))
    }
    else if (is.array(path)) {
        image_readbitmap(path)
    }
    else if (is.raw(path)) {
        magick_image_readbin(path, density, depth, strip, defines)
    }
    else if (is.character(path) && all(nchar(path))) {
        path <- vapply(path, replace_url, character(1))
        path <- if (is_windows()) {
            enc2utf8(path)
        }
        else {
            enc2native(path)
        }
        magick_image_readpath(path, density, depth, strip, defines)
    }
    else {
        stop("path must be URL, filename or raw vector")
    }
    if (is.character(path) && !isTRUE(magick_config()$rsvg)) {
        if (any(grepl("\\.svg$", tolower(path))) || any(grepl("svg|mvg", 
            tolower(image_info(image)$format)))) {
            warning("ImageMagick was built without librsvg which causes poor qualty of SVG rendering.\nFor better results use image_read_svg() which uses the rsvg package.", 
                call. = FALSE)
        }
    }
    if (isTRUE(coalesce) && length(image) > 1 && identical("GIF", 
        toupper(image_info(image)$format[1]))) {
        return(image_coalesce(image))
    }
    return(image)
}

When a link is broken, it returns: Error in download_url(path) : Failed to download 'link' (HTTP 404)

Possible loop code?

library(magick)
library(rsvg)

image_urls <- na.omit(articles$url_to_image)

image_content <- c() #doesn't work, nor does NULL 
#nor does setting to typeof image_content <- image_url[1]

for(i in 1:length(image_urls)){
  image_content[i] = image_read(image_urls[i])
    if(grepl('404', download_path(url), fixed = TRUE) == T)
    next
}

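But again, I can't initialize the container, and I don't know whether the loop will break before it reaches the if statement.

Maybe I should use another library... or just another language?

Here is some sample data: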

data <- c("https://img-s-msn-com.akamaized.net/tenant/amp/entityid/AAOgEbG.img?h=488&w=799&m=6&q=60&o=f&l=f", 
"https://img-s-msn-com.akamaized.net/tenant/amp/entityid/AAOh6FW.img?h=533&w=799&m=6&q=60&o=f&l=f", 
"https://img-s-msn-com.net/tenant/amp/entityid/AAOgIFh.img?h=450&w=799&m=6&q=60&o=f&l=f&x=570&y")

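【Comments】:

(1) articles$url_to_image is not defined. Should the whole na.omit() call be replaced with your sample data, data? (2) download_path(url) is not defined.
Thanks for your comments. download_path(url) is an idealized function based on the unclear error I received. articles$url_to_image should be replaced with the sample data; you are correct, thanks for pointing that out.

Tags: r web-scraping image-processing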

【Solution 1】:

You can try the try function:

image_urls <- data

image_content <- lapply(seq_along(image_urls), function(i) try(image_read(image_urls[i])))

This stores your images in a list. Using

image_content[[1]]

gives you access to the first image. If an error occurs, such as

Error in curl::curl_fetch_memory(url) : 
Could not resolve host: img-s-msn-com.net simpleError in curl::curl_fetch_memory(url)

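the failing URLs are skipped (their list elements hold an object of class "try-error" instead of an image), and the loop continues with the next URL.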
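To work only with the images that actually downloaded, the failed elements can be filtered out afterwards. The following is a minimal sketch, assuming the list was built with try() as in this answer; drop_failed is a hypothetical helper name, not part of magick.

```r
# Hypothetical helper: remove the elements of a list that hold a
# "try-error" object, i.e. the calls that failed inside try().
drop_failed <- function(results) {
  failed <- vapply(results, inherits, logical(1), what = "try-error")
  results[!failed]
}

# Intended use with the answer's code (needs magick and network access):
# image_content <- lapply(image_urls, function(u) try(image_read(u), silent = TRUE))
# image_content <- drop_failed(image_content)
```

Passing silent = TRUE to try() only suppresses the printed error message; the "try-error" object is still returned, so the filter keeps working.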

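【Comments】:

I thought of the try function... is there any reason to prefer lapply over vapply?
I rarely use vapply, and I wanted the output to be a list. So lapply is my go-to.
That makes total sense. My question was a dumb one: because of the image format, the output can't be a vector... my bad... :/ Sorry.
The question wasn't dumb. No need to be sorry. :-)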
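If a for loop with next is still preferred, a list is the natural container there as well, since the images are not atomic values. Below is a minimal sketch under that assumption; read_images_skip_broken and its reader argument are hypothetical names introduced for illustration, with reader defaulting to magick::image_read.

```r
# Hypothetical helper: read each URL in a for loop and skip the broken ones.
# `reader` is any function that takes one URL and either returns an object
# or signals an error; it defaults to magick::image_read.
read_images_skip_broken <- function(urls, reader = magick::image_read) {
  images <- vector("list", length(urls))   # pre-allocated list, not c()
  for (i in seq_along(urls)) {
    img <- tryCatch(reader(urls[i]), error = function(e) NULL)
    if (is.null(img)) next                 # broken link: leave the slot empty
    images[[i]] <- img
  }
  names(images) <- urls
  # Keep only the slots that were filled
  images[!vapply(images, is.null, logical(1))]
}

# Intended use (needs magick and network access):
# image_content <- read_images_skip_broken(data)
```

The error is caught inside tryCatch, so the loop never stops at a broken URL and the next statement is always reached.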

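【Solution 2】:

Another option is to use purrr::safely to create a "safe" version of image_read, which returns a result and an error for each URL.

The results can be extracted from the list with something like purrr::map(y, `[[`, 'result').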

# two working links and one broken
urls <- c("https://img-s-msn-com.akamaized.net/tenant/amp/entityid/AAOgEbG.img?h=488&w=799&m=6&q=60&o=f&l=f", 
          "https://img-s-msn-com.akamaized.net/tenant/amp/entityid/AAOh6FW.img?h=533&w=799&m=6&q=60&o=f&l=f", 
          "https://img-s-msn-com.net/tenant/amp/entityid/AAOgIFh.img?h=450&w=799&m=6&q=60&o=f&l=f&x=570&y")

# create 'safe' function
image_read_safe <- purrr::safely(magick::image_read)

# apply 'safe' function
y <- purrr::map(urls, image_read_safe)

y
#> [[1]]
#> [[1]]$result
#>   format width height colorspace matte filesize density
#> 1   JPEG   799    488       sRGB FALSE    39743   96x96
#> 
#> [[1]]$error
#> NULL
#> 
#> 
#> [[2]]
#> [[2]]$result
#>   format width height colorspace matte filesize density
#> 1   JPEG   799    533       sRGB FALSE    53910   96x96
#> 
#> [[2]]$error
#> NULL
#> 
#> 
#> [[3]]
#> [[3]]$result
#> NULL
#> 
#> [[3]]$error
#> <simpleError in curl::curl_fetch_memory(url): Could not resolve host: img-s-msn-com.net>

Created on 2021-09-10 by the reprex package (v2.0.0)
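Since safely() keeps both the result and the error for every URL, the list can be split into the images that loaded and a small table of the failures. This is a minimal sketch, assuming y and urls are as defined in the answer's code; split_safe_results is a hypothetical helper name.

```r
# Hypothetical helper: split the output of purrr::map(urls, image_read_safe)
# into the images that loaded and a table of the URLs that failed.
split_safe_results <- function(y, urls) {
  ok <- purrr::map_lgl(y, function(r) is.null(r$error))
  list(
    images = purrr::map(y[ok], "result"),
    failed = data.frame(
      url = urls[!ok],
      message = purrr::map_chr(y[!ok], function(r) conditionMessage(r$error))
    )
  )
}

# Intended use with the answer's objects:
# out <- split_safe_results(y, urls)
# out$images   # list of successfully read images
# out$failed   # one row per broken URL with its error message
```

If the error messages are not needed, purrr::possibly(magick::image_read, otherwise = NULL) followed by purrr::compact() is a shorter route to the same list of images.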

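【Comments】:

Nice one. I hadn't seen safely before.
In this case, keeping track of the errors is actually very helpful to me. However, when I print y, I don't get the format information about the images. Curious why that is.
On my machine, using RStudio, printing the list writes the output you see above to the console and shows the images in the preview pane. Not sure why reprex didn't capture the images, or why it might display differently on your computer.
@nniloc It displayed correctly this time. I wonder if you could update your answer to use the new data from my question? Really sorry for the trouble!
No problem. Edited to use the new URLs.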