Parallel processing in R doesn't use all cores

【Question title】: Parallel processing in R doesn't use all cores
【Posted】: 2017-09-09 19:40:20
【Question description】:

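I am trying to parallelize a process that runs over the rows of a matrix. For each row, i.e. one species, I want it to extract and write a file (raster) corresponding to the distribution of that species within its habitats.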

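The habitat layer is a raster file, and each species distribution is a polygon (or a set of polygons) from a shapefile. I first convert the species polygons to a raster, then extract the species' habitats (stored in a matrix that matches species habitat codes to habitat raster values), and finally intersect (multiply) the distribution and the habitats.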

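In addition, I want to produce a richness (number of species map) file (raster). To do so, I add (sum) each final species distribution to an empty raster (with zero values). I wrote the following function: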

extract_habitats=function(k,spp_data,spp_polygons,sep,habitat_codes,cover)
{
  #Libraries
  library(rgdal)
  library(raster)
  #raster file with zeros
  richness_cur=raster("richness_current.tif")
  #Selection of species polygons
  rows=as.numeric(which(as.character(spp_polygons@data$binomial)==
                          as.character(spp_data$binomial[k])))
  spp_poly=spp_polygons[rows,]
  #Convert polygon(s) to raster
  spp_poly=rasterize(spp_poly,cover,1,background=0)
  #Match species habitats codes with habitats raster values
  habs=as.character(spp_data$hab_code[k])
  habs=unlist(strsplit(habs, split=sep))#habitat codes are separated by a ";"
  cov_classes=as.numeric(as.character(
    habitat_codes[which(as.character(habitat_codes[,1])%in%habs),2]))
  #Intersect species distributions with habitats
  cov_mask=spp_poly*cover
  #Extract species habitats
  cov_mask=Which(cov_mask%in%cov_classes)
  writeRaster(cov_mask,paste(spp_data$binomial[k]," current.tif",sep=""))
  #Sum species richness
  richness_cur=richness_cur+cov_mask
  return (richness_cur)
}

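I tried to parallelize the process with the clusterApply and foreach functions. However, with neither of them could I return a raster object from the function (which is straightforward in a regular loop) in order to add the species richness sum to that object. So this is my first question. 1. Does anyone know how to return an object other than a list, matrix, or vector in a parallelized process?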

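I worked around this by writing the richness file in each "iteration". However, this option makes the process slower, so it is not ideal for me. The function was then rewritten as follows: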

extract_habitats=function(k,spp_data,spp_polygons,sep,habitat_codes,cover)
{
  #Libraries
  library(rgdal)
  library(raster)
  #raster file with zeros
  richness_cur=raster("richness_current.tif")
  #Selection of species polygons
  rows=as.numeric(which(as.character(spp_polygons@data$binomial)==
                          as.character(spp_data$binomial[k])))
  spp_poly=spp_polygons[rows,]
  #Convert polygon(s) to raster
  spp_poly=rasterize(spp_poly,cover,1,background=0)
  #Match species habitats codes with habitats raster values
  habs=as.character(spp_data$hab_code[k])
  habs=unlist(strsplit(habs, split=sep))#habitat codes are separated by a ";"
  cov_classes=as.numeric(as.character(
    habitat_codes[which(as.character(habitat_codes[,1])%in%habs),2]))
  #Intersect species distributions with habitats
  cov_mask=spp_poly*cover
  #Extract species habitats
  cov_mask=Which(cov_mask%in%cov_classes)
  writeRaster(cov_mask,paste(spp_data$binomial[k]," current.tif",sep=""))
  #Sum species richness
  richness_cur=richness_cur+cov_mask
  writeRaster(richness_cur,"richness_current.tif")
}

The complete code to run the parallelization is:

#Libraries
library(doParallel)
library(xlsx)
library(rgdal)
library(raster)
#Number of cores
no_cores=detectCores()-1
#Initiate cluster
cl=makeCluster(no_cores,type="PSOCK")
registerDoParallel(cl)

#Table with name and habitat information (columns) for each species (rows)
spp_data=read.xlsx2("species_file.xls",sheetIndex=1)
#Shape file with species distributions as polygons
spp_polygons=readOGR("distributions.shp")
#Separation symbol for species habitats stored in spp_data
sep=";"
#Table joining species habitat codes with habitats raster values
habitat_codes=read.xlsx2("spp_habitats_final.xls",sheetIndex=1)
#Habitats raster
cover=raster("Z:/Data/cover_2015_proj_fixed_reclas_1km.tif")

#Parallelization
foreach(k=1:nrow(spp_data)) %dopar% extract_habitats(k=k,
                                                     spp_data=spp_data,
                                                     spp_polygons=spp_polygons,sep=sep,
                                                     habitat_codes=habitat_codes,
                                                     cover=cover)
stopImplicitCluster()
stopCluster(cl)

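The parallelized process runs; however, it does not work as I expected, because it does not use all the cores (image: processors working). What the parallelization does is start 39 processes, one per core (image: processes opened), but it does not write the files one by one, which is what I would expect from a regular loop. Instead it writes blocks of 39 files at once (which I can understand), but it takes a lot of time (because it seems to be working on only a few cores), even more than running the regular loop: the regular loop writes one file every two to three minutes, whereas a block of 39 files is written about once an hour.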

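So this is my second set of questions. 2. What am I doing wrong? 3. Why is it not using all 39 processors, or, if it is using them, why not at full load? 4. Why does it not start another task as soon as it finishes one (I guess because it always writes files in blocks of 39)?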

Thanks in advance for your help.

Cheers,

Jaime

【Question comments】:

Without data to reproduce your example, it is difficult to help you.

Tags: r parallel-processing parallel.foreach

【Solution 1】:

Does anyone know how to return an object other than a list, matrix, or vector in a parallelized process?

As for your first question, it doesn't make sense. What kind of object do you want to return? A list can contain any R object.
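To illustrate the point about lists, here is a minimal sketch using only the base parallel package. The function `species_mask` is a hypothetical stand-in for the per-species work, and plain matrices play the role of raster masks (an assumption made so the example runs without spatial packages). Each worker returns one object, the results come back as a list, and the sum is taken on the master.

```r
library(parallel)

# Hypothetical stand-in for the per-species work: returns a 0/1 matrix
# (playing the role of a raster mask) for species index k.
species_mask <- function(k, n_row = 3, n_col = 4) {
  m <- matrix(0, nrow = n_row, ncol = n_col)
  m[seq_len(k)] <- 1  # mark the first k cells as occupied
  m
}

cl <- makeCluster(2)                       # two PSOCK workers
masks <- parLapply(cl, 1:3, species_mask)  # a list with one matrix per species
stopCluster(cl)

# Sum the per-species masks on the master to obtain the richness layer
richness <- Reduce(`+`, masks)
```

With this pattern no worker has to read or write a shared richness file. If foreach is preferred, its `.combine` argument should allow the same reduction, though that has not been checked against the raster objects in the question.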

Why is it not using all 39 processors, or, if it is using them, why not at full load?

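There are many potential reasons. Looking at your code, one reason could be that it is disk-I/O bound, since you are writing a lot of images to disk. Another potential reason is a memory size limit.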
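One rough way to check the memory hypothesis is to estimate how much RAM the inputs need once every worker holds its own copy, which is what happens for in-memory objects exported to a PSOCK cluster. The sketch below uses a hypothetical matrix `big_input` in place of the real data; the worker count of 39 is taken from the question.

```r
# Hypothetical stand-in for a large in-memory input (e.g. the species polygons)
big_input <- matrix(runif(1e6), nrow = 1000)

n_workers <- 39  # worker count reported in the question

# Size of one copy, and of one copy per worker, in bytes
bytes_one_copy <- as.numeric(object.size(big_input))
bytes_all_copies <- bytes_one_copy * n_workers

cat("one copy:  ", format(object.size(big_input), units = "MB"), "\n")
cat("all copies:", round(bytes_all_copies / 1024^2), "MB\n")
```

If the total approaches the machine's physical memory, the workers may start swapping, which would be consistent with the low CPU usage described in the question. Using fewer workers is a simple way to test this.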

What am I doing wrong?

If you are on Linux (or anything other than Windows), you should use the mclapply function from the base R parallel package.
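A minimal sketch of the mclapply suggestion follows. The function `process_species` is a hypothetical placeholder for `extract_habitats`, and the core count of 2 is arbitrary. Because mclapply forks the current R process, the workers can see objects already loaded in the master session without an explicit export step.

```r
library(parallel)

# Hypothetical stand-in for extract_habitats(): any function of the index k
process_species <- function(k) k^2

# Forking is unavailable on Windows, where mc.cores must be 1
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Returns a list with one element per index, like lapply()
results <- mclapply(1:4, process_species, mc.cores = n_cores)
```

As with lapply, the result is a list, so each element could just as well be a raster object that is summed afterwards.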

【Comments】: