Anomaly detection using parallel computing in R: answers

【Question title】: Anomaly detection using parallel computing in R
【Posted】: 2021-08-25 23:51:37
【Question】:

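I have a data frame with more than 300 million rows, and I want to detect anomalies within each group, where a group is defined by country and ID. I wrote the following code to detect the anomalous points, but it takes a very long time. Could you suggest any other options to make it faster? Data frame format: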

 df <- data.frame("id" = 1:n, "country" = c("US", ..), "date" = c("2021-01-01", ..), "value" = c(10, ....))


library(plyr)        # ddply
library(dplyr)       # filter, %>%
library(doParallel)  # registerDoParallel

registerDoParallel()
groupColumns <- c("country", "id")
system.time(temp_anom <- ddply(df, groupColumns, function(x) {
  x <- x[, c("date", "value")]
  # 10th and 90th percentiles of the group's values
  resid.q <- quantile(x$value, prob = c(0.1, 0.90))
  # spread between the two percentiles (not the classical IQR)
  iqr <- diff(resid.q)
  limits <- resid.q + 3 * iqr * c(-1, 1)
  lower_bound <- limits[1]
  upper_bound <- limits[2]
  outlier_dip_index <- dplyr::filter(x, value < lower_bound) %>% data.frame()
  if (nrow(outlier_dip_index) > 0) {
    outlier_dip_index$status <- "dip"
  }
  outlier_spike_index <- dplyr::filter(x, value > upper_bound) %>% data.frame()
  if (nrow(outlier_spike_index) > 0) {
    outlier_spike_index$status <- "spike"
  }
  # combine spikes and dips; rbind ignores zero-row data frames
  rbind(outlier_spike_index, outlier_dip_index)
}, .parallel = TRUE))

【Question comments】:

Tags: r doparallel

【Solution 1】:

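To speed up the parallel computation, we need to find the optimal number of cores for doParallel; in this case the best value was 5. Simply rewriting the code as in this answer's listing gave a huge improvement in speed.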
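The answer does not say how the value 5 was found. One way to look for a good core count is to time the same job under several settings, roughly as in the sketch below. This is only an illustration: the workload inside `foreach` is a placeholder, `time_with_cores` is a hypothetical helper name, and the cap of 4 candidates is an arbitrary choice to keep the run short.

```r
library(doParallel)  # attaches foreach and parallel as dependencies

# Candidate core counts; capped at 4 here purely to keep the sketch quick.
n_avail <- parallel::detectCores()
if (is.na(n_avail)) n_avail <- 1L
candidates <- seq_len(min(4L, n_avail))

# Time a placeholder workload with k workers registered.
# (Hypothetical helper; swap the foreach body for the real per-group job.)
time_with_cores <- function(k) {
  registerDoParallel(cores = k)
  on.exit(stopImplicitCluster(), add = TRUE)
  system.time(
    foreach(i = 1:40, .combine = c) %dopar% sum(sqrt(seq_len(2e5) + i))
  )[["elapsed"]]
}

timings <- data.frame(
  cores   = candidates,
  elapsed = vapply(candidates, time_with_cores, numeric(1))
)

# Core count with the smallest elapsed time on this machine and workload.
best <- timings$cores[which.min(timings$elapsed)]
print(timings)
```

The timings depend on the machine and on the workload, so it is probably best to run this on a representative sample of the real data rather than trust a toy job. The answer's own code, with the value it settled on, follows.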

doParallel::registerDoParallel(cores = 5)

# The outer loop over ids runs in parallel; the inner loop computes bounds per country.
system.time(temp_anom <- plyr::ldply(unique(df$id), function(ids) {
  title_dataset <- df[which(df$id == ids), ]
  result_dataset <- plyr::ldply(unique(title_dataset$country), function(iso) {
    country_dataset <- title_dataset[which(title_dataset$country == iso), ]
    # 10th and 90th percentiles of this (id, country) group
    resid.q <- quantile(country_dataset$value, prob = c(0.1, 0.90))
    iqr <- diff(resid.q)
    limits <- resid.q + 3 * iqr * c(-1, 1)
    data.frame(country = iso, lower_bound = limits[1], upper_bound = limits[2])
  })
  result_dataset$id <- ids
  result_dataset
}, .parallel = TRUE))
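Note that the listing above returns only a lower and upper bound per group, while the question wanted the outlying rows labelled "dip" or "spike". A minimal base-R sketch of that last step is shown here on toy data. The column names follow the question's data frame; `group_bounds` and `flag_group` are hypothetical helper names, and the toy values are made up for illustration.

```r
# Toy data: two ids in one country, 20 rows each, with one planted outlier per id.
df <- data.frame(
  id      = rep(c(1, 2), each = 20),
  country = "US",
  date    = rep(as.Date("2021-01-01") + 0:19, times = 2),
  value   = c(10:28, 500, -400, 10:28)
)

# Bounds as in the question: 10th/90th percentiles widened by 3x their spread.
group_bounds <- function(v) {
  q <- quantile(v, probs = c(0.1, 0.9), names = FALSE)
  q + 3 * diff(q) * c(-1, 1)
}

# Label rows outside the bounds and keep only those rows.
flag_group <- function(g) {
  lim <- group_bounds(g$value)
  g$status <- ifelse(g$value < lim[1], "dip",
              ifelse(g$value > lim[2], "spike", NA_character_))
  g[!is.na(g$status), ]
}

# One data frame per (country, id) group, then flag and recombine.
groups   <- split(df, list(df$country, df$id), drop = TRUE)
outliers <- do.call(rbind, lapply(groups, flag_group))
rownames(outliers) <- NULL
print(outliers)
```

On this toy data, the planted values 500 and -400 should be the only rows returned. The `lapply` call is the part that could be handed to a parallel backend, since each group is processed independently.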

【Comments】: