【Question title】: Process large DataTable using data from another in R
【Posted】: 2020-12-29 19:33:49
【Question description】:

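I need to process a large data.table, setting a value in one column wherever two other columns match an entry in a separate (small) data.table of search terms. The value written to the column is taken from the search term that matched.

A match is found where:

a field in the main DT (status in the sample code) matches a str_detect pattern in the search term DT (searchStatus)
if a date range (from and to) is specified for that search term, the date from the main DT falls within it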

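I have simplified my use case in the code below, which uses the built-in storms dataset (10k records). The main DT has 6.5 million records and the search term DT has fewer than 200.

My two questions are:

Is there any way to make this more efficient (and reduce the estimated 90 minutes needed to process the 6.5 million records in the large data.table)?
How can I get the result as a data.table rather than as a transposed matrix (with columns as rows and rows as columns)?

Many thanks.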

library(tidyverse)
library(lubridate)
library(data.table)

data("storms")
setDT(storms)
storms <- storms %>% 
  mutate(date = paste(year, month, day, sep="-") %>% ymd() %>% as.Date())

# Create simple Search Term Data Table
searchStatus = c("hurricane", "orm", "sion")
replacementStatus = c("hurricane any time", "76 storm", "81 depression")
from = c("", "1976-01-01", "1976-01-01") 
to = c("", "1981-01-01", "1981-12-31")
searchTerms = data.frame(searchStatus, replacementStatus, from, to) 
setDT(searchTerms)

# Function to determine if any search terms apply to the given row
# Typically only one is expected, although not guaranteed, so the first is taken
# A replacedStatus field is added containing either
# - the parameterised replacementStatus where a matching search term has been identified
# - the original status where no match has been identified 
recodeValues <- function(row) { 
  date = row["date"]
  status = row["status"]
  recodeMatch <- head(
    searchTerms[str_detect(status, searchTerms$searchStatus) &
                  (searchTerms$from == "" | date >= as.Date(searchTerms$from)) &
                  (searchTerms$to == ""   | date <= as.Date(searchTerms$to))
                  ,]
      ,1)
  # I would use mult = "first" in the selection above but it seems to have no impact, so use head() instead
  row["replacedStatus"]  <- if_else(nrow(recodeMatch) > 0, recodeMatch[1]$replacementStatus, status)
  return(row)
}

cat("Starting Recoding", "\n")
processorTime <- proc.time()
result <- apply(storms, 1, recodeValues)
cat("Recoding time (Elapsed):", proc.time()[3] - processorTime[3], " seconds \n")
cat("Estimated Recoding time (Elapsed) for 6.5m records:", (proc.time()[3] - processorTime[3]) * 6500000 / nrow(storms) / 60, " minutes \n")
View(result)

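【Comments】:

I'm not entirely sure what you mean, but it finds the first match and uses it to set the new field, replacedStatus. It does work, apart from the transposed matrix format.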
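
On the transposed result mentioned above: `apply(X, 1, f)` first coerces the table to a matrix (a character matrix here, because `status` is text) and, when `f` returns a vector for each row, binds those vectors together as columns. The result therefore has one column per input row. The following is a minimal sketch on a small made-up table; `dt` and `addFlag` are hypothetical names used only for illustration, not part of the original code.

```r
library(data.table)

# Toy table standing in for storms (hypothetical data)
dt <- data.table(
  status = c("hurricane", "tropical storm", "tropical depression", "hurricane"),
  wind   = c(65, 40, 25, 80)
)

# A row function in the style of recodeValues: returns the row plus one extra field
addFlag <- function(row) {
  row["flag"] <- if (grepl("hurricane", row["status"])) "yes" else "no"
  row
}

res <- apply(dt, 1, addFlag)
dim(res)  # fields x input rows: each input row has become a column

# Transpose back and convert; every column is character at this point
fixed <- as.data.table(t(res))
fixed[, wind := as.numeric(wind)]  # column types have to be restored by hand
```

Transposing with `t()` recovers the expected layout, but the coercion to character remains, which is one more reason to avoid row-wise `apply()` on a data.table with mixed column types.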

Tags: r data.table tidyverse

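【Solution 1】:

If I understand correctly what you want, it probably makes more sense to iterate over the small "searchTerms" data.table than over the large "storms" one.

You could then do something like this, which makes better use of the power of data.table: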

library(tidyverse)
library(lubridate)
library(data.table)
data("storms")
setDT(storms)
storms <- storms %>% 
    mutate(date = paste(year, month, day, sep="-") %>% ymd() %>% as.Date())

# Create simple Search Term Data Table
searchStatus = c("hurricane", "orm", "sion")
replacementStatus = c("hurricane any time", "76 storm", "81 depression")
from = c("", "1976-01-01", "1976-01-01") 
to = c("", "1981-01-01", "1981-12-31")
searchTerms = data.frame(searchStatus, replacementStatus, from, to) 
setDT(searchTerms)

cat("Starting Recoding", "\n")
#> Starting Recoding
processorTime <- proc.time()
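for(i in seq_len(dim(searchTerms)[1])){
    x <- as.list(searchTerms[i])
    if(x$from == "") {
        # No date range given: match on the status pattern only
        storms[grepl(x$searchStatus, status),
               status:= x$replacementStatus]  
    } else {
        # grepl() returns a logical vector, so it can be combined with between() using &
        # (grep() returns row indices, which & would coerce and recycle incorrectly)
        storms[grepl(x$searchStatus, status) &
                   between(date, as.Date(x$from), as.Date(x$to)),
               status:= x$replacementStatus]
    }
}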

cat("Recoding time (Elapsed):", proc.time()[3] - processorTime[3], " seconds \n")
#> Recoding time (Elapsed): 0.034  seconds
cat("Estimated Recoding time (Elapsed) for 6.5m records:", (proc.time()[3] - processorTime[3]) * 6500000 / nrow(storms) / 60, " minutes \n")
#> Estimated Recoding time (Elapsed) for 6.5m records: 0.3787879  minutes
tail(storms[])
#>    name year month day hour  lat  long             status category wind
#> 1: Kate 2015    11  10   12 29.5 -75.4     tropical storm        0   60
#> 2: Kate 2015    11  10   18 31.2 -74.0     tropical storm        0   60
#> 3: Kate 2015    11  11    0 33.1 -71.3 hurricane any time        1   65
#> 4: Kate 2015    11  11    6 35.2 -67.6 hurricane any time        1   70
#> 5: Kate 2015    11  11   12 36.2 -62.5 hurricane any time        1   75
#> 6: Kate 2015    11  11   18 37.6 -58.2 hurricane any time        1   65
#>    pressure ts_diameter hu_diameter       date
#> 1:      998    103.5702      0.0000 2015-11-10
#> 2:      993    103.5702      0.0000 2015-11-10
#> 3:      990    161.1092     23.0156 2015-11-11
#> 4:      985    207.1404     23.0156 2015-11-11
#> 5:      980    345.2340     34.5234 2015-11-11
#> 6:      980    379.7574     46.0312 2015-11-11
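
Note that the loop above overwrites `status` in place, and a later search term can overwrite the result of an earlier one, whereas the question asks for a separate `replacedStatus` column in which the first matching term wins. The following is a minimal sketch of that variant, assuming that a blank `from` or `to` means the corresponding bound is open. The toy `main` and `terms` tables and the helper name `recodeFirstMatch` are hypothetical, and `grepl()` is used in place of `str_detect()` on the assumption that the patterns are plain substrings.

```r
library(data.table)

# Hypothetical helper: adds replacedStatus to `main` by reference,
# looping over the small terms table only.
recodeFirstMatch <- function(main, terms) {
  main[, replacedStatus := NA_character_]
  for (i in seq_len(nrow(terms))) {
    term <- terms[i]
    # Vectorised match over the whole large table for this one term
    hit <- grepl(term$searchStatus, main$status)
    if (term$from != "") hit <- hit & main$date >= as.Date(term$from)
    if (term$to   != "") hit <- hit & main$date <= as.Date(term$to)
    # Only fill rows not already claimed by an earlier term (first match wins)
    main[hit & is.na(replacedStatus), replacedStatus := term$replacementStatus]
  }
  # Rows with no matching term keep their original status
  main[is.na(replacedStatus), replacedStatus := status]
  invisible(main)
}

# Toy data (made up for illustration)
main <- data.table(
  status = c("hurricane", "tropical storm", "tropical storm",
             "tropical depression", "tropical depression"),
  date   = as.Date(c("1980-06-01", "1978-05-01", "1990-05-01",
                     "1981-06-01", "1975-01-01"))
)
terms <- data.table(
  searchStatus      = c("hurricane", "orm", "sion"),
  replacementStatus = c("hurricane any time", "76 storm", "81 depression"),
  from              = c("", "1976-01-01", "1976-01-01"),
  to                = c("", "1981-01-01", "1981-12-31")
)

recodeFirstMatch(main, terms)
main[]
```

Because the original `status` column is left untouched, each term is always tested against the raw value rather than against an earlier replacement.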

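【Discussion】:

Hi @user12728748, this is a great answer. I can see that the loop runs over the small table (so its relative inefficiency has no practical impact), while the large table is processed using data.table's efficient search capabilities. As a result, the run time is now 1/240 of the original, so processing my 6.5 million records will be very quick. It not only solves this problem but is also something I and others can learn from for other requirements. Thank you.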