Answers to: Vectorizing loop operation in R

【Question title】: Vectorizing loop operation in R
【Posted】: 2020-06-03 10:17:04
【Question description】:

I have a balanced data frame in long format (df1) with 7 columns:

df1 <- structure(list(Product_ID = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 
3, 3, 3, 3), Product_Category = structure(c(1L, 1L, 1L, 1L, 1L, 
2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L), .Label = c("A", "B"), class = "factor"), 
    Manufacture_Date = c(1950, 1950, 1950, 1950, 1950, 1960, 
    1960, 1960, 1960, 1960, 1940, 1940, 1940, 1940, 1940), Control_Date = c(1961L, 
    1962L, 1963L, 1964L, 1965L, 1961L, 1962L, 1963L, 1964L, 1965L, 
    1961L, 1962L, 1963L, 1964L, 1965L), Country_Code = structure(c(1L, 
    1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), .Label = c("ABC", 
    "DEF", "GHI"), class = "factor"), Var1 = c(NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Var2 = c(NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), row.names = c(NA, 
15L), class = "data.frame")

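Each Product_ID in this dataset is associated with a unique Product_Category, Country_Code and Manufacture_Date, and is tracked over time (Control_Date). Product_Category has two possible values (A or B); Country_Code and Manufacture_Date have 190 and 90 unique values, respectively. There are 400,000 unique Product_IDs, tracked over a 50-year period (Control_Date from 1961 to 2010), so df1 has 20,000,000 rows. The last two columns of this data frame are NA at the start and must be filled using data available in another data frame (df2):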

df2 <- structure(list(Product_ID = 1:6, Product_Category = structure(c(1L, 
2L, 1L, 1L, 1L, 2L), .Label = c("A", "B"), class = "factor"), 
    Manufacture_Date = c(1950, 1960, 1940, 1950, 1940, 2000), 
    Country_Code = structure(c(1L, 2L, 3L, 1L, 2L, 3L), .Label = c("ABC", 
    "DEF", "GHI"), class = "factor"), Year_1961 = c(5, NA, 10, 
    NA, 6, NA), Year_1962 = c(NA, NA, 4, 5, 3, NA), Year_1963 = c(8, 
    6, NA, 5, 6, NA), Year_1964 = c(NA, NA, 9, NA, 10, NA), Year_1965 = c(6, 
    NA, 7, 4, NA, NA)), row.names = c(NA, 6L), class = "data.frame")

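The second data frame holds further information about exactly the same 400,000 products, in wide format. Each row represents a unique product (Product_ID) together with its Product_Category, Manufacture_Date and Country_Code. There are 50 further columns (one for each year from 1961 to 2010) containing the measured value (or NA) for each product in that year.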

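What I want to do now is fill the Var1 and Var2 columns of the first data frame by doing some calculations on the data in the second data frame. More precisely, for each row of the first data frame (i.e. a product at Control_Date "t"), the last two columns are defined as follows:

Var1: the total number of products in df2 with the same Product_Category, Manufacture_Date and Country_Code that have a non-NA value in Year_t;

Var2: the total number of products in df2 with a different Product_Category but the same Manufacture_Date and Country_Code that have a non-NA value in Year_t.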
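
To make the two definitions concrete, here is a minimal base R sketch that evaluates them for a single row. The names df2_toy and count_for_row are hypothetical and only for illustration; df2_toy copies the first three year columns of the df2 sample above.

```r
# Toy copy of df2 from the question (first three year columns only).
df2_toy <- data.frame(
  Product_ID       = 1:6,
  Product_Category = c("A", "B", "A", "A", "A", "B"),
  Manufacture_Date = c(1950, 1960, 1940, 1950, 1940, 2000),
  Country_Code     = c("ABC", "DEF", "GHI", "ABC", "DEF", "GHI"),
  Year_1961 = c(5, NA, 10, NA, 6, NA),
  Year_1962 = c(NA, NA, 4, 5, 3, NA),
  Year_1963 = c(8, 6, NA, 5, 6, NA)
)

# Hypothetical helper: Var1 and Var2 for one combination of
# category, manufacture date, country and control year.
count_for_row <- function(category, manufacture, country, year, wide = df2_toy) {
  # TRUE where the product has a measurement in the requested year
  measured  <- !is.na(wide[[paste0("Year_", year)]])
  # TRUE where manufacture date and country both match
  same_rest <- wide$Manufacture_Date == manufacture & wide$Country_Code == country
  c(Var1 = sum(measured & same_rest & wide$Product_Category == category),
    Var2 = sum(measured & same_rest & wide$Product_Category != category))
}

# Product 1 (category A, made in 1950, country ABC) at Control_Date 1963
count_for_row("A", 1950, "ABC", 1963)
```

Products 1 and 4 share the triplet (A, 1950, ABC) and both have a value in Year_1963, so Var1 is 2 for that row; no category B product shares the same date and country, so Var2 is 0.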

My initial solution, using nested for loops, is as follows:

for (i in unique(df1$Product_ID)){

    Category <- unique(df1[which(df1$Product_ID==i),"Product_Category"])
    Opposite_Category <- ifelse(Category=="A","B","A")
    Manufacture <- unique(df1[which(df1$Product_ID==i),"Manufacture_Date"])
    Country <- unique(df1[which(df1$Product_ID==i),"Country_Code"])

    ID_Similar_Product <- df2[which(df2$Product_Category==Category & df2$Manufacture_Date==Manufacture & df2$Country_Code==Country),"Product_ID"]
    ID_Quasi_Similar_Product <- df2[which(df2$Product_Category==Opposite_Category & df2$Manufacture_Date==Manufacture & df2$Country_Code==Country),"Product_ID"]

    for (j in unique(df1$Control_Date)){
        df1[which(df1$Product_ID==i & df1$Control_Date==j),"Var1"] <- length(which(!is.na(df2[which(df2$Product_ID %in% ID_Similar_Product),paste0("Year_",j)])))
        df1[which(df1$Product_ID==i & df1$Control_Date==j),"Var2"] <- length(which(!is.na(df2[which(df2$Product_ID %in% ID_Quasi_Similar_Product),paste0("Year_",j)])))
    }
}

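The problem with this approach is that it takes a very long time to run. So I was wondering whether anyone could suggest a vectorized version that does the job in less time.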

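【Question discussion】:

Hey! Please give us a small sample of your data. Post the output of dput(yourDataframe[1:20,]).
Please make your question reproducible.
Thanks for your replies, Georgery and Wimpel. I've just edited my question and added a reproducible example.
Hi Sohrab! You say the second data frame contains 400,000 records, where each row represents a unique product (Product_ID) together with its Product_Category, Manufacture_Date and Country_Code. But the number of unique combinations of Product_Category (2), Manufacture_Date (90) and Country_Code (190) is only 34,200, not 400,000. Can you clarify? Is there another variable that increases the size of the data?
Hello @Edward! Thanks for your comment. The triplet (Product_Category, Manufacture_Date, Country_Code) is not necessarily unique in the second data frame. What uniquely identifies each row of the second data frame is really the quadruple (Product_ID, Product_Category, Manufacture_Date, Country_Code), i.e. two different products (with different IDs) may share exactly the same Product_Category, Manufacture_Date and Country_Code. That is why the number of rows is not limited to 34,200. Hope this clarifies!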

Tags: r loops vectorization data-cleaning

【Solution 1】:

See if this does what you want. I'm using the data.table package because you have a fairly large (20M-row) dataset.

library(data.table)

setDT(df1)
setDT(df2)

# Set keys on the "triplet" to speed up everything
setkey(df1, Product_Category, Manufacture_Date, Country_Code)
setkey(df2, Product_Category, Manufacture_Date, Country_Code)

# Omit the Var1 and Var2 from df1
df1[, c("Var1", "Var2") := NULL]

# Reshape df2 to long form
df2.long <- melt(df2, measure=patterns("^Year_"))

# Split "variable" at the "_" to extract 4-digit year into "Control_Date" and delete leftovers.
df2.long[, c("variable","Control_Date") := tstrsplit(variable, "_", fixed=TRUE)][
  , variable := NULL]

# Group by triplet, Var1 = count of non-NA values (over all years), join with...
#   (group by doublet, N = count of non-NA values), update Var2 = N - Var1.
df2_N <- df2.long[, .(Var1 = sum(!is.na(value))), 
                   by=.(Product_Category, Manufacture_Date, Country_Code)][
                     df2.long[, .(N = sum(!is.na(value))), 
                              by=.(Manufacture_Date, Country_Code)], 
                     Var2 := N - Var1, on=c("Manufacture_Date", "Country_Code")]

# Update join: df1 with df2_N
df1[df2_N, c("Var1","Var2") := .(i.Var1, i.Var2), 
           on = .(Product_Category, Manufacture_Date, Country_Code)]

df1
   Product_ID Product_Category Manufacture_Date Control_Date Country_Code Var1 Var2
 1:          3                A             1940         1961          GHI    4    0
 2:          3                A             1940         1962          GHI    4    0
 3:          3                A             1940         1963          GHI    4    0
 4:          3                A             1940         1964          GHI    4    0
 5:          3                A             1940         1965          GHI    4    0
 6:          1                A             1950         1961          ABC    6    0
 7:          1                A             1950         1962          ABC    6    0
 8:          1                A             1950         1963          ABC    6    0
 9:          1                A             1950         1964          ABC    6    0
10:          1                A             1950         1965          ABC    6    0
11:          2                B             1960         1961          DEF   NA   NA
12:          2                B             1960         1962          DEF   NA   NA
13:          2                B             1960         1963          DEF   NA   NA
14:          2                B             1960         1964          DEF   NA   NA
15:          2                B             1960         1965          DEF   NA   NA

df2
   Product_ID Product_Category Manufacture_Date Country_Code Year_1961 Year_1962 Year_1963 Year_1964 Year_1965
1:          5                A             1940          DEF         6         3         6        10        NA
2:          3                A             1940          GHI        10         4        NA         9         7
3:          1                A             1950          ABC         5        NA         8        NA         6
4:          4                A             1950          ABC        NA         5         5        NA         4
5:          2                B             1940          DEF        NA        NA         6        NA        NA
6:          6                B             2000          GHI        NA        NA        NA        NA        NA
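
The code above counts non-NA values over all years for each triplet, whereas the question defines Var1 and Var2 per Control_Date. A possible per-year variant is sketched below; it is an adaptation under stated assumptions, not part of the original answer. The names wide, long, m, per_cat and per_all are hypothetical, the toy data copies the df2 values from the question's dput, and rows without a matching group are set to 0 to mirror what the nested loop would produce.

```r
library(data.table)

# Toy copy of df2 (values taken from the question's dput)
wide <- data.table(
  Product_ID       = 1:6,
  Product_Category = c("A", "B", "A", "A", "A", "B"),
  Manufacture_Date = c(1950, 1960, 1940, 1950, 1940, 2000),
  Country_Code     = c("ABC", "DEF", "GHI", "ABC", "DEF", "GHI"),
  Year_1961 = c(5, NA, 10, NA, 6, NA),
  Year_1962 = c(NA, NA, 4, 5, 3, NA),
  Year_1963 = c(8, 6, NA, 5, 6, NA),
  Year_1964 = c(NA, NA, 9, NA, 10, NA),
  Year_1965 = c(6, NA, 7, 4, NA, NA)
)

# Toy stand-in for df1: products 1 to 3, one row per control year
long <- wide[Product_ID <= 3, .(Control_Date = 1961:1965),
             by = .(Product_ID, Product_Category, Manufacture_Date, Country_Code)]

# Wide -> long, keeping the year as an integer so it can be joined on
m <- melt(wide, measure.vars = patterns("^Year_"),
          variable.name = "Year", value.name = "value")
m[, Control_Date := as.integer(sub("^Year_", "", as.character(Year)))]

# Non-NA counts per (category, manufacture date, country, year)
per_cat <- m[, .(n = sum(!is.na(value))),
             by = .(Product_Category, Manufacture_Date, Country_Code, Control_Date)]
# Non-NA counts per (manufacture date, country, year), over all categories
per_all <- m[, .(N = sum(!is.na(value))),
             by = .(Manufacture_Date, Country_Code, Control_Date)]

# Update joins; rows without a match stay NA and are then set to 0
long[per_cat, Var1 := i.n,
     on = .(Product_Category, Manufacture_Date, Country_Code, Control_Date)]
long[per_all, N := i.N, on = .(Manufacture_Date, Country_Code, Control_Date)]
long[, Var1 := fcoalesce(Var1, 0L)]
# Products with a different category = all products minus same-category ones
long[, Var2 := fcoalesce(N, 0L) - Var1][, N := NULL]

long
```

Including Control_Date in both groupings is what makes the counts year-specific. Converting the year suffix to integer matters because the join would otherwise compare a character column with df1's integer Control_Date.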

【Discussion】: