【Question title】: Efficient subsetting of a data.frame based on another jagged data.frame
【Posted】: 2017-08-09 16:21:36
【Question】:

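I'm working on a project where I need to repeatedly subset a data.frame based on different combinations of attributes. Right now I'm using the merge function to subset the data.frame, which works well because I don't know until run time which attributes will be supplied. However, I'd like to know whether there is a faster way to create the subsets.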

require(data.table)
df <- structure(list(att1 = c("e", "a", "c", "a", "d", "e", "a", "d", "b", "a", "c", "a", "b", "e", "e", "c", "d", "d", "a", "e", "b"), 
                     att2 = c("b", "d", "c", "a", "e", "c", "e", "d", "e", "b", "e", "e", "c", "e", "a", "a", "e", "c", "b", "b", "d"), 
                     att3 = c("c", "b", "e", "b", "d", "d", "d", "c", "c", "d", "e", "a", "d", "c", "e", "a", "d", "e", "d", "a", "e"), 
                     att4 = c("c", "a", "b", "a", "e", "c", "a", "a", "b", "a", "a", "e", "c", "d", "b", "e", "b", "d", "d", "b", "e")), 
                .Names = c("att1", "att2", "att3", "att4"), class = "data.frame", row.names = c(NA, -21L))

#create combinations of attributes
#attributes to search through
cnames <- colnames(df)
att_combos <- data.table()
for(i in 2:length(cnames)){
  combos <- combn(cnames, i)
  for(x in 1:ncol(combos)){
    df_sub <- unique(df[,combos[1:nrow(combos), x]])
    att_combos <- rbind(att_combos, df_sub, fill = TRUE)
  }
}
rm(df_sub, i, x, combos, cnames)
for(i in 1:nrow(att_combos)){
  att_sub <- att_combos[i, ]
  att_sub <- att_sub[, !is.na(att_sub), with = FALSE]

  #need to subset data.frame here - very slow on large data.frames
  #anyway to speed this up?
  df_subset_for_analysis <- merge(df, att_sub)
}

【Comments】:

    Tags: r dataframe data.table subset


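    【Answer 1】:

    I would set data.table keys on the columns you want to subset on, then build a data.table of the combinations you are interested in (at run time), and then merge the two.

    Here is an example with a single attribute combination (simple_combinations) and one with multiple attribute combinations (multiple_combinations):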

    require(data.table)
    df <- structure(list(att1 = c("e", "a", "c", "a", "d", "e", "a", "d", "b", "a", "c", "a", "b", "e", "e", "c", "d", "d", "a", "e", "b"), 
                     att2 = c("b", "d", "c", "a", "e", "c", "e", "d", "e", "b", "e", "e", "c", "e", "a", "a", "e", "c", "b", "b", "d"), 
                     att3 = c("c", "b", "e", "b", "d", "d", "d", "c", "c", "d", "e", "a", "d", "c", "e", "a", "d", "e", "d", "a", "e"), 
                     att4 = c("c", "a", "b", "a", "e", "c", "a", "a", "b", "a", "a", "e", "c", "d", "b", "e", "b", "d", "d", "b", "e")), 
                .Names = c("att1", "att2", "att3", "att4"), class = "data.frame", row.names = c(NA, -21L))
    
    # Convert to data.table
    dt <- data.table(df)
    # Set key on the columns used for "subsetting"
    setkey(dt, att1, att2, att3, att4)
    
    # Simple subset on a single set of attributes
    simple_combinations <- data.table(att1 = "d", att2 = "e", att3 = "d", att4 = "e")
    setkey(simple_combinations, att1, att2, att3, att4)
    # Merge to generate simple output subset (simple_combinations of att present in dt)
    simple_subset <- merge(dt, simple_combinations)
    
    # Complex (multiple) sets of attributes
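    # stringsAsFactors = FALSE keeps the columns character, matching the types in dt
    multiple_combinations <- data.table(expand.grid(att1 = c("d"), att2 = c("c", "d", "e"),
      att3 = c("d"), att4 = c("b", "e"), stringsAsFactors = FALSE))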
    setkey(multiple_combinations, att1, att2, att3, att4)
    # Merge to generate  output subset (multiple_combinations of att present in dt)
    multiple_subset <- merge(dt, multiple_combinations)
    

    The output is in simple_subset and multiple_subset.
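The question's attribute sets are jagged: each row of att_combos specifies a different subset of columns and leaves the rest as NA. The same join idea can be applied one row at a time by joining only on the columns that carry a value. This is a minimal sketch, not the answer author's code; it assumes a data.table version that supports the `on` argument (1.9.6 or later), and both the helper name `subset_by_attrs` and the small example table are made up for illustration.

```r
library(data.table)

# Small illustrative table (not the question's data)
dt <- data.table(
  att1 = c("a", "a", "b", "d", "d"),
  att2 = c("b", "b", "c", "e", "e"),
  att3 = c("d", "c", "d", "d", "d"),
  att4 = c("a", "d", "c", "e", "b")
)

# Hypothetical helper: keep the rows of `dt` that match the non-NA
# attributes of a one-row table `att_row`
subset_by_attrs <- function(dt, att_row) {
  # Names of the columns that actually carry a value in this combination
  is_missing <- vapply(att_row, function(col) all(is.na(col)), logical(1))
  used <- names(att_row)[!is_missing]
  # Join on just those columns; nomatch = 0 drops combinations absent from dt
  dt[att_row[, used, with = FALSE], on = used, nomatch = 0]
}

# One jagged combination: only att1 and att3 are specified
att_row <- data.table(att1 = "d", att2 = NA_character_,
                      att3 = "d", att4 = NA_character_)
res <- subset_by_attrs(dt, att_row)
print(res)
```

Inside the question's loop, a call such as `subset_by_attrs(dt, att_combos[i, ])` would stand in for `merge(df, att_sub)`. Whether it is faster on a given data set is something to measure, not assume.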

    【Comments】:

    • FYI, data.table has its own variant of expand.grid, CJ (though I'm not sure what the trade-offs between the two are). Also, you can join with on without setting keys: simple = dt[.("d","e","d","e"), on=paste0("att",1:4), nomatch=0]; mult = dt[CJ(att1 = "d", att2 = c("c","d","e"), att3 = "d", att4 = c("b","e")), on=paste0("att",1:4), nomatch=0]
    • @Frank Thanks - today I learned something new about the amazing data.table package.
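The keyless join from the first comment can be tried on a small table. This is a minimal sketch under the same assumption as the comment (a data.table version that supports `on`); the example table and the variable names `cols` and `wanted` are made up for illustration.

```r
library(data.table)

# Small illustrative table (not the question's data)
dt <- data.table(
  att1 = c("d", "d", "d", "a"),
  att2 = c("e", "c", "d", "e"),
  att3 = c("d", "d", "d", "d"),
  att4 = c("e", "b", "a", "e")
)
cols <- paste0("att", 1:4)

# CJ() builds the cross product of its arguments, like expand.grid(),
# but returns a data.table and keeps character vectors as character
wanted <- CJ(att1 = "d", att2 = c("c", "d", "e"), att3 = "d", att4 = c("b", "e"))

# Ad hoc join on the named columns; no setkey() call on dt is needed
mult <- dt[wanted, on = cols, nomatch = 0]

# A single combination passed inline with .()
simple <- dt[.("d", "e", "d", "e"), on = cols, nomatch = 0]

print(mult)
print(simple)
```

Here `wanted` has six rows (1 x 3 x 1 x 2 combinations), and only the combinations that occur in `dt` survive the join because of `nomatch = 0`.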