Subset data.frame based on threshold number of value combinations: answers

【Question title】: Subset data.frame based on threshold number of value combinations
【Posted】: 2015-06-22 15:18:27
【Question description】:

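I want to subset my data.frame so that it keeps only the rows whose combination of values occurs >= 4 times in the data frame. In this example I would want only rows 1, 2, 6 and 7, because the combination IR, IR_OSR, 2 and hello occurs 4 times.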

> DB[1:7,c("MegaSite","General.location","ID","call.type")]
  MegaSite General.location ID call.type
1       IR           IR_OSR  2     hello
2       IR           IR_OSR  2     hello
3       IR           IR_OSR  M         x
4       IR           IR_OSR  M         x
5       IR           IR_OSR  M         z
6       IR           IR_OSR  2     hello
7       IR           IR_OSR  2     hello
> dim(DB)
[1] 25434    76

I have tried the following code, suggested in another recent question (Finding value pairs that occur more than once in a data.table in R):

>DB[,.N>3 , list("MegaSite","General.location","ID","call.type")]

But I get this error:

Error in drop && !has.j : invalid 'x' type in 'x && y'

Here is a link to a larger sample data set that contains only the relevant columns of my actual data set: DB_IRsample.txt

【Comments】:

Tags: r dataframe

【Solution 1】:

Try this code (here `r` appears to be the data frame that the question calls DB):

> require(plyr)
> result <- ddply(r,.(MegaSite,General.location,ID,call.type),nrow)
> result <- result[result$V1 >= 4, ]
> result
  MegaSite General.location ID call.type V1
1       IR           IR_OSR  2     hello  4

You can then merge the original data against this result to filter out the rows whose combination does not occur at least 4 times:

> merge(r, result, all.y=TRUE, by=c("MegaSite", "General.location", "ID", "call.type"))
  MegaSite General.location ID call.type V1
1       IR           IR_OSR  2     hello  4
2       IR           IR_OSR  2     hello  4
3       IR           IR_OSR  2     hello  4
4       IR           IR_OSR  2     hello  4
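
If you would rather avoid the extra package and the merge step, the same filter can be sketched in base R with `ave()`. This is only a minimal sketch built on the seven sample rows from the question; the names `key_cols`, `key`, `n` and `DB2` are made up for illustration. Unlike the merge, it keeps the original row order and adds no count column.

```r
# Sample rows copied from the question; DB stands in for the full data set.
DB <- data.frame(
  MegaSite         = rep("IR", 7),
  General.location = rep("IR_OSR", 7),
  ID               = c("2", "2", "M", "M", "M", "2", "2"),
  call.type        = c("hello", "hello", "x", "x", "z", "hello", "hello"),
  stringsAsFactors = FALSE
)

# Columns whose combination of values defines a group (illustrative name).
key_cols <- c("MegaSite", "General.location", "ID", "call.type")

# One factor level per distinct combination of the key columns.
key <- interaction(DB[key_cols], drop = TRUE)

# For every row, the number of rows that share its combination.
n <- ave(seq_along(key), key, FUN = length)

# Keep rows whose combination occurs at least 4 times, in a new object.
DB2 <- DB[n >= 4, ]
DB2
```

On the sample data this should leave rows 1, 2, 6 and 7, which is what the question asks for.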

【Comments】:

Thanks, this works! I should note, though, that I needed to assign the result of the merge to a new object, e.g. DB2.
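
A note on the error in the question: it most likely arises because DB is a plain data.frame, not a data.table. In that case `[` treats the third argument as `drop`, so the list of column names ends up in the `drop && !has.j` check. If you want the data.table route, a minimal sketch could look like this. It assumes the data.table package is installed, and the names `DT`, `n` and `DB2` are chosen here for illustration.

```r
library(data.table)

# Sample rows copied from the question; DB stands in for the full data set.
DB <- data.frame(
  MegaSite         = rep("IR", 7),
  General.location = rep("IR_OSR", 7),
  ID               = c("2", "2", "M", "M", "M", "2", "2"),
  call.type        = c("hello", "hello", "x", "x", "z", "hello", "hello"),
  stringsAsFactors = FALSE
)

# Convert first: the .N / by syntax only works on a data.table.
DT <- as.data.table(DB)

# .N is the group size; grouping columns are written unquoted inside by = .().
DT[, n := .N, by = .(MegaSite, General.location, ID, call.type)]

# Keep rows whose combination occurs at least 4 times, in a new object.
DB2 <- DT[n >= 4]
DB2
```

The count column `n` stays in the result; drop it afterwards with `DB2[, n := NULL]` if it is not needed.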