How to apply conditional duplicate removal for list of data.frame? Answers

【Question title】: How to apply conditional duplicate removal for list of data.frame?
【Posted】: 2017-05-13 01:05:21
【Question description】:

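I have a list of data.frames that needs a very specific kind of duplicate removal, and the removal condition is different for each data.frame. For the first list element I want complete duplicate removal. For the second list element I need to find the rows that appear more than twice (frequency > 2) and keep only one copy of each. For the third list element I need to find the rows that appear more than three times (frequency > 3) and keep two copies of each. I am looking for a more programmatic, dynamic solution to this data manipulation task. I gave it my best shot but could not get the output I want. Is there an efficient way to do this?

Reproducible data.frame list: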

myList <- list(
    bar= data.frame(start.pos=c(9,19,34,54,70,82,136,9,34,70,136,9,82,136),
                    end.pos=c(14,21,39,61,73,87,153,14,39,73,153,14,87,153),
                    pos.score=c(48,6,9,8,4,15,38,48,9,4,38,48,15,38)),
    cat = data.frame(start.pos=c(7,21,21,72,142,7,16,21,45,72,100,114,142,16,72,114),
                     end.pos=c(10,34,34,78,147,10,17,34,51,78,103,124,147,17,78,124),
                     pos.score=c(53,14,14,20,4,53,20,14,11,20,7,32,4,20,20,32)),
    foo= data.frame(start.pos=c(12,12,12,58,58,58,118,12,12,44,58,102,118,12,58,118),
                    end.pos=c(36,36,36,92,92,92,139,36,36,49,92,109,139,36,92,139),
                    pos.score=c(48,48,48,12,12,12,5,48,48,12,12,11,5,48,12,5))
)

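Because myList is the result of a custom function, the data.frames cannot be handled separately. I am looking for a more programmatic way to do this specific duplicate removal. How can it be done when the input is a list of data.frames?

My desired output is as follows: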

expectedList <- list(
    bar= data.frame(start.pos=c(9,19,34,54,70,82,136),
                    end.pos=c(14,21,39,61,73,87,153),
                    pos.score=c(48,6,9,8,4,15,38)),
    cat= data.frame(start.pos=c(7,21,72,142,7,16,45,100,114,142,16,114),
                    end.pos=c(10,34,78,147,10,17,51,103,124,147,17,124),
                    pos.score=c(53,14,20,4,53,20,11,7,32,4,20,32)),
    foo= data.frame(start.pos=c(12,12,44,58,58,118,102,118,118),
                    end.pos=c(36,36,49,92,92,139,109,139,139),
                    pos.score=c(48,48,12,12,12,5,11,5,5))
)

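Edit:

In the second data.frame, cat, I look for rows that appear three times and keep only one copy of each; rows that appear twice are left alone.

In the third data.frame, foo, I look for rows that appear more than three times and keep two identical copies of each. This is the kind of per-data.frame duplicate removal I am trying to do.

How can I get the list of data.frames I want? Thank you very much!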
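The rules in the edit can be read as a pair of numbers per list element: a frequency threshold and the number of copies to keep once that threshold is exceeded. A minimal sketch of that reading follows. The helper name copies_kept and its arguments are hypothetical, and the frequencies for foo were counted by hand from the data in the question.

```r
# Hypothetical helper: n is how often each distinct row occurs.
# Rows seen more than `threshold` times are cut down to `keep` copies;
# all other rows keep every copy they have.
copies_kept <- function(n, threshold, keep) ifelse(n > threshold, keep, n)

# Frequencies of the distinct rows in myList$foo, counted by hand:
# (12,36,48) x6, (58,92,12) x5, (118,139,5) x3, (44,49,12) x1, (102,109,11) x1
foo_freq <- c(6, 5, 3, 1, 1)

foo_kept <- copies_kept(foo_freq, threshold = 3, keep = 2)
foo_kept       # 2 2 3 1 1
sum(foo_kept)  # 9
```

Under this reading, (118, 139, 5) occurs exactly three times, does not exceed the threshold, and therefore keeps all three copies, which agrees with the nine rows of expectedList$foo.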

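【Question discussion】:

This is certainly doable, but there may be limits to how programmatic it can be unless the logic follows a clear pattern. I take it the pattern is that for each successive list item you want to increase the number of allowed duplicates by 1, right?
The expected output for foo looks wrong. (118, 139, 5) appears 3 times.
Not sure the expected output is correct. Maybe library(data.table); Map(function(x,y) setDT(x)[x[, .I[(1:.N)<=y] , .(start.pos, end.pos, pos.score)]$V1], myList, 1:3)
@Hack-R Yes, that is the pattern I tried. I am sure about expectedList. Is it possible to get my output list? Thanks
@akrun Yes, I just tested it and I think this should be an answer; Dan -- akrun's solution returns a list: out <- Map(function(x,y) setDT(x)[x[, .I[(1:.N)<=y] , .(start.pos, end.pos, pos.score)]$V1], myList, 1:3); class(out) gives "list"

Tags: r dataframe duplicates

【Solution 1】: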

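We can do this with Map, subsetting the rows of each list element with an index built from the corresponding number in a vector (1:3). Each data.frame in the list is converted to a data.table (setDT(x)) and grouped by the columns 'start.pos', 'end.pos' and 'pos.score'. Within each group we get the number of rows (.N) and use if/else to build the sequence of rows that satisfies the conditions in the OP's post, use .I to turn that sequence into row indices of the full table, extract that index column ($V1), and use it to subset the dataset.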

library(data.table)
res <- Map(function(x,y) setDT(x)[x[,  .I[if(.N > y) seq_len(pmax(y-1, 1)) 
        else seq_len(.N)]  , .(start.pos, end.pos, pos.score)]$V1], myList, 1:3)
sapply(res, nrow)
#bar cat foo 
#  7  12   9 

sapply(expectedList, nrow) 
#bar cat foo 
#  7  12   9
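To make the individual pieces of the expression above more concrete, here is a minimal sketch on a toy table. The table dt, its column grp and the limit of two rows per group are made up for illustration, and the data.table package is assumed to be installed.

```r
library(data.table)

# Toy table (hypothetical data): "a" occurs 3 times, "b" once, "c" twice
dt <- data.table(grp = c("a", "b", "a", "c", "a", "c"))

# .N is the number of rows in the current group
counts <- dt[, .N, by = grp]

# .I holds the row numbers, within the full table, of the current group.
# Indexing it with seq_len() keeps only the first few of those row numbers.
# Because the j expression is unnamed, the result lands in a column called V1.
idx <- dt[, .I[seq_len(min(.N, 2))], by = grp]$V1

# Subsetting with those row numbers keeps at most two rows per group
kept <- dt[idx]
```

Here .() in the answer is simply data.table's shorthand for list(), used to name the grouping columns; the toy example uses by = grp because there is only one column to group on.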

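【Discussion】:

Could I get a bit more explanation of this data.table solution? What are .N, .() and $V1 used for? I am new to the data.table package, and understanding your solution would help me follow your reasoning. Thanks :)

【Solution 2】:

Apply the following function to each data frame in the list, specifying the maximum frequency allowed for each row.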

removeDuplicate = function(df, freq=1) {

    # back up the dataframe and add a row id
    tmp = df;
    tmp$cnt = 1:NROW(df);
    # get each row frequency
    cnt = aggregate(cnt~., tmp, length);

    # merge the original data-frame and the row-frequency data-frame
    tmp = merge(df, cnt, by=names(df));
    tmp = rbind(
                tmp[tmp$cnt<=freq, names(df)], # keep all rows whose frequency does not exceed the maximum allowed
                cnt[cnt$cnt>freq, names(df)] # add each row that exceeds the maximum just once
            );

    return(tmp);

}

To apply the function to each data frame, I would do this:

expectedList = myList
maxFreq = c(1, 2, 3)
for(i in 1:length(expectedList)) {

    expectedList[[i]] = removeDuplicate(expectedList[[i]], maxFreq[i])

}

But I think a more elegant solution using lapply could be found...
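One possible shape for such a loop-free version is sketched below in base R, using Map rather than lapply so that a different threshold and number of kept copies can be paired with each list element. The helper dedup_rows, its arguments and the toy list are hypothetical. Rows are compared by pasting their columns into a key string, which is assumed to be adequate for the simple numeric columns in the question.

```r
# Hypothetical helper: rows occurring more than `threshold` times are reduced
# to their first `keep` copies; all other rows are left untouched.
dedup_rows <- function(df, threshold, keep) {
  # Collapse each row into one string so identical rows share a key
  key <- do.call(paste, c(df, sep = "\r"))
  # Size of the group each row belongs to
  n <- ave(seq_along(key), key, FUN = length)
  # Running position of each row within its group (1, 2, 3, ...)
  pos <- ave(seq_along(key), key, FUN = seq_along)
  df[n <= threshold | pos <= keep, , drop = FALSE]
}

# Toy input (made-up data): one rule per list element
toy <- list(
  first  = data.frame(a = c(1, 1, 2), b = c(5, 5, 6)),
  second = data.frame(a = c(1, 1, 1, 2, 2), b = c(5, 5, 5, 6, 6))
)
res <- Map(dedup_rows, toy, threshold = c(1, 2), keep = c(1, 1))
sapply(res, nrow)
```

For the data in the question, the corresponding call would presumably be Map(dedup_rows, myList, threshold = c(1, 2, 3), keep = c(1, 1, 2)). Unlike the function above, this keeps the surviving rows in their original order.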

【Discussion】:

【Solution 3】:

# Separate individual dataframes
bar = myList$bar 
cat = myList$cat
foo = myList$foo

# We will need ddply command of plyr package
library(plyr)

#Count how many times the rows have repeated and put the value in the fourth column (V1)
bar = ddply(bar,.(start.pos,end.pos,pos.score),nrow)
cat = ddply(cat,.(start.pos,end.pos,pos.score),nrow)
foo = ddply(foo,.(start.pos,end.pos,pos.score),nrow)

# For each data.frame, change the number of repetitions to the appropriate number of times
# if the rows have repeated for more than the desired number of times
# i.e 1 for bar, 2 for cat, and 3 for foo
for (i in 1:nrow(bar)){
if (bar$V1[i] > 1){
bar$V1[i] = 1
}}
for (i in 1:nrow(cat)){
if (cat$V1[i] > 2){
cat$V1[i] = 1
}}
for (i in 1:nrow(foo)){
if (foo$V1[i] > 3){
foo$V1[i] = 2
}}

# Repeat each row for the number of times indicated in the fourth column.
# This will be 1 for bar, up to 2 for cat, and up to 3 for foo
bar = bar[rep(row.names(bar), bar[,4]), 1:3]
cat = cat[rep(row.names(cat), cat[,4]), 1:3]
foo = foo[rep(row.names(foo), foo[,4]), 1:3]

# Set the rownames to NULL if desired
rownames(cat) = NULL
rownames(bar) = NULL
rownames(foo) = NULL

# Combine the individual data.frames into a new list
expectedList = list(bar = bar,cat = cat,foo = foo)
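The count, cap and expand steps above can also be written without explicit loops. The following is a minimal base R sketch on made-up data; it assumes aggregate is an acceptable stand-in for ddply, and the names toy, counts and out are hypothetical.

```r
# Toy data (hypothetical): (1, 5) occurs 3 times, (2, 6) occurs twice
toy <- data.frame(a = c(1, 1, 1, 2, 2), b = c(5, 5, 5, 6, 6))

# Count: one row per distinct (a, b) combination, with its frequency in column n
counts <- aggregate(n ~ a + b, data = transform(toy, n = 1), FUN = sum)

# Cap: combinations seen more than 2 times are reduced to 1 copy (vectorised)
counts$n <- ifelse(counts$n > 2, 1, counts$n)

# Expand: repeat each distinct row by its capped count
out <- counts[rep(seq_len(nrow(counts)), counts$n), c("a", "b")]
rownames(out) <- NULL
```

Wrapping these three steps in a function that takes the threshold and the number of copies as arguments would remove the need to split the list into separate variables.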

【Discussion】: