Vectorizing a for-loop that eliminates duplicate data in an R data frame: answers

【Question title】: Vectorizing a for-loop that eliminates duplicate data in dataframe R
【Posted】: 2017-09-01 23:19:34
【Question】:

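I'm working on a difficult data manipulation problem in R. I currently solve it with a for-loop, but I'd like to vectorize it so that it scales better. I have the following data frame to work with: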

dput(mydf)
structure(list(team_id = c(14L, 14L, 7L, 7L, 21L, 21L, 15L, 15L
), opp_team_id = c(7L, 7L, 14L, 14L, 15L, 15L, 21L, 21L), pg = c(3211L, 
3211L, 786L, 786L, 3914L, 644L, 1524L, 593L), sg = c(653L, 4122L, 
1512L, 1512L, 2593L, 10L, 54L, 54L), sf = c(4122L, 1742L, 2347L, 
2347L, 1352L, 3378L, 2843L, 1062L), pf = c(1742L, 886L, 79L, 
1134L, 687L, 1352L, 1376L, 1376L), c = c(3014L, 2604L, 2960L, 
2960L, 21L, 3216L, 1256L, 3017L), opp_pg = c(3982L, 3982L, 3211L, 
4005L, 1524L, 1524L, 3914L, 644L), opp_sg = c(786L, 2347L, 653L, 
653L, 54L, 802L, 2593L, 10L), opp_sf = c(1134L, 1134L, 4122L, 
1742L, 1062L, 1062L, 3105L, 3105L), opp_pf = c(183L, 183L, 1742L, 
886L, 3017L, 1376L, 3216L, 2135L), opp_c = c(2475L, 2960L, 3138L, 
3138L, 1256L, 3017L, 21L, 1957L)), .Names = c("team_id", "opp_team_id", 
"pg", "sg", "sf", "pf", "c", "opp_pg", "opp_sg", "opp_sf", "opp_pf", 
"opp_c"), row.names = c(NA, -8L), class = "data.frame")

mydf
  team_id opp_team_id   pg   sg   sf   pf    c opp_pg opp_sg opp_sf opp_pf opp_c
1      14           7 3211  653 4122 1742 3014   3982    786   1134    183  2475
2      14           7 3211 4122 1742  886 2604   3982   2347   1134    183  2960
3       7          14  786 1512 2347   79 2960   3211    653   4122   1742  3138
4       7          14  786 1512 2347 1134 2960   4005    653   1742    886  3138
5      21          15 3914 2593 1352  687   21   1524     54   1062   3017  1256
6      21          15  644   10 3378 1352 3216   1524    802   1062   1376  3017
7      15          21 1524   54 2843 1376 1256   3914   2593   3105   3216    21
8      15          21  593   54 1062 1376 3017    644     10   3105   2135  1957

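For the purposes of my problem, rows 3-4 and rows 7-8 of this data frame are duplicates. Rows 3-4 are copies of rows 1-2, and rows 7-8 are copies of rows 5-6. This is sports data: rows 3-4 are essentially rows 1 and 2 with team_id and opp_team_id switched, and the other 10 columns are (mostly) the same.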

Here is my for-loop for removing the duplicates. I think it is fairly creative, but it is still a for-loop:

indices = c(1)
TFSwitch = TRUE
for(i in 2:nrow(mydf)) {
  last_row = mydf$team_id[(i-1)]
  this_row = mydf$team_id[i]

  TFSwitch = ifelse(last_row != this_row, !TFSwitch, TFSwitch)  

  if(TFSwitch == TRUE) {
    indices = c(indices, i)
  }
}

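The loop walks down the team_id column row by row. Whenever the value differs from the previous row, it flips TFSwitch from TRUE to FALSE or vice versa. While TFSwitch is TRUE, the current row index is appended to a vector of indices I want to keep.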

I'd like to vectorize this; any ideas would be greatly appreciated!

【Comments】:

Tags: r vectorization

【Solution 1】:

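This is very similar to earlier questions about pairwise duplicate removal, e.g. (pair-wise duplicate removal from dataframe). Following a similar process, and adding a bit of merge() to get the indices back, you can do: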

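vars <- c("team_id","opp_team_id")

# order-independent key for each pairing: (larger id, smaller id)
mx <- do.call(pmax, mydf[vars])
mn <- do.call(pmin, mydf[vars])

# join the id columns (plus a row index) against the first row of each
# pairing, then return the indices of all matching rows
merge(
  cbind(mydf[vars], ind=seq_len(nrow(mydf))),
  mydf[!duplicated(data.frame(mx,mn)), vars]
)[,"ind"]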

# [1] 1 2 5 6
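
To make the intermediate steps of the answer above easier to follow, here is a minimal base-R sketch that reaches the same indices without merge(). It rebuilds only the two id columns from the question's mydf. The names ids, first, key and keep are illustrative and do not appear in the original answer, and the paste()/%in% match is a swapped-in alternative to the merge, not the answerer's method.

```r
# Only the two id columns from the question's mydf are needed here
ids <- data.frame(
  team_id     = c(14L, 14L, 7L, 7L, 21L, 21L, 15L, 15L),
  opp_team_id = c(7L, 7L, 14L, 14L, 15L, 15L, 21L, 21L)
)

# Order-independent key for each pairing: (larger id, smaller id)
mx <- pmax(ids$team_id, ids$opp_team_id)
mn <- pmin(ids$team_id, ids$opp_team_id)

# TRUE for the first row in which each unordered pairing appears
first <- !duplicated(data.frame(mx, mn))

# Keep every row whose ordered (team_id, opp_team_id) pair matches
# one of those first rows (assumption: string keys stand in for merge())
key  <- paste(ids$team_id, ids$opp_team_id)
keep <- which(key %in% key[first])
keep
```

On this sample, first is TRUE only for rows 1 and 5, and matching on the ordered pair then pulls in rows 2 and 6 as well.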

【Comments】:

Thank you. This actually uncovered a bug in my script: when one team ID changed but the other did not, my for-loop failed to pick up those indices.
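
For comparison, the toggle logic of the question's loop can itself be written without a loop. The following is only a sketch: the names changed, run_id and indices_vec are illustrative, and because it looks at team_id alone, like the loop, it presumably carries over the limitation described in the comment above.

```r
# team_id column from the question's mydf
team_id <- c(14L, 14L, 7L, 7L, 21L, 21L, 15L, 15L)

# TRUE wherever team_id differs from the previous row;
# the first row always starts a new run
changed <- c(TRUE, diff(team_id) != 0)

# Number the consecutive runs of identical team_id values
run_id <- cumsum(changed)

# The loop's switch is TRUE during odd-numbered runs, so keep those rows
indices_vec <- which(run_id %% 2 == 1)
indices_vec
```

Each change in team_id flips the switch once, so the switch state is just the parity of the run number, which cumsum() gives for all rows at once.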

【Solution 2】:

Here is the same solution using data.table. My understanding is that you want to remove the pairwise duplicates, not just find the unique indices.

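library(data.table)
setDT(mydf)  # converts mydf to a data.table by reference
# id1/id2: order-independent key for each pairing
mydf[,c("id1","id2"):=list(pmax(team_id,opp_team_id),pmin(team_id,opp_team_id))]
# key on the ordered ids, then join against the first row of each pairing
setkey(mydf,team_id,opp_team_id)[unique(mydf,by=c("id1","id2"))]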
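
If only the row indices are needed, in the original row order, one possible variant is sketched below. It assumes the data.table package is installed and uses just the two id columns from the question's mydf. The names dt, first and keep are illustrative. It skips setkey(), which would reorder the rows, and instead joins with on= and which = TRUE to get row numbers back.

```r
library(data.table)

# Only the two id columns from the question's mydf are needed here
dt <- data.table(
  team_id     = c(14L, 14L, 7L, 7L, 21L, 21L, 15L, 15L),
  opp_team_id = c(7L, 7L, 14L, 14L, 15L, 15L, 21L, 21L)
)

# Order-independent pairing key, as in the answer above
dt[, c("id1", "id2") := list(pmax(team_id, opp_team_id),
                             pmin(team_id, opp_team_id))]

# First row of each unordered pairing, reduced to its ordered id columns
first <- unique(dt, by = c("id1", "id2"))[, .(team_id, opp_team_id)]

# which = TRUE returns the row numbers of dt matched by the join
keep <- dt[first, on = c("team_id", "opp_team_id"), which = TRUE]
keep
```

Because no key is set, dt keeps its original order, so the returned indices refer to the same rows as in the question's data frame.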

【Comments】: