
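【Question title】: Separate data frame depending on one column duplicates
【Posted】: 2023-04-03 13:48:01
【Question】:

I have a large data frame with many rows and columns. One column holds character values; some of them occur only once, others occur several times. I want to split the whole data frame into two: one containing every row whose value in this column is repeated, and one containing every row whose value occurs only once. For example: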

One = c(1,2,3,4,5,6,7,8,9,10)
Two = c(4,5,3,6,2,7,1,8,1,9)
Three = c("a", "b", "c", "d","d","e","f","e","g","c")
df <- data.frame(One, Two, Three)

> df
    One Two Three
1    1   4     a
2    2   5     b
3    3   3     c
4    4   6     d
5    5   2     d
6    6   7     e
7    7   1     f
8    8   8     e
9    9   1     g
10  10   9     c

I would like to end up with two data frames like these:

> dfSingle
    One Two Three
1    1   4     a
2    2   5     b
7    7   1     f
9    9   1     g

> dfMultiple
    One Two Three
3    3   3     c
4    4   6     d
5    5   2     d
6    6   7     e
8    8   8     e
10  10   9     c

I tried the duplicated() function:

dfSingle = subset(df, !duplicated(df$Three))
dfMultiple = subset(df, duplicated(df$Three))

But this does not work, because the first occurrence of "c", "d" and "e" each ends up in dfSingle. I also tried a for loop:

MulipleValues = unique(df$Three[c(which(duplicated(df$Three)))])
dfSingle = data.frame()
x = 1
dfMultiple = data.frame()
y = 1
for (i in 1:length(df$One)) {
  if(df$Three[i] %in% MulipleValues){
    dfMultiple[x,] = df[i,]
    x = x+1
    } else {
    dfSingle[y,] = df[i,]
    y = y+1
  }
}

It seems to do the right thing, in that the data frames now have the correct number of rows, but somehow they have 0 columns.

> dfSingle
data frame with 0 columns and 4 rows
> dfMultiple
data frame with 0 columns and 6 rows

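What am I doing wrong? Or is there another way to do this?

Thanks for your help!

【Comments】:

Tags: r dataframe duplicates subset

【Solution 1】:

In base R, we can use split together with duplicated, which returns a list of two data frames.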

df1 <- split(df, duplicated(df$Three) | duplicated(df$Three, fromLast = TRUE))
df1

#$`FALSE`
#  One Two Three
#1   1   4     a
#2   2   5     b
#7   7   1     f
#9   9   1     g

#$`TRUE`
#   One Two Three
#3    3   3     c
#4    4   6     d
#5    5   2     d
#6    6   7     e
#8    8   8     e
#10  10   9     c

Here df1[[1]] can be treated as dfSingle and df1[[2]] as dfMultiple.
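To make the two-way duplicated() trick easier to follow, here is a minimal sketch that builds the logical mask step by step and indexes the data frame with it directly. The names fwd, bwd and is_multiple are illustrative and do not come from the answer.

```r
# Example data from the question
One <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
Two <- c(4, 5, 3, 6, 2, 7, 1, 8, 1, 9)
Three <- c("a", "b", "c", "d", "d", "e", "f", "e", "g", "c")
df <- data.frame(One, Two, Three)

# duplicated() flags only the second and later occurrences of a value
fwd <- duplicated(df$Three)
# Scanning from the end flags every occurrence except the last one
bwd <- duplicated(df$Three, fromLast = TRUE)
# A row belongs to a repeated value if either scan flags it
is_multiple <- fwd | bwd

dfSingle <- df[!is_multiple, ]
dfMultiple <- df[is_multiple, ]
```

Indexing with a logical mask also sidesteps the loop from the question. data.frame() creates a frame with zero columns, so assigning rows into it has no columns to fill, which appears to be why the loop produced the right number of rows but 0 columns.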

【Discussion】:

【Solution 2】:

Here is a fun one using dplyr:

library(dplyr)

df %>% 
 group_by(Three) %>% 
 mutate(new = n() > 1) %>% 
 split(.$new)

which gives:

$`FALSE`
# A tibble: 4 x 4
# Groups:   Three [4]
    One   Two Three new  
  <dbl> <dbl> <fct> <lgl>
1     1     4 a     FALSE
2     2     5 b     FALSE
3     7     1 f     FALSE
4     9     1 g     FALSE

$`TRUE`
# A tibble: 6 x 4
# Groups:   Three [3]
    One   Two Three new  
  <dbl> <dbl> <fct> <lgl>
1     3     3 c     TRUE 
2     4     6 d     TRUE 
3     5     2 d     TRUE 
4     6     7 e     TRUE 
5     8     8 e     TRUE 
6    10     9 c     TRUE
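If the helper column new is not wanted in the result, a possible variation is to filter on the group size instead of splitting. This is a sketch rather than part of the original answer; it assumes dplyr is installed and produces two separate tibbles named after the data frames in the question.

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  One = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  Two = c(4, 5, 3, 6, 2, 7, 1, 8, 1, 9),
  Three = c("a", "b", "c", "d", "d", "e", "f", "e", "g", "c")
)

# Keep rows whose value of Three occurs exactly once
dfSingle <- df %>%
  group_by(Three) %>%
  filter(n() == 1) %>%
  ungroup()

# Keep rows whose value of Three occurs more than once
dfMultiple <- df %>%
  group_by(Three) %>%
  filter(n() > 1) %>%
  ungroup()
```

Both results keep the original three columns and the original row order. The ungroup() call drops the grouping so that later operations are not silently applied per group.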

【Discussion】:

【Solution 3】:

A dplyr way:

library(dplyr)

df %>%
  group_split(Duplicated = (add_count(., Three) %>% pull(n)) > 1)

Output:

[[1]]
# A tibble: 4 x 4
    One   Two Three Duplicated
  <dbl> <dbl> <fct> <lgl>     
1     1     4 a     FALSE     
2     2     5 b     FALSE     
3     7     1 f     FALSE     
4     9     1 g     FALSE     

[[2]]
# A tibble: 6 x 4
    One   Two Three Duplicated
  <dbl> <dbl> <fct> <lgl>     
1     3     3 c     TRUE      
2     4     6 d     TRUE      
3     5     2 d     TRUE      
4     6     7 e     TRUE      
5     8     8 e     TRUE      
6    10     9 c     TRUE

【Discussion】:

【Solution 4】:

You can do this with base R:

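# Rebuild the example data from the question
One = c(1,2,3,4,5,6,7,8,9,10)
Two = c(4,5,3,6,2,7,1,8,1,9)
Three = c("a", "b", "c", "d","d","e","f","e","g","c")
df <- data.frame(One, Two, Three)

str(df)

# In R < 4.0 data.frame() turns Three into a factor; convert it back to character
df$Three <- as.character(df$Three)
# For every row, count how many rows share its value of Three
df$count <- as.numeric(ave(df$Three,df$Three,FUN = length))

# Split on the count; both results keep the helper column count
dfSingle = subset(df,df$count == 1)
dfMultiple = subset(df,df$count > 1)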
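The ave() call above stores the counts in a character vector and converts them back with as.numeric(). A possible variation, sketched here under the assumption that the helper column should not end up in the results, is to count over seq_along() so the counts stay integers. The name n_per_value is illustrative.

```r
# Example data from the question
One <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
Two <- c(4, 5, 3, 6, 2, 7, 1, 8, 1, 9)
Three <- c("a", "b", "c", "d", "d", "e", "f", "e", "g", "c")
df <- data.frame(One, Two, Three)

# For every row, the number of rows that share its value of Three.
# ave() applies length() within each group and returns one value per row.
n_per_value <- ave(seq_along(df$Three), df$Three, FUN = length)

# Use the counts as a mask without adding a column to df
dfSingle <- df[n_per_value == 1, ]
dfMultiple <- df[n_per_value > 1, ]
```

Because the grouping variable is only used to form the groups, this version should behave the same whether Three is a character vector or a factor.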

【Discussion】: