【Question title】: Separate data frame depending on one column duplicates
【Posted】: 2023-04-03 13:48:01
【Description】:

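I have a large data frame with many rows and columns. One column holds character values, some of which occur only once and others several times. I want to split the whole data frame into two: one containing every row whose value in that column is repeated, and one containing every row whose value occurs only once. For example: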

One = c(1,2,3,4,5,6,7,8,9,10)
Two = c(4,5,3,6,2,7,1,8,1,9)
Three = c("a", "b", "c", "d","d","e","f","e","g","c")
df <- data.frame(One, Two, Three)

> df
    One Two Three
1    1   4     a
2    2   5     b
3    3   3     c
4    4   6     d
5    5   2     d
6    6   7     e
7    7   1     f
8    8   8     e
9    9   1     g
10  10   9     c

I would like to end up with two data frames like these:

> dfSingle
    One Two Three
1    1   4     a
2    2   5     b
7    7   1     f
9    9   1     g

> dfMultiple
    One Two Three
3    3   3     c
4    4   6     d
5    5   2     d
6    6   7     e
8    8   8     e
10  10   9     c

I tried the duplicated() function:

dfSingle = subset(df, !duplicated(df$Three))
dfMultiple = subset(df, duplicated(df$Three))

But that does not work, because the first occurrence of each of "c", "d" and "e" ends up in dfSingle. I also tried a for loop:

MulipleValues = unique(df$Three[c(which(duplicated(df$Three)))])
dfSingle = data.frame()
x = 1
dfMultiple = data.frame()
y = 1
for (i in 1:length(df$One)) {
  if(df$Three[i] %in% MulipleValues){
    dfMultiple[x,] = df[i,]
    x = x+1
    } else {
    dfSingle[y,] = df[i,]
    y = y+1
  }
}

This seems to do the right thing, since the data frames now have the correct number of rows, but somehow they have 0 columns.

> dfSingle
data frame with 0 columns and 4 rows
> dfMultiple
data frame with 0 columns and 6 rows

What am I doing wrong? Or is there another way to do this?

Thanks for your help!

【Comments】:

    Tags: r dataframe duplicates subset


    【Solution 1】:

    In base R, we can use split together with duplicated, which returns a list of two data frames.

    df1 <- split(df, duplicated(df$Three) | duplicated(df$Three, fromLast = TRUE))
    df1
    
    #$`FALSE`
    #  One Two Three
    #1   1   4     a
    #2   2   5     b
    #7   7   1     f
    #9   9   1     g
    
    #$`TRUE`
    #   One Two Three
    #3    3   3     c
    #4    4   6     d
    #5    5   2     d
    #6    6   7     e
    #8    8   8     e
    #10  10   9     c
    

    Here df1[[1]] can be treated as dfSingle and df1[[2]] as dfMultiple.

    【Comments】:
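
As a minimal sketch of the same idea, the list elements can also be picked out by name rather than by position. The helper name `is_dup` and the list name `parts` are assumptions made here for illustration; the data is the example from the question.

```r
# Example data from the question
One   <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
Two   <- c(4, 5, 3, 6, 2, 7, 1, 8, 1, 9)
Three <- c("a", "b", "c", "d", "d", "e", "f", "e", "g", "c")
df <- data.frame(One, Two, Three)

# TRUE for every row whose value in Three occurs more than once:
# duplicated() flags all but the first occurrence, and fromLast = TRUE
# flags all but the last, so combining them with | flags every occurrence.
is_dup <- duplicated(df$Three) | duplicated(df$Three, fromLast = TRUE)

# split() names the list elements after the values of the logical vector
parts <- split(df, is_dup)
dfSingle   <- parts[["FALSE"]]
dfMultiple <- parts[["TRUE"]]
```

One caveat: if no value repeats (or every value repeats), split() returns a single element and the other lookup is NULL. Indexing directly with `df[!is_dup, ]` and `df[is_dup, ]` always yields a data frame, possibly with zero rows.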

      【Solution 2】:

      Here is a fun dplyr one,

      library(dplyr)
      
      df %>% 
       group_by(Three) %>% 
       mutate(new = n() > 1) %>% 
       split(.$new)
      

      which gives,

      $`FALSE`
      # A tibble: 4 x 4
      # Groups:   Three [4]
          One   Two Three new  
        <dbl> <dbl> <fct> <lgl>
      1     1     4 a     FALSE
      2     2     5 b     FALSE
      3     7     1 f     FALSE
      4     9     1 g     FALSE
      
      $`TRUE`
      # A tibble: 6 x 4
      # Groups:   Three [3]
          One   Two Three new  
        <dbl> <dbl> <fct> <lgl>
      1     3     3 c     TRUE 
      2     4     6 d     TRUE 
      3     5     2 d     TRUE 
      4     6     7 e     TRUE 
      5     8     8 e     TRUE 
      6    10     9 c     TRUE 
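
If two separate data frames are wanted rather than a list, a grouped filter is one possible variation. This is only a sketch, assuming dplyr is installed; it drops the helper column `new` and uses the variable names from the question.

```r
library(dplyr)

# Example data from the question
One   <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
Two   <- c(4, 5, 3, 6, 2, 7, 1, 8, 1, 9)
Three <- c("a", "b", "c", "d", "d", "e", "f", "e", "g", "c")
df <- data.frame(One, Two, Three)

# Keep rows whose value in Three occurs exactly once
dfSingle <- df %>%
  group_by(Three) %>%
  filter(n() == 1) %>%
  ungroup()

# Keep rows whose value in Three occurs more than once
dfMultiple <- df %>%
  group_by(Three) %>%
  filter(n() > 1) %>%
  ungroup()
```

Both results are tibbles, so the original row names (3, 4, 5, ...) are not kept; the rows stay in their original order.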
      

      【Comments】:

        【Solution 3】:

        The dplyr way:

        library(dplyr)
        
        df %>%
          group_split(Duplicated = (add_count(., Three) %>% pull(n)) > 1)
        

        Output:

        [[1]]
        # A tibble: 4 x 4
            One   Two Three Duplicated
          <dbl> <dbl> <fct> <lgl>     
        1     1     4 a     FALSE     
        2     2     5 b     FALSE     
        3     7     1 f     FALSE     
        4     9     1 g     FALSE     
        
        [[2]]
        # A tibble: 6 x 4
            One   Two Three Duplicated
          <dbl> <dbl> <fct> <lgl>     
        1     3     3 c     TRUE      
        2     4     6 d     TRUE      
        3     5     2 d     TRUE      
        4     6     7 e     TRUE      
        5     8     8 e     TRUE      
        6    10     9 c     TRUE   
        

        【Comments】:

          【Solution 4】:

          You can do this with base R:

          One = c(1,2,3,4,5,6,7,8,9,10)
          Two = c(4,5,3,6,2,7,1,8,1,9)
          Three = c("a", "b", "c", "d","d","e","f","e","g","c")
          df <- data.frame(One, Two, Three)
          
          str(df)
          
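          # Make sure Three is character (it may be a factor in older R versions)
          df$Three <- as.character(df$Three)
          # ave() returns the group size for each row, as character here, so convert it
          df$count <- as.numeric(ave(df$Three,df$Three,FUN = length))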
          
          dfSingle = subset(df,df$count == 1)
          dfMultiple = subset(df,df$count > 1)
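
On the "what am I doing wrong" part of the question: a likely reading of the output shown there is that `data.frame()` creates a frame with no columns, so `dfSingle[y,] = df[i,]` only adds rows and has no columns to fill. The loop is not needed in any case. The sketch below reuses the lookup idea from the question as a logical mask; the names `multiple_values` and `is_multiple` are assumptions chosen for illustration.

```r
# Example data from the question
One   <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
Two   <- c(4, 5, 3, 6, 2, 7, 1, 8, 1, 9)
Three <- c("a", "b", "c", "d", "d", "e", "f", "e", "g", "c")
df <- data.frame(One, Two, Three)

# Values of Three that occur more than once
multiple_values <- unique(df$Three[duplicated(df$Three)])

# One logical per row: does this row carry a repeated value?
is_multiple <- df$Three %in% multiple_values

# Subset all rows at once instead of copying them one by one
dfMultiple <- df[is_multiple, ]
dfSingle   <- df[!is_multiple, ]
```

Unlike the version with the count column, this leaves df unchanged and keeps the original row names in both results.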
          

          【Comments】:
