【Question Title】: Find the intersection of grouped variables of a data frame in R [duplicate]
【Posted】: 2018-04-26 23:28:57
【Question】:

I have a data frame like this:

df <- data.frame(
  names = c(rep("cody", 10), rep("sam", 5)),
  year  = c(paste0("year",2000:2009), paste0("year",2000:2004))
)

I would like to get output like this:

df2 <- data.frame(
  names = c(rep("cody", 5), rep("sam", 5)), 
  year  = c(paste0("year",2000:2004), paste0("year",2000:2004))
)

Any ideas?

【Comments】:

Tags: r dataframe merge range overlap


【Solution 1】:

You can group by year and then keep the years that appear twice (or however many unique names you have):

library(dplyr)

df %>% 
  group_by(year) %>% 
  mutate(name_count = n()) %>%
  ungroup() %>% 
  filter(name_count == 2) %>% 
  select(-name_count)

   names year    
   <fct> <fct>   
 1 cody  year2000
 2 cody  year2001
 3 cody  year2002
 4 cody  year2003
 5 cody  year2004
 6 sam   year2000
 7 sam   year2001
 8 sam   year2002
 9 sam   year2003
10 sam   year2004
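
The hard-coded `2` above matches the two names in the example. As a minimal sketch of the "however many unique names" variant, the count can be derived from the data with `n_distinct()`. The third name `alex` and the object names `n_names` and `common` are assumptions added for illustration only; they are not part of the original question.

```r
library(dplyr)

# Example data; "alex" is a hypothetical third name added for illustration
df <- data.frame(
  names = c(rep("cody", 10), rep("sam", 5), rep("alex", 3)),
  year  = c(paste0("year", 2000:2009),
            paste0("year", 2000:2004),
            paste0("year", 2002:2004)),
  stringsAsFactors = FALSE
)

# Number of distinct names a year must be shared by
n_names <- n_distinct(df$names)

# Keep only the years in which every name occurs
common <- df %>%
  group_by(year) %>%
  filter(n_distinct(names) == n_names) %>%
  ungroup()

common
```

Counting distinct names rather than rows also keeps the filter correct if a name-year pair happens to appear more than once.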

【Comments】:

    【Solution 2】:

    Here is a base R method with Reduce and intersect.

    # %in% tests membership; == would recycle the shorter vector and match only by coincidence
    dat[dat$year %in% Reduce(intersect, split(dat$year, dat$names)),]
    

    This returns

      names     year
    1   cody year2000
    2   cody year2001
    3   cody year2002
    4   cody year2003
    5   cody year2004
    11   sam year2000
    12   sam year2001
    13   sam year2002
    14   sam year2003
    15   sam year2004
    

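    Here, Reduce repeatedly feeds its arguments (the years for each name, supplied as a list by split) to intersect, dropping the non-matching years until only the years that apply to all names remain.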
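
The folding described above can be sketched step by step. The list `years_by_name` stands in for what `split()` would return, and the third entry `alex` is a hypothetical addition used only to show a second folding step.

```r
# Per-name year vectors, shaped like the output of split(dat$year, dat$names)
years_by_name <- list(
  cody = paste0("year", 2000:2009),
  sam  = paste0("year", 2000:2004),
  alex = paste0("year", 2002:2006)  # assumed third name, not in the original data
)

# Reduce(intersect, list(a, b, c)) computes intersect(intersect(a, b), c)
step1 <- intersect(years_by_name$cody, years_by_name$sam)  # years shared by cody and sam
step2 <- intersect(step1, years_by_name$alex)              # ...and also by alex

common_years <- Reduce(intersect, years_by_name)

# The one-liner and the manual folding agree
identical(common_years, step2)
```

Because each step can only shrink the running result, the order of the names in the list does not change which years survive.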

    Note that the year variable must be a character vector, not a factor.
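
Following that note, here is a minimal sketch of converting factor columns before intersecting. It assumes the data arrived as factors, which was the default for `data.frame()` before R 4.0.0 (`stringsAsFactors = TRUE`); the object names `df_fct`, `df_chr` and `result` are illustrative.

```r
# Same data as the question, with strings deliberately stored as factors
df_fct <- data.frame(
  names = c(rep("cody", 10), rep("sam", 5)),
  year  = c(paste0("year", 2000:2009), paste0("year", 2000:2004)),
  stringsAsFactors = TRUE
)

# Convert the factor columns to character vectors first
df_chr <- df_fct
df_chr$names <- as.character(df_chr$names)
df_chr$year  <- as.character(df_chr$year)

# Then intersect the per-name years and keep the matching rows
common_years <- Reduce(intersect, split(df_chr$year, df_chr$names))
result <- df_chr[df_chr$year %in% common_years, ]

result
```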

    As a small simplification, you can use with to cut down on the dat$ references:

    dat[with(dat, year %in% Reduce(intersect, split(year, names))),]
    

    Data

    dat <- 
    structure(list(names = c("cody", "cody", "cody", "cody", "cody", 
    "cody", "cody", "cody", "cody", "cody", "sam", "sam", "sam", 
    "sam", "sam"), year = c("year2000", "year2001", "year2002", "year2003", 
    "year2004", "year2005", "year2006", "year2007", "year2008", "year2009", 
    "year2000", "year2001", "year2002", "year2003", "year2004")),
    .Names = c("names", "year"), row.names = c(NA, -15L), class = "data.frame")
    

    【Comments】:

      【Solution 3】:

      Here is an option that finds all duplicates in the year column.

      df[duplicated(df$year) | duplicated(df$year, fromLast = TRUE), ]
      #    names     year
      # 1   cody year2000
      # 2   cody year2001
      # 3   cody year2002
      # 4   cody year2003
      # 5   cody year2004
      # 11   sam year2000
      # 12   sam year2001
      # 13   sam year2002
      # 14   sam year2003
      # 15   sam year2004
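
One thing to keep in mind: `duplicated()` keeps any year that occurs at least twice, which equals "shared by all names" only when there are exactly two names. The sketch below uses a hypothetical third name `alex` (not in the original data) to show the difference, and a base R count of distinct names per year as one possible alternative.

```r
# Example data with an assumed third name for illustration
df3 <- data.frame(
  names = c(rep("cody", 10), rep("sam", 5), rep("alex", 3)),
  year  = c(paste0("year", 2000:2009),
            paste0("year", 2000:2004),
            paste0("year", 2002:2004)),
  stringsAsFactors = FALSE
)

# duplicated() keeps year2000 and year2001, although alex does not have them
dup_rows <- df3[duplicated(df3$year) | duplicated(df3$year, fromLast = TRUE), ]

# Alternative: count distinct names per year and require all of them
names_per_year <- tapply(df3$names, df3$year, function(x) length(unique(x)))
keep_years <- names(names_per_year)[names_per_year == length(unique(df3$names))]
all_rows <- df3[df3$year %in% keep_years, ]

nrow(dup_rows)  # rows in years shared by at least two names
nrow(all_rows)  # rows in years shared by every name
```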
      

      【Comments】:
