【Question Title】: Combine duplicate rows in dataframe and create new columns
【Posted】: 2018-09-17 07:16:54
【Question】:

I am trying to aggregate rows in a data frame that share some values and differ in others, as shown below:

  dataframe1 <- data.frame(Company_Name = c("KFC", "KFC", "KFC", "McD", "McD"),
                           Company_ID = c(1, 1, 1, 2, 2),
                           Company_Phone = c("237389", "-", "-", "237002", "-"),
                           Employee_Name = c("John", "Mary", "Jane", "Joshua", "Anne"),
                           Employee_ID = c(1001, 1002, 1003, 2001, 2002))

I want to merge the rows that share the same values and create new columns for the values that differ, like this:

   dataframe2 <- data.frame(Company_Name = c("KFC", "McD"), 
                     Company_ID = c(1,  2),
                     Company_Phone = c("237389", "237002"),
                     Employee_Name1 = c("John", "Joshua" ),
                     Employee_ID1 = c(1001, 2001),
                     Employee_Name2 = c("Mary", "Anne"),
                     Employee_ID2 = c(1002, 2002),
                     Employee_Name3 = c("Jane", "na"),
                     Employee_ID3 = c(1003, "na"))

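I have already looked at similar questions, such as "Combining duplicated rows in R and adding new column containing IDs of duplicates" and "R: collapse rows and then convert row into a new column", but I do not want the values separated by commas; I want new columns created instead.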

 # Company_Name Company_ID Company_Phone Employee_Name1 Employee_ID1 Employee_Name2 Employee_ID2 Employee_Name3 Employee_ID3
 #1          KFC          1        237389           John         1001           Mary         1002           Jane         1003
 #2          McD          2        237002         Joshua         2001           Anne         2002             na           na

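Thank you in advance.

【Comments】:

  • @R noob Do you want to merge these two data frames? Please show your expected output.
  • @SORIFHOSSAINSHUJON I have edited my question to include the output.

Tags: r dataframe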


【Solution 1】:

A solution using tidyverse. dat is the final output.

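library(tidyverse)

dat <- dataframe1 %>%
  # Work with character columns rather than factors
  mutate_if(is.factor, as.character) %>%
  # Treat the "-" placeholder as missing, then carry the last seen phone number down.
  # fill() runs before grouping, so this relies on each company's first row having the number.
  mutate(Company_Phone = ifelse(Company_Phone %in% "-", NA, Company_Phone)) %>%
  fill(Company_Phone) %>%
  # Number the employees within each company
  group_by(Company_ID) %>%
  mutate(ID = 1:n()) %>%
  # Reshape to long, append the employee number to the column name, reshape to wide
  gather(Info, Value, starts_with("Employee_")) %>%
  unite(New_Col, Info, ID, sep = "") %>%
  spread(New_Col, Value) %>%
  # Put the columns in the desired order
  select(c("Company_Name", "Company_ID", "Company_Phone",
           paste0(rep(c("Employee_ID", "Employee_Name"), 3), rep(1:3, each = 2)))) %>%
  ungroup()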

# View the result
dat %>% as.data.frame(stringsAsFactors = FALSE)
#   Company_Name Company_ID Company_Phone Employee_ID1 Employee_Name1 Employee_ID2 Employee_Name2 Employee_ID3 Employee_Name3
# 1          KFC          1        237389         1001           John         1002           Mary         1003           Jane
# 2          McD          2        237002         2001         Joshua         2002           Anne         <NA>           <NA>
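
For readers on a recent tidyr, the same reshaping can be sketched with pivot_wider(), which supersedes gather()/spread(). This is only a minimal sketch and not part of the original answer: it assumes tidyr 1.2.0 or later (for the names_vary argument) and dplyr 1.0 or later, and dat_wide is a name chosen here for illustration.

```r
library(dplyr)
library(tidyr)

# Same input as in the question; stringsAsFactors = FALSE keeps text columns as character
dataframe1 <- data.frame(
  Company_Name  = c("KFC", "KFC", "KFC", "McD", "McD"),
  Company_ID    = c(1, 1, 1, 2, 2),
  Company_Phone = c("237389", "-", "-", "237002", "-"),
  Employee_Name = c("John", "Mary", "Jane", "Joshua", "Anne"),
  Employee_ID   = c(1001, 1002, 1003, 2001, 2002),
  stringsAsFactors = FALSE
)

dat_wide <- dataframe1 %>%
  # Turn the "-" placeholder into NA so fill() can replace it
  mutate(Company_Phone = na_if(Company_Phone, "-")) %>%
  group_by(Company_ID) %>%
  # Fill inside each company, so a number never leaks into another company
  fill(Company_Phone) %>%
  # Number the employees within each company
  mutate(ID = row_number()) %>%
  ungroup() %>%
  # One column per (Employee_* column, employee number) pair
  pivot_wider(
    names_from  = ID,
    values_from = c(Employee_Name, Employee_ID),
    names_sep   = "",
    names_vary  = "slowest"  # assumption: tidyr >= 1.2.0; groups columns by employee number
  )
```

Unlike the gather()/spread() route, this keeps Employee_ID numeric, because the name and ID columns are never stacked into a single character column.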

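【Comments】:

    【Solution 2】:

    We can do this with dcast from data.table, which accepts multiple value.var columns. Convert the 'data.frame' to a 'data.table' (setDT(dataframe1)); grouped by 'Company_Name', replace the 'Company_Phone' elements with the first value in the group (the alphanumeric string); then dcast from 'long' to 'wide', specifying 'Employee_Name' and 'Employee_ID' as the value.var columns.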

    library(data.table)
    setDT(dataframe1)[, Company_Phone := first(Company_Phone), Company_Name]
    res <- dcast(dataframe1, Company_Name + Company_ID + Company_Phone ~ 
           rowid(Company_Name), value.var  = c("Employee_Name", "Employee_ID"), sep='')
    

    Output:

    res
    #Company_Name Company_ID Company_Phone Employee_Name1 Employee_Name2 Employee_Name3 Employee_ID1 Employee_ID2 Employee_ID3
    #1:          KFC          1        237389           John           Mary           Jane         1001         1002         1003
    #2:          McD          2        237002         Joshua           Anne             NA         2001         2002           NA
    

    If we need the columns ordered by employee number:

    res[, c(1:3, order(as.numeric(sub("\\D+", "", names(res)[-(1:3)]))) + 3), with = FALSE]
    #   Company_Name Company_ID Company_Phone Employee_Name1 Employee_ID1 Employee_Name2 Employee_ID2 Employee_Name3 Employee_ID3
    #1:          KFC          1        237389           John         1001           Mary         1002           Jane         1003
    #2:          McD          2        237002         Joshua         2001           Anne         2002             NA           NA
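
The right-hand side of the dcast formula above depends on rowid(), which numbers the rows within each run of identical values. As a small illustrative sketch (the vectors company and ids are names made up here, not part of the answer), this is the per-company counter that becomes the numeric suffix of the new columns:

```r
library(data.table)

# Hypothetical stand-in for the Company_Name column of dataframe1
company <- c("KFC", "KFC", "KFC", "McD", "McD")

# rowid() restarts the count at 1 for every distinct value
ids <- rowid(company)
ids
```

Each employee therefore gets a position within their company, and dcast spreads Employee_Name and Employee_ID into one column per position.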
    

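    【Comments】:

    • Thank you very much... although the output is slightly different.
    • @Rnoob Do you mean the order of the columns? That is easy to fix.
    • Yes, the order of the columns. Could you include that in your answer, please?
    【Solution 3】: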

    Here is another approach that combines dplyr and cSplit

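    library(dplyr)
    # Collapse every non-grouping column into one comma-separated string per company.
    # A formula lambda is used in place of funs(), which is deprecated since dplyr 0.8.0.
    dataframe1 <- dataframe1 %>%
      group_by(Company_Name, Company_ID) %>%
      summarise_all(~ paste(., collapse = ","))
    
    library(splitstackshape)
    # Split each collapsed string back out into numbered columns
    dataframe1 <- cSplit(dataframe1, c("Company_Phone", "Employee_Name", "Employee_ID"), ",")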
    
    dataframe1
    #   Company_Name Company_ID Company_Phone_1 Company_Phone_2 Company_Phone_3 Employee_Name_1 Employee_Name_2 Employee_Name_3 Employee_ID_1 Employee_ID_2 Employee_ID_3
    #1:          KFC          1          237389               -               -            John            Mary            Jane          1001          1002          1003
    #2:          McD          2          237002               -              NA          Joshua            Anne              NA          2001          2002            NA
    

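    【Comments】:

    • Thank you very much... the output is slightly different, but it does get the job done.
    • You're welcome. The advantage here is that you do not need to specify how many columns to create. As @akrun said, the column order can easily be fixed.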