【Question title】: Transform duplicate rows to columns
【Posted】: 2022-01-20 09:30:08
【Question description】:

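I am working with a database that contains hundreds of variables, but because it originates from JSON, I am having trouble organizing it. For example, instead of putting information into columns, the file creates new rows. See the example.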

df1 <- data.frame(ID = c(111,111,111,111,111,111,222,222,333),
                  NAME = c('JOHN','JOHN','MARY','MARY','JAMES','JAMES','WILL','WILL','MARK'),
                  ADRESS = c('NY','NY','NY','NY','ROMA','ROMA','LONDON','TOKYO',''),
                  COLOR = c('GREEN','GREEN','RED','RED','YELLOW','YELLOW','BLUE','BLUE','ORANGE'),
                  CAR = c('','','BMW','BMW','TRUCK','TRUCK','FORD','FORD','FERRARI'),
                  COUNTRY = c('USA','USA','USA','USA','USA','USA','USA','USA','USA'))

I would like to organize the file so that it is grouped by ID, as in the example below:

df2 <- data.frame(ID = c(111,222,333),
                  NAME1 = c('JOHN','WILL','MARK'),
                  NAME2 = c('MARY','',''),
                  NAME3 = c('JAMES','',''),
                  ADRESS1 = c('NY','LONDON',''),
                  ADRESS2 = c('NY','TOKYO',''),
                  ADRESS3 = c('ROMA','',''),
                  COLOR1 = c('GREEN','BLUE','ORANGE'),
                  COLOR2 = c('RED','',''),
                  COLOR3 = c('YELLOW','',''),
                  CAR1 = c('','FORD','FERRARI'),
                  CAR2 = c('BMW','',''),
                  CAR3 = c('TRUCK','',''),
                  COUNTRY = c('USA','USA','USA'))

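However, note that the COUNTRY variable does not need multiple columns (COUNTRY1, COUNTRY2, COUNTRY3), because the values would just repeat. My original file has many cases like this. How can I arrange the data uniformly, as in df2?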

【Discussion】:

    Tags: r reshape


    【Solution 1】:

    Perhaps we can try the following base R code, using reshape

    u <- reshape(
      transform(
        unique(df1),
        GRP = ave(seq_along(ID), ID, FUN = seq_along)
      ),
      direction = "wide",
      idvar = "ID",
      timevar = "GRP"
    )
    
    u[order(match(gsub("\\.\\d+", "", names(u)), names(df1)))]
    

    which gives

    > u[order(match(gsub("\\.\\d+", "", names(u)), names(df1)))]
       ID NAME.1 NAME.2 NAME.3 ADRESS.1 ADRESS.2 ADRESS.3 COLOR.1 COLOR.2 COLOR.3
    1 111   JOHN   MARY  JAMES       NY       NY     ROMA   GREEN     RED  YELLOW
    7 222   WILL   WILL   <NA>   LONDON    TOKYO     <NA>    BLUE    BLUE    <NA>
    9 333   MARK   <NA>   <NA>              <NA>     <NA>  ORANGE    <NA>    <NA>
        CAR.1 CAR.2 CAR.3 COUNTRY.1 COUNTRY.2 COUNTRY.3
    1           BMW TRUCK       USA       USA       USA
    7    FORD  FORD  <NA>       USA       USA      <NA>
    9 FERRARI  <NA>  <NA>       USA      <NA>      <NA>
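
To see how the `GRP` helper column drives the reshape, the intermediate index can be inspected on its own. This is only a minimal sketch on a cut-down table; the name `demo` and its contents are illustrative assumptions, not part of the original answer.

```r
# Toy data (hypothetical): three rows for ID 111, two rows for ID 222
demo <- data.frame(ID = c(111, 111, 111, 222, 222),
                   NAME = c("JOHN", "MARY", "JAMES", "WILL", "WILL"),
                   ADRESS = c("NY", "NY", "ROMA", "LONDON", "TOKYO"))

# ave() applies seq_along() within each ID, so every ID gets its own
# running index 1, 2, 3, ...
demo$GRP <- ave(seq_along(demo$ID), demo$ID, FUN = seq_along)
print(demo$GRP)

# reshape() uses GRP as the "time" variable: every non-id column is
# spread into one column per index value (NAME.1, NAME.2, ...)
wide <- reshape(demo, direction = "wide", idvar = "ID", timevar = "GRP")
print(names(wide))
```

IDs with fewer rows than the longest group are padded with `NA`, which is why `<NA>` appears in the output above.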
    

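    【Comments】:

    • Thanks. But I need each variable's columns to sit next to each other after the transformation, e.g. NAME1, NAME2, NAME3, ADDRESS1, ADDRESS2, etc.
    • @BrunoAvila see my update
    • Thanks. Since the COUNTRY variable is the same for everyone, I would like it not to be split into 3; it should just stay COUNTRY, as in my df2 example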
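
One possible way to address the last comment in base R is to pass only the columns that actually vary within an ID as `v.names`; `reshape()` then leaves the remaining columns as single, time-invariant columns. The following is a sketch under that assumption, not the answerer's own update. The helper names `d`, `is_const`, `varying` and `res` are illustrative, and the column reordering assumes that no original column name ends in a digit.

```r
df1 <- data.frame(ID = c(111,111,111,111,111,111,222,222,333),
                  NAME = c('JOHN','JOHN','MARY','MARY','JAMES','JAMES','WILL','WILL','MARK'),
                  ADRESS = c('NY','NY','NY','NY','ROMA','ROMA','LONDON','TOKYO',''),
                  COLOR = c('GREEN','GREEN','RED','RED','YELLOW','YELLOW','BLUE','BLUE','ORANGE'),
                  CAR = c('','','BMW','BMW','TRUCK','TRUCK','FORD','FORD','FERRARI'),
                  COUNTRY = c('USA','USA','USA','USA','USA','USA','USA','USA','USA'))

# Drop exact duplicate rows and number the remaining rows within each ID
d <- unique(df1)
d$GRP <- ave(seq_along(d$ID), d$ID, FUN = seq_along)

# A column counts as constant if it has one distinct value inside every ID
is_const <- sapply(setdiff(names(df1), "ID"), function(col)
  all(tapply(d[[col]], d$ID, function(v) length(unique(v)) == 1)))
varying <- names(is_const)[!is_const]

# Only the varying columns are spread; constant ones (here COUNTRY) stay single
u <- reshape(d, direction = "wide", idvar = "ID", timevar = "GRP",
             v.names = varying, sep = "")

# Restore the original column order (strip the numeric suffix before matching)
res <- u[order(match(sub("\\d+$", "", names(u)), names(df1)))]

# Optional: use empty strings instead of NA, as in the desired df2
res[is.na(res)] <- ""
print(res)
```

Because the constant columns are detected from the data, the same code should carry over to files with many such columns, as long as "constant" is meant per ID.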
    【Solution 2】:

    There is also an option with pivot_wider

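    library(dplyr)
    library(tidyr)
    library(data.table)  # for rowid()
    distinct(df1) %>%                      # drop exact duplicate rows
      mutate(rn = rowid(ID)) %>%           # running index within each ID
      pivot_wider(names_from = rn, values_from = NAME:CAR, 
        names_sep = "", values_fill = "") %>%
      select(-COUNTRY, COUNTRY)            # move COUNTRY to the last column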
    

    -output

    # A tibble: 3 × 14
         ID NAME1 NAME2  NAME3   ADRESS1  ADRESS2 ADRESS3 COLOR1 COLOR2 COLOR3   CAR1      CAR2   CAR3    COUNTRY
      <dbl> <chr> <chr>  <chr>   <chr>    <chr>   <chr>   <chr>  <chr>  <chr>    <chr>     <chr>  <chr>   <chr>  
    1   111 JOHN  "MARY" "JAMES" "NY"     "NY"    "ROMA"  GREEN  "RED"  "YELLOW" ""        "BMW"  "TRUCK" USA    
    2   222 WILL  "WILL" ""      "LONDON" "TOKYO" ""      BLUE   "BLUE" ""       "FORD"    "FORD" ""      USA    
    3   333 MARK  ""     ""      ""       ""      ""      ORANGE ""     ""       "FERRARI" ""     ""      USA    
    

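    【Comments】:

    • Thanks. Since the COUNTRY variable is the same for everyone, I would like it not to be split into 3; it should just stay COUNTRY, as in my df2 example
    • @BrunoAvila I think you may need to remove COUNTRY from values_from (as in the update)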
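
Since the question mentions that the real file has many columns like COUNTRY, listing them by hand in `values_from` may not scale. A possible generalization, sketched here as an assumption rather than as part of the original answer, is to detect the columns that vary within an ID and pass only those to `pivot_wider`. The names `d`, `varying` and `res` are illustrative, `row_number()` stands in for `data.table::rowid()`, and `values_fill = ""` assumes all varying columns are character.

```r
library(dplyr)
library(tidyr)

df1 <- data.frame(ID = c(111,111,111,111,111,111,222,222,333),
                  NAME = c('JOHN','JOHN','MARY','MARY','JAMES','JAMES','WILL','WILL','MARK'),
                  ADRESS = c('NY','NY','NY','NY','ROMA','ROMA','LONDON','TOKYO',''),
                  COLOR = c('GREEN','GREEN','RED','RED','YELLOW','YELLOW','BLUE','BLUE','ORANGE'),
                  CAR = c('','','BMW','BMW','TRUCK','TRUCK','FORD','FORD','FERRARI'),
                  COUNTRY = c('USA','USA','USA','USA','USA','USA','USA','USA','USA'))

d <- distinct(df1)

# Columns that take more than one distinct value inside at least one ID
varying <- d %>%
  group_by(ID) %>%
  summarise(across(everything(), n_distinct), .groups = "drop") %>%
  select(-ID) %>%
  select(where(~ any(.x > 1))) %>%
  names()

res <- d %>%
  group_by(ID) %>%
  mutate(rn = row_number()) %>%          # running index within each ID
  ungroup() %>%
  pivot_wider(names_from = rn, values_from = all_of(varying),
              names_sep = "", values_fill = "") %>%
  # constant columns were kept as id columns; move them to the end
  relocate(all_of(setdiff(names(d), c("ID", varying))), .after = last_col())

print(res)
```

Columns left out of `values_from` are treated by `pivot_wider` as identifying columns, so they are kept once per ID rather than spread.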