【Question title】: Pivot a data frame without aggregation
【Posted】: 2021-02-09 20:57:19
【Question】:

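The goal is to convert a data frame that represents a one-to-many relationship (one computer with several monitors) into a wider representation, with one row per computer.

The data frame (abbreviated) could be: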

library(tidyverse)
df <- tibble::tribble(
  ~CPU_ID,    ~ID, ~CONFIGITEM_NUMBER,        ~NAME, ~AllocationDate,                   ~Model,           ~Vendor,
  182434, 195251,       101142000825, "COMP000572",    "2014-04-10", "HP ELITE DISPLAY E-231", "Hewlett-Packard",
  182434, 405022,         1142027261, "COMP030500",    "2020-12-02",                  "V173A",            "ACER",
  182436, 183607,       101142000008, "COMP000008",    "2014-04-18", "HP ELITE DISPLAY E-231", "Hewlett-Packard",
  182437, 228469,         1142006861, "COMP020117",    "2018-03-05",              "S22C45KBW",         "Samsung",
  182437, 341806,         1142019822, "COMP050244",    "2019-01-09",                 "L1940T",              "HP",
  182438, 205930,       101142001009, "COMP050002",    "2019-05-20",              "S22C45KBW",         "Samsung",
  182439, 240546,         1142008622, "COMP050131",    "2016-09-16", "SAMSUNG SYNCMASTER 943",         "SAMSUNG",
  182462, 184114,       101142000515, "COMP000515",    "2019-08-27", "HP ELITE DISPLAY E-231", "Hewlett-Packard",
  182463, 184113,       101142000514, "COMP000514",    "2019-08-28", "HP ELITE DISPLAY E-231", "Hewlett-Packard",
  182464, 184106,       101142000507, "COMP000507",    "2019-08-27", "HP ELITE DISPLAY E-231", "Hewlett-Packard"
)

I can pivot it correctly with:


df %>%
  group_by(CPU_ID) %>%
  filter(row_number() == 1) %>%
  ungroup() %>%
  rename_with( ~ paste0("monitor1_", .), .cols = !CPU_ID) %>%
  left_join(
    df %>%
      group_by(CPU_ID) %>%
      filter(row_number() == 2) %>%
      ungroup() %>%
      rename_with( ~ paste0("monitor2_", .), .cols = !CPU_ID),
    by = "CPU_ID"
  )
#> # A tibble: 8 x 13
#>   CPU_ID monitor1_ID monitor1_CONFIG~ monitor1_NAME monitor1_Alloca~ monitor1_Model monitor1_Vendor
#>    <dbl>       <dbl>            <dbl> <chr>         <chr>            <chr>          <chr>
#> 1 182434      195251     101142000825 COMP000572    2014-04-10       HP ELITE DISP~ Hewlett-Packard
#> 2 182436      183607     101142000008 COMP000008    2014-04-18       HP ELITE DISP~ Hewlett-Packard
#> 3 182437      228469       1142006861 COMP020117    2018-03-05       S22C45KBW      Samsung
#> 4 182438      205930     101142001009 COMP050002    2019-05-20       S22C45KBW      Samsung
#> 5 182439      240546       1142008622 COMP050131    2016-09-16       SAMSUNG SYNCM~ SAMSUNG
#> 6 182462      184114     101142000515 COMP000515    2019-08-27       HP ELITE DISP~ Hewlett-Packard
#> 7 182463      184113     101142000514 COMP000514    2019-08-28       HP ELITE DISP~ Hewlett-Packard
#> 8 182464      184106     101142000507 COMP000507    2019-08-27       HP ELITE DISP~ Hewlett-Packard
#> # ... with 6 more variables: monitor2_ID <dbl>, monitor2_CONFIGITEM_NUMBER <dbl>,
#> #   monitor2_NAME <chr>, monitor2_AllocationDate <chr>, monitor2_Model <chr>, monitor2_Vendor <chr>

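But in the real data frame there are computers with more than two monitors, so this approach would need many left_join calls.

I tried to write an alternative, for example: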

df %>%
  group_by(CPU_ID) %>%
  mutate(monitor_n = row_number()) %>%
  ungroup() %>%
  pivot_wider(
    id_cols = CPU_ID,
    names_from = monitor_n,
    values_from = !CPU_ID
  ) %>%
  select(-starts_with("monitor_n")) %>%
  rename_with(function(colname)
    str_replace(colname, "^(.*)_(\\d)$", "monitor\\2_\\1"),
    .cols = !CPU_ID)
#> # A tibble: 8 x 13
#>   CPU_ID monitor1_ID monitor2_ID monitor1_CONFIG~ monitor2_CONFIG~ monitor1_NAME monitor2_NAME
#>    <dbl>       <dbl>       <dbl>            <dbl>            <dbl> <chr>         <chr>
#> 1 182434      195251      405022     101142000825       1142027261 COMP000572    COMP030500
#> 2 182436      183607          NA     101142000008               NA COMP000008    <NA>
#> 3 182437      228469      341806       1142006861       1142019822 COMP020117    COMP050244
#> 4 182438      205930          NA     101142001009               NA COMP050002    <NA>
#> 5 182439      240546          NA       1142008622               NA COMP050131    <NA>
#> 6 182462      184114          NA     101142000515               NA COMP000515    <NA>
#> 7 182463      184113          NA     101142000514               NA COMP000514    <NA>
#> 8 182464      184106          NA     101142000507               NA COMP000507    <NA>
#> # ... with 6 more variables: monitor1_AllocationDate <chr>, monitor2_AllocationDate <chr>,
#> #   monitor1_Model <chr>, monitor2_Model <chr>, monitor1_Vendor <chr>, monitor2_Vendor <chr>

But I need to keep the columns in the same order as in the original data frame.

Could you suggest a simpler (tidier) alternative?

【Comments】:

    Tags: r tidyr


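    【Solution 1】:

    Similar to @Lennyy's second solution, I'd suggest pivoting longer first and then pivoting wider. One potential downside is that the values must, at least temporarily, all have the same type (e.g. character), but you can convert any of them back at the end if necessary.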

    df %>%
      pivot_longer(cols = -CPU_ID, names_to = "variable", values_to = "value",
                   values_transform = list(value = as.character)) %>%
      group_by(CPU_ID, variable) %>%
      mutate(variable = paste(variable, row_number(), sep = "_")) %>%
      ungroup() %>%
      pivot_wider(names_from = variable, values_from = value)
    
    
    # A tibble: 8 x 13
      CPU_ID ID_1   CONFIGITEM_NUMBER… NAME_1   AllocationDate_1 Model_1        Vendor_1    ID_2  CONFIGITEM_NUMBE… NAME_2  AllocationDate_2 Model_2 Vendor_2
       <dbl> <chr>  <chr>              <chr>    <chr>            <chr>          <chr>       <chr> <chr>             <chr>   <chr>            <chr>   <chr>   
    1 182434 195251 101142000825       COMP000… 2014-04-10       HP ELITE DISP… Hewlett-Pa… 4050… 1142027261        COMP03… 2020-12-02       V173A   ACER    
    2 182436 183607 101142000008       COMP000… 2014-04-18       HP ELITE DISP… Hewlett-Pa… NA    NA                NA      NA               NA      NA      
    3 182437 228469 1142006861         COMP020… 2018-03-05       S22C45KBW      Samsung     3418… 1142019822        COMP05… 2019-01-09       L1940T  HP      
    4 182438 205930 101142001009       COMP050… 2019-05-20       S22C45KBW      Samsung     NA    NA                NA      NA               NA      NA      
    5 182439 240546 1142008622         COMP050… 2016-09-16       SAMSUNG SYNCM… SAMSUNG     NA    NA                NA      NA               NA      NA      
    6 182462 184114 101142000515       COMP000… 2019-08-27       HP ELITE DISP… Hewlett-Pa… NA    NA                NA      NA               NA      NA      
    7 182463 184113 101142000514       COMP000… 2019-08-28       HP ELITE DISP… Hewlett-Pa… NA    NA                NA      NA               NA      NA      
    8 182464 184106 101142000507       COMP000… 2019-08-27       HP ELITE DISP… Hewlett-Pa… NA    NA                NA      NA               NA      NA 
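
If the original column types matter, one option is to convert selected columns back after the pivot. The following is a minimal sketch on a cut-down version of the data (only `ID` and `NAME` are kept), assuming that the `ID_*` columns are the ones that should become numeric again; the object names `wide` and `wide_typed` are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Cut-down version of the example data (only two value columns)
df <- tibble::tribble(
  ~CPU_ID,    ~ID,        ~NAME,
  182434, 195251, "COMP000572",
  182434, 405022, "COMP030500",
  182436, 183607, "COMP000008"
)

# Same longer-then-wider pipeline as above; all values become character
wide <- df %>%
  pivot_longer(cols = -CPU_ID, names_to = "variable", values_to = "value",
               values_transform = list(value = as.character)) %>%
  group_by(CPU_ID, variable) %>%
  mutate(variable = paste(variable, row_number(), sep = "_")) %>%
  ungroup() %>%
  pivot_wider(names_from = variable, values_from = value)

# Convert the ID_* columns back to numeric; NAME_* stays character
wide_typed <- wide %>%
  mutate(across(starts_with("ID_"), as.numeric))
```

`readr::type_convert()` might be an alternative when many columns need to be re-guessed at once, at the cost of less explicit control over each column's type.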
    

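    【Comments】:

    • Rather than trying to go straight from "wide" to "extra wide", it is better to pivot longer first and then pivot to "extra wide".
    【Solution 2】:

    Maybe something like this?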

    df %>% 
      group_by(CPU_ID) %>% 
      mutate(rowno = row_number()) %>% 
      ungroup %>% 
      gather(var, val, -CPU_ID, -rowno) %>% 
      mutate(newcolname = paste0("monitor", rowno, "_", var)) %>% 
      select(-c(var, rowno)) %>% 
      pivot_wider(names_from = newcolname, values_from = val)
    
    # A tibble: 8 x 13
      CPU_ID monitor1_ID monitor2_ID monitor1_CONFIG~ monitor2_CONFIG~ monitor1_NAME monitor2_NAME monitor1_Alloca~ monitor2_Alloca~ monitor1_Model
       <dbl> <chr>       <chr>       <chr>            <chr>            <chr>         <chr>         <chr>            <chr>            <chr>         
    1 182434 195251      405022      101142000825     1142027261       COMP000572    COMP030500    2014-04-10       2020-12-02       HP ELITE DISP~
    2 182436 183607      NA          101142000008     NA               COMP000008    NA            2014-04-18       NA               HP ELITE DISP~
    3 182437 228469      341806      1142006861       1142019822       COMP020117    COMP050244    2018-03-05       2019-01-09       S22C45KBW     
    4 182438 205930      NA          101142001009     NA               COMP050002    NA            2019-05-20       NA               S22C45KBW     
    5 182439 240546      NA          1142008622       NA               COMP050131    NA            2016-09-16       NA               SAMSUNG SYNCM~
    6 182462 184114      NA          101142000515     NA               COMP000515    NA            2019-08-27       NA               HP ELITE DISP~
    7 182463 184113      NA          101142000514     NA               COMP000514    NA            2019-08-28       NA               HP ELITE DISP~
    8 182464 184106      NA          101142000507     NA               COMP000507    NA            2019-08-27       NA               HP ELITE DISP~
    # ... with 3 more variables: monitor2_Model <chr>, monitor1_Vendor <chr>, monitor2_Vendor <chr>
    

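    You could also use pivot_longer, but it changes the order of the columns compared with the output above (this can be corrected if needed):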

    df %>% 
      group_by(CPU_ID) %>% 
      mutate(rowno = row_number()) %>% 
      ungroup %>% 
      pivot_longer(-c(CPU_ID, rowno), names_to = "var", values_to = "val", values_transform = list(val = as.character)) %>% 
      mutate(newcolname = paste0("monitor", rowno, "_", var)) %>% 
      select(-c(var, rowno)) %>% 
      pivot_wider(names_from = newcolname, values_from = val)
    
    # A tibble: 8 x 13
      CPU_ID monitor1_ID monitor1_CONFIG~ monitor1_NAME monitor1_Alloca~ monitor1_Model monitor1_Vendor monitor2_ID monitor2_CONFIG~ monitor2_NAME
       <dbl> <chr>       <chr>            <chr>         <chr>            <chr>          <chr>           <chr>       <chr>            <chr>        
    1 182434 195251      101142000825     COMP000572    2014-04-10       HP ELITE DISP~ Hewlett-Packard 405022      1142027261       COMP030500   
    2 182436 183607      101142000008     COMP000008    2014-04-18       HP ELITE DISP~ Hewlett-Packard NA          NA               NA           
    3 182437 228469      1142006861       COMP020117    2018-03-05       S22C45KBW      Samsung         341806      1142019822       COMP050244   
    4 182438 205930      101142001009     COMP050002    2019-05-20       S22C45KBW      Samsung         NA          NA               NA           
    5 182439 240546      1142008622       COMP050131    2016-09-16       SAMSUNG SYNCM~ SAMSUNG         NA          NA               NA           
    6 182462 184114      101142000515     COMP000515    2019-08-27       HP ELITE DISP~ Hewlett-Packard NA          NA               NA           
    7 182463 184113      101142000514     COMP000514    2019-08-28       HP ELITE DISP~ Hewlett-Packard NA          NA               NA           
    8 182464 184106      101142000507     COMP000507    2019-08-27       HP ELITE DISP~ Hewlett-Packard NA          NA               NA           
    # ... with 3 more variables: monitor2_AllocationDate <chr>, monitor2_Model <chr>, monitor2_Vendor <chr>
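
As a possible alternative for controlling the column order, newer versions of tidyr let `pivot_wider()` do the naming and ordering itself. This is only a sketch, assuming tidyr 1.2.0 or later (which postdates this question) for the `names_vary` argument, and using a cut-down version of the data; it is not part of the answer above. Because no intermediate long format is needed, the value columns keep their original types.

```r
library(dplyr)
library(tidyr)

# Cut-down version of the example data (only two value columns)
df <- tibble::tribble(
  ~CPU_ID,    ~ID,        ~NAME,
  182434, 195251, "COMP000572",
  182434, 405022, "COMP030500",
  182436, 183607, "COMP000008"
)

wide <- df %>%
  group_by(CPU_ID) %>%
  mutate(rowno = row_number()) %>%   # monitor index within each computer
  ungroup() %>%
  pivot_wider(
    id_cols = CPU_ID,
    names_from = rowno,
    values_from = c(ID, NAME),
    names_glue = "monitor{rowno}_{.value}",  # e.g. monitor1_ID
    names_vary = "slowest"                   # keep each monitor's columns together
  )

names(wide)
```

With `names_vary = "fastest"` (the default) the columns would instead be interleaved per variable, as in the `gather()` output.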
    

    【Comments】:

    • Thanks. The second suggestion is what I needed.
    【Solution 3】:

    For the record, what I ended up using (based on Lenny's and Jon Spring's answers):

    df %>%
      pivot_longer(
        cols = !CPU_ID,
        names_to = "variable",
        values_to = "value",
        values_transform = list(value = as.character)
      ) %>%
      group_by(CPU_ID, variable) %>%
      mutate(variable = paste0("monitor", row_number(), "_", variable)) %>%
      ungroup() %>%
      pivot_wider(names_from = variable, values_from = value)
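
Since the real data has computers with more than two monitors, a quick check on a small made-up data set can confirm that the same pipeline scales without extra joins. The data frame `df3` below is hypothetical (one computer with three monitors, one with a single monitor) and is not taken from the question.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: CPU 1 has three monitors, CPU 2 has one
df3 <- tibble::tribble(
  ~CPU_ID, ~ID, ~NAME,
        1,  11,   "A",
        1,  12,   "B",
        1,  13,   "C",
        2,  21,   "D"
)

wide3 <- df3 %>%
  pivot_longer(
    cols = !CPU_ID,
    names_to = "variable",
    values_to = "value",
    values_transform = list(value = as.character)
  ) %>%
  group_by(CPU_ID, variable) %>%
  # row_number() counts the monitors per computer and variable
  mutate(variable = paste0("monitor", row_number(), "_", variable)) %>%
  ungroup() %>%
  pivot_wider(names_from = variable, values_from = value)

names(wide3)
```

The result has one `monitorN_*` group of columns per monitor, in the original variable order, and `NA` where a computer has fewer monitors.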
    
