【Question title】: Transform a dataframe [duplicate]
【Posted】: 2019-02-12 10:04:51
【Question】:
M     Product   Price
-------------------------
2014m1  Pepsi   55
2014m1  Coke    60
2014m2  Pepsi   55
2014m2  Coke    62
2014m3  Pepsi   55
2014m3  Coke    63
2014m4  Pepsi   55
2014m5  Pepsi   55
2014m6  Pepsi   55
2014m8  Pepsi   58
2014m9  Pepsi   58
2014m10 Pepsi   58
2014m11 Pepsi   58
2014m12 Pepsi   58

I have price time series for two products, Pepsi and Coke. I want to transform this table into the following one.

M     Product Price
--------------------------
2014m1  Coke    60
2014m2  Coke    62
2014m3  Coke    63
2014m4  Coke    NA
2014m5  Coke    NA
2014m6  Coke    NA
2014m7  Coke    NA
2014m8  Coke    NA
2014m9  Coke    NA
2014m10 Coke    NA
2014m11 Coke    NA
2014m12 Coke    NA
2014m1  Pepsi   55
2014m2  Pepsi   55
2014m3  Pepsi   55
2014m4  Pepsi   55
2014m5  Pepsi   55
2014m6  Pepsi   55
2014m7  Pepsi   58
2014m8  Pepsi   58
2014m9  Pepsi   58
2014m10 Pepsi   58
2014m11 Pepsi   58
2014m12 Pepsi   58

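That is, in this table every product has a row for each month, with its price (or NA where no price was observed). Can anyone help me transform the table?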

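【Comments on the question】:

  • Your original data.frame has no value for Pepsi in 2014m7. Is that a typo?
  • Sorry, the missing Pepsi 2014m7 row is a mistake. That observation does have a value; you can see it in the second table.

Tags: r dataframe dplyr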


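【Solution 1】:

Here is a more flexible solution using tidyr::expand. You don't have to specify the number of months to fill in (12 in your case), because sub extracts the largest month number from the data.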

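library(tidyverse)

my_df %>% 
 # largest month number found in M (12 for this data)
 mutate(val = max(as.integer(sub('.*m', '', M)))) %>% 
 group_by(Product) %>% 
 # one row per month 1..val for every product
 expand(M = paste0('2014m', seq(val[1]))) %>% 
 # bring the observed prices back; months without a price become NA
 left_join(my_df, by = c('Product', 'M'))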

which gives,

# A tibble: 24 x 3
# Groups:   Product [?]
   Product M       Price
   <chr>   <chr>   <int>
 1 Coke    2014m1     60
 2 Coke    2014m10    NA
 3 Coke    2014m11    NA
 4 Coke    2014m12    NA
 5 Coke    2014m2     62
 6 Coke    2014m3     63
 7 Coke    2014m4     NA
 8 Coke    2014m5     NA
 9 Coke    2014m6     NA
10 Coke    2014m7     NA
# ... with 14 more rows
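
The `sub('.*m', '', M)` step above removes everything up to and including the `m`, leaving only the month number. A minimal base-R sketch of that step follows; the vector `months_chr` and the other variable names are made up for illustration and are not part of the original answer. It also shows why the output above lists 2014m10 before 2014m2 (M is sorted as text), and one possible way to get calendar order instead.

```r
# Hypothetical example vector, in the same format as the M column
months_chr <- c("2014m1", "2014m10", "2014m2", "2014m12")

# Strip everything up to and including "m"; what is left is the month number
month_num <- as.integer(sub(".*m", "", months_chr))

# The largest month present; this is what `val` holds in the pipeline above
max_month <- max(month_num)

# Ordering by the numeric month gives calendar order rather than text order
sorted_cal <- months_chr[order(month_num)]
print(sorted_cal)
```

Applied to the result of the pipeline, the same `order(...)` expression could be used inside `arrange()` if calendar order matters.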

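【Comments】:

  • Thank you, this really works! I have one more question: how can I extend this code to several years (e.g. 2015, 2016, 2017) rather than only 2014?
【Solution 2】:

You can use complete from tidyr for this. First convert M to a factor whose levels are all the months you want to appear in the data, then use complete to fill in the missing month/product combinations.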

my_df %>% 
  mutate(M = factor(M, levels = paste0(2014, "m", 1:12))) %>%
  complete(M, Product)

# A tibble: 24 x 3
#    M      Product Price
#    <fct>  <chr>   <int>
#  1 2014m1 Coke       60
#  2 2014m1 Pepsi      55
#  3 2014m2 Coke       62
#  4 2014m2 Pepsi      55
#  5 2014m3 Coke       63
#  6 2014m3 Pepsi      55
#  7 2014m4 Coke       NA
#  8 2014m4 Pepsi      55
#  9 2014m5 Coke       NA
# 10 2014m5 Pepsi      55
# ... with 14 more rows

Data

my_df <- structure(list(M = c("2014m1", "2014m1", "2014m2", "2014m2", "2014m3", "2014m3", 
                     "2014m4", "2014m5", "2014m6", "2014m8", "2014m9", "2014m10", 
                     "2014m11", "2014m12"), 
               Product = c("Pepsi", "Coke", "Pepsi", "Coke", "Pepsi", "Coke", 
                           "Pepsi", "Pepsi", "Pepsi", "Pepsi", "Pepsi", "Pepsi",
                           "Pepsi", "Pepsi"), 
               Price = c(55L, 60L, 55L, 62L, 55L, 63L, 55L, 55L, 55L, 58L, 58L, 
                         58L, 58L, 58L)), 
          class = "data.frame", row.names = c(NA, -14L))
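
The follow-up comment under Solution 1 asks how to cover several years. One possible way, sketched here under assumptions, is to keep the `complete` approach and simply build the factor levels for every year/month pair. The data frame `my_df2`, the `years` range, and `all_months` are hypothetical names and values introduced only for this illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical multi-year data in the same format as the question
my_df2 <- data.frame(
  M = c("2014m1", "2014m12", "2015m1", "2015m3"),
  Product = c("Pepsi", "Coke", "Pepsi", "Coke"),
  Price = c(55L, 60L, 58L, 63L),
  stringsAsFactors = FALSE
)

years <- 2014:2015  # assumed range; adjust to the years you need

# All labels in calendar order: 2014m1, ..., 2014m12, 2015m1, ..., 2015m12
all_months <- paste0(rep(years, each = 12), "m",
                     rep(1:12, times = length(years)))

result <- my_df2 %>%
  # complete() uses every factor level, including months absent from the data
  mutate(M = factor(M, levels = all_months)) %>%
  complete(M, Product)

print(result)
```

Because M is a factor with levels in calendar order, the rows come out as 2014m1, 2014m2, and so on, rather than in alphabetical order.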

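【Comments】:

    【Solution 3】:

    One way is to create a new data frame containing every possible month/product combination and then merge it with the original data frame.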

    new_df <- data.frame(M = paste0(2014, "m", seq(12)), 
             Product = rep(unique(df$Product), each = 12))
    
    merge(new_df, df, all.x = TRUE)
    
    
    #         M  Product Price
    #1   2014m1    Coke    60
    #2   2014m1   Pepsi    55
    #3   2014m10   Coke    NA
    #4   2014m10  Pepsi    58
    #5   2014m11   Coke    NA
    #6   2014m11  Pepsi    58
    #7   2014m12   Coke    NA
    #8   2014m12  Pepsi    58
    #9   2014m2    Coke    62
    #10  2014m2   Pepsi    55
    ......
    

    Here df is your original data frame.

    【Comments】:
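
A variation on the same idea, as a base-R sketch: `expand.grid` builds the month/product grid explicitly instead of relying on recycling inside `data.frame`. Here `df` is rebuilt from the data in the question, and `out` is a name chosen only for this example. The final step orders the rows by product and numeric month, which the `merge` output above does not do.

```r
# Data from the question, rebuilt so the example is self-contained
df <- data.frame(
  M = c("2014m1", "2014m1", "2014m2", "2014m2", "2014m3", "2014m3",
        "2014m4", "2014m5", "2014m6", "2014m8", "2014m9", "2014m10",
        "2014m11", "2014m12"),
  Product = c("Pepsi", "Coke", "Pepsi", "Coke", "Pepsi", "Coke",
              "Pepsi", "Pepsi", "Pepsi", "Pepsi", "Pepsi", "Pepsi",
              "Pepsi", "Pepsi"),
  Price = c(55L, 60L, 55L, 62L, 55L, 63L, 55L, 55L, 55L, 58L, 58L,
            58L, 58L, 58L),
  stringsAsFactors = FALSE
)

# Every month/product combination (12 months x 2 products = 24 rows)
new_df <- expand.grid(M = paste0(2014, "m", 1:12),
                      Product = unique(df$Product),
                      stringsAsFactors = FALSE)

# Keep all grid rows; combinations without an observed price get NA
out <- merge(new_df, df, by = c("M", "Product"), all.x = TRUE)

# Order by product, then by the numeric month extracted from M
out <- out[order(out$Product, as.integer(sub(".*m", "", out$M))), ]
print(out)
```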
