Custom aggregation function in dcast: answers

【Question title】: Custom aggregation function in dcast
【Posted】: 2016-10-12 18:18:49
【Question】:

I have a table that I need to reshape. It looks like this:

date   ItemID   NewPrice   Sale Amount
1-1     1         5            3
1-1     2         8            2
1-1     3         3            5
1-2     1         6            4
1-2     3         4            3
1-3     2         7            2
1-3     3         2            1

The first table I want to produce looks like this:

date   item_1    item_2    item_3
1-1      3         2         5 
1-2      4         0         3
1-3      0         2         1

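The item IDs become the column names and the values are the sale amounts. The tricky part is that on some days some items have no record, for example item 2 on 1-2. In that case the sale amount should be filled with 0.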

The second table I want to produce looks like this:

date     item_1     item_2     item_3
1-1        5          8          3
1-2        6          8          4
1-3        6          7          2

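So here I want to use the item IDs as columns and NewPrice as the value for each date.

The tricky part is that on any given day some items do not appear, so there is no NewPrice for that item on that day. In that case NewPrice should be carried over from the most recent earlier day on which the item has a NewPrice.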

【Comments】:

Have you checked dcast from library(reshape2)? For example: dcast(dfN, date~paste0("item_", ItemID), value.var="SaleAmount", fill=0)

Tags: r aggregation reshape2

【Solution 1】:

Here is a base R solution for the first part:

xtabs(`Sale Amount` ~ date + ItemID, DF)
##      ItemID
## date  1 2 3
##   1-1 3 2 5
##   1-2 4 0 3
##   1-3 0 2 1

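For the second part, we use na.locf from the zoo package together with tapply. The na.rm = FALSE argument covers the case where an item has an NA on the first date; in that case we leave it as NA.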

library(zoo)

na.locf(tapply(DF$NewPrice, DF[c("date", "ItemID")], c), na.rm = FALSE)
##      ItemID
## date  1 2 3
##   1-1 5 8 3
##   1-2 6 8 4
##   1-3 6 7 2
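The effect of na.rm = FALSE is easiest to see on a single vector. The following is a minimal sketch using a made-up price vector (not taken from the question) whose first value is missing:

```r
library(zoo)

# Hypothetical price vector where the first observation is missing
prices <- c(NA, 6, NA, 7)

# Default (na.rm = TRUE): the leading NA cannot be filled, so it is dropped
# and the result is shorter than the input
filled_default <- na.locf(prices)

# na.rm = FALSE: the leading NA is kept, so the length is preserved
filled_keep <- na.locf(prices, na.rm = FALSE)

length(filled_default)
length(filled_keep)
```

Keeping the length unchanged matters when the filled values go back into a table with one row per date.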

Note: the input DF in reproducible form is:

Lines <- "date   ItemID   NewPrice   'Sale Amount'
1-1     1         5            3
1-1     2         8            2
1-1     3         3            5
1-2     1         6            4
1-2     3         4            3
1-3     2         7            2
1-3     3         2            1"
DF <- read.table(text = Lines, header = TRUE, check.names = FALSE)

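【Comments】:

The first one is ace, but the second doesn't fill the way the OP wants. Tangential question: is there a way to coerce xtabs/tables to data.frames without them being reshaped to long format?
What is the final 'c' in the tapply call? Is it a matching function?
In this case each group has only one element, so we just need to return it. c called with a single input (as is the case here) simply returns that input. identity could be used instead, but c is shorter.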
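On the tangential question about keeping the wide layout: one common base R idiom is to call as.data.frame.matrix on the two-way table. The sketch below rebuilds DF inline so it runs on its own; the item_ prefix is added only to match the layout requested in the question.

```r
# Same data as DF in the answer, rebuilt here so the sketch is self-contained
DF <- data.frame(
  date = c("1-1", "1-1", "1-1", "1-2", "1-2", "1-3", "1-3"),
  ItemID = c(1, 2, 3, 1, 3, 2, 3),
  NewPrice = c(5, 8, 3, 6, 4, 7, 2),
  `Sale Amount` = c(3, 2, 5, 4, 3, 2, 1),
  check.names = FALSE
)

tab <- xtabs(`Sale Amount` ~ date + ItemID, DF)

# as.data.frame() on a table returns long format (date, ItemID, Freq);
# as.data.frame.matrix() keeps the wide layout, with the dates as row names
wide <- as.data.frame.matrix(tab)
names(wide) <- paste0("item_", names(wide))
wide
```

The dates end up as row names rather than as a column; if a date column is needed, it can be added from rownames(wide).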

【Solution 2】:

The successor to reshape2 is tidyr, which integrates nicely with dplyr. Your first case is pretty simple:

library(dplyr)
library(tidyr)

       # get rid of excess column
df %>% select(-NewPrice) %>% 
    # fix labels so they'll make nice column names
    mutate(ItemID = paste0('item_', ItemID)) %>% 
    # spread from long to wide, filling with 0 instead of NA
    spread(ItemID, Sale.Amount, fill = 0)

#   date item_1 item_2 item_3
# 1  1-1      3      2      5
# 2  1-2      4      0      3
# 3  1-3      0      2      1

For the second, use the fill function explicitly instead of the fill argument of spread:

       # get rid of excess column
df %>% select(-Sale.Amount) %>% 
    # fix labels so they'll make nice column names
    mutate(ItemID = paste0('item_', ItemID)) %>% 
    # spread from long to wide
    spread(ItemID, NewPrice) %>% 
    # fill NA values with previous value
    fill(-date)


#     date item_1 item_2 item_3
# 1    1-1      5      8      3
# 2    1-2      6      8      4
# 3    1-3      6      7      2
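In current tidyr releases, spread is superseded by pivot_wider. As a hedged sketch of how both tables might be written today (assuming tidyr 1.1.0 or later, where values_fill accepts a single value; df is rebuilt inline with the Sale.Amount column name used in this answer):

```r
library(tidyr)  # also re-exports %>% and starts_with()

df <- data.frame(
  date = c("1-1", "1-1", "1-1", "1-2", "1-2", "1-3", "1-3"),
  ItemID = c(1, 2, 3, 1, 3, 2, 3),
  NewPrice = c(5, 8, 3, 6, 4, 7, 2),
  Sale.Amount = c(3, 2, 5, 4, 3, 2, 1)
)

# First table: missing date/item combinations become 0
sales_wide <- df %>%
  pivot_wider(id_cols = date, names_from = ItemID, values_from = Sale.Amount,
              names_prefix = "item_", values_fill = 0)

# Second table: leave the gaps as NA, then carry the last price forward
price_wide <- df %>%
  pivot_wider(id_cols = date, names_from = ItemID, values_from = NewPrice,
              names_prefix = "item_") %>%
  fill(starts_with("item_"))
```

Unlike spread, pivot_wider orders the new columns by first appearance in the data, which here happens to give item_1, item_2, item_3.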

【Comments】:

【Solution 3】:

This can be done easily in one line with dcast:

library(data.table)
dcast(setDT(dfN), date~paste0("item_", ItemID), value.var="Sale.Amount", fill=0)
#   date item_1 item_2 item_3
#1:  1-1      3      2      5
#2:  1-2      4      0      3
#3:  1-3      0      2      1

For the second case, we can use na.locf to replace NA values with the previous non-NA value (after reshaping to "wide" with dcast).

library(zoo)
dcast(setDT(dfN), date~paste0("item_", ItemID), value.var="NewPrice")[, 
          (2:4) := lapply(.SD, na.locf), .SDcols = item_1:item_3][]
#   date item_1 item_2 item_3
#1:  1-1      5      8      3
#2:  1-2      6      8      4
#3:  1-3      6      7      2
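One caveat: na.locf drops leading NAs by default, so a column whose first date is missing would come back shorter than the table. A possible alternative that stays within data.table is nafill with type = "locf", which keeps leading NAs in place. This is a sketch assuming data.table 1.12.4 or later (where nafill was introduced); DT is rebuilt inline and cols is a helper name chosen for illustration.

```r
library(data.table)

DT <- data.table(
  date = c("1-1", "1-1", "1-1", "1-2", "1-2", "1-3", "1-3"),
  ItemID = c(1, 2, 3, 1, 3, 2, 3),
  NewPrice = c(5, 8, 3, 6, 4, 7, 2)
)

# Reshape to wide; missing date/item combinations are NA at this point
wide <- dcast(DT, date ~ paste0("item_", ItemID), value.var = "NewPrice")

# Carry the last observed price forward in every item column
cols <- setdiff(names(wide), "date")
wide[, (cols) := lapply(.SD, nafill, type = "locf"), .SDcols = cols]
wide
```

Selecting the columns by name with setdiff avoids hard-coding positions such as 2:4, so the same code works when the number of items changes.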

【Comments】:

One question: where is your fun.aggregate? If it isn't specified, length() will be used....
@Iserlohn By default, if we don't supply fun.aggregate, it uses length
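The length fallback only comes into play when some date/item combination has more than one row; in the question's data every combination is unique, so no aggregation happens. A small sketch with made-up data (item 1 is recorded twice on 1-1; dup is a hypothetical table, not from the question) shows the difference between the fallback and an explicit fun.aggregate:

```r
library(data.table)

# Hypothetical data: item 1 appears twice on 1-1
dup <- data.table(
  date = c("1-1", "1-1", "1-1", "1-2"),
  ItemID = c(1, 1, 2, 1),
  Sale.Amount = c(3, 4, 2, 5)
)

# No fun.aggregate: dcast falls back to length and reports it
# (as a message or a warning, depending on the data.table version),
# so the cells hold row counts rather than sale amounts
wide_count <- suppressWarnings(suppressMessages(
  dcast(dup, date ~ paste0("item_", ItemID), value.var = "Sale.Amount")
))

# Explicit aggregation: add up the duplicate rows, fill empty cells with 0
wide_sum <- dcast(dup, date ~ paste0("item_", ItemID),
                  value.var = "Sale.Amount", fun.aggregate = sum, fill = 0)
```

Any function that returns a single value per group can be passed as fun.aggregate in the same way.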