R: Aggregate from two data frames on conditions

【Question title】: R: Aggregate from two data frames on conditions
【Posted】: 2015-03-16 15:22:25
【Question description】:

I have a data frame named "e" that contains posts from a platform, each with a unique entry_id and a member_id:

row.    member_id   entry_id        timestamp
1       1            a              2008-06-09 12:41:00
2       1            b              2008-07-14 18:41:00
3       1            c              2010-07-17 15:40:00
4       2            d              2008-06-09 12:41:00
5       2            e              2008-09-18 10:22:00
6       3            f              2008-10-03 13:36:00

I have another data frame named "c" that contains comments:

row.    member_id   comment_id      timestamp
1       1            I              2007-06-09 12:41:00
2       1            II             2007-07-14 18:41:00
3       1            III            2009-07-17 15:40:00
4       2            IV             2007-06-09 12:41:00
5       2            V              2009-09-18 10:22:00
6       3            VI             2010-10-03 13:36:00

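I want to count all the comments a member wrote before posting each entry, so data frame "e" should end up looking like this. When reading the example, you only need to look at the years; the solution, however, should also be accurate down to the minute: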

row.    member_id   entry_id    prev_comment_count  timestamp
1       1            a              2              2008-06-09 12:41:00
2       1            b              2              2008-07-14 18:41:00
3       1            c              3              2010-07-17 15:40:00
4       2            d              1              2008-06-09 12:41:00
5       2            e              1              2008-09-18 10:22:00
6       3            f              0              2008-10-03 13:36:00

I have already tried the following function:

functionPrevComments <- function(givE)  nrow(subset
(c, (as.character(givE["member_id"]) == c["member_id"]) & 
(c["timestamp"] <= givE["timestamp"])))

But when I try to use it, I get the error

"Incompatible methods ("Ops.data.frame", "Ops.factor") for "<=""

I then used the "$" operator to refer to the columns I need, but that gave me

"$ operator is invalid for atomic vectors "

How can I apply my function correctly, or is there a better solution to my problem?

Best regards,

Nicholas

【Question discussion】:

Tags: r dataframe conditional-statements aggregate multiple-columns

【Solution 1】:

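# Tag each row with its origin so entries and comments can be stacked.
e$type <- "entry"
c$type <- "comment"

# Give both data frames identical column names so rbind() can combine them.
# (R still finds the function c() here even though a data frame is named "c".)
names(e) <- c("row", "member_id", "action_id", "timestamp", "type")
names(c) <- c("row", "member_id", "action_id", "timestamp", "type")

DF <- rbind(e,c)
# Parse the timestamps so that ordering is chronological, down to the second.
DF$timestamp <- as.POSIXct(DF$timestamp, 
                           format = "%Y-%m-%d %H:%M:%S", tz = "GMT")
DF <- DF[order(DF$member_id, DF$timestamp),]
# Per member, keep a running total of the comments seen so far.
# ave() returns character here because DF$type is character, hence as.integer().
DF$count <- as.integer(ave(DF$type, 
                           DF$member_id, 
                           FUN = function(x) cumsum(x == "comment")))
# Keep only the entry rows; count is the number of earlier comments.
DF[DF$type == "entry",]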

#  row member_id action_id           timestamp  type count
#1   1         1         a 2008-06-09 12:41:00 entry     2
#2   2         1         b 2008-07-14 18:41:00 entry     2
#3   3         1         c 2010-07-17 15:40:00 entry     3
#4   4         2         d 2008-06-09 12:41:00 entry     1
#5   5         2         e 2008-09-18 10:22:00 entry     1
#6   6         3         f 2008-10-03 13:36:00 entry     0

If this is not fast enough, it can be improved with data.table or dplyr.
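If speed is the concern, one dependency-free option is to stay in base R and let findInterval() do the counting per member. This is not the data.table or dplyr route mentioned above, just a minimal sketch of a vectorized alternative. The names entries, comments and count_comments_before are hypothetical stand-ins (the question calls the data frames "e" and "c"), and the timestamps are assumed to be in GMT.

```r
# Hypothetical stand-ins for the question's data frames "e" and "c".
entries <- data.frame(
  member_id = c(1, 1, 1, 2, 2, 3),
  entry_id  = c("a", "b", "c", "d", "e", "f"),
  timestamp = as.POSIXct(c("2008-06-09 12:41:00", "2008-07-14 18:41:00",
                           "2010-07-17 15:40:00", "2008-06-09 12:41:00",
                           "2008-09-18 10:22:00", "2008-10-03 13:36:00"),
                         tz = "GMT"),  # assumption: timestamps are GMT
  stringsAsFactors = FALSE
)
comments <- data.frame(
  member_id  = c(1, 1, 1, 2, 2, 3),
  comment_id = c("I", "II", "III", "IV", "V", "VI"),
  timestamp  = as.POSIXct(c("2007-06-09 12:41:00", "2007-07-14 18:41:00",
                            "2009-07-17 15:40:00", "2007-06-09 12:41:00",
                            "2009-09-18 10:22:00", "2010-10-03 13:36:00"),
                          tz = "GMT"),
  stringsAsFactors = FALSE
)

# For every entry, count the same member's comments with a strictly
# earlier timestamp (hypothetical helper, not part of any package).
count_comments_before <- function(entries, comments) {
  # Comment times as plain numbers, grouped by member.
  times_by_member <- split(as.numeric(comments$timestamp), comments$member_id)
  counts <- integer(nrow(entries))
  for (m in unique(entries$member_id)) {
    idx <- entries$member_id == m
    member_times <- times_by_member[[as.character(m)]]
    if (is.null(member_times)) next  # member never commented: count stays 0
    # With left.open = TRUE, findInterval() returns how many sorted
    # comment times are strictly smaller than each entry time.
    counts[idx] <- findInterval(as.numeric(entries$timestamp[idx]),
                                sort(member_times), left.open = TRUE)
  }
  counts
}

entries$prev_comment_count <- count_comments_before(entries, comments)
entries$prev_comment_count
# [1] 2 2 3 1 1 0
```

Each member's comment times are sorted once and every entry is then located by binary search, so the work no longer grows with the product of entries and comments. A comment that carries exactly the same timestamp as an entry is not counted here; dropping left.open = TRUE would count it.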

【Discussion】:

【Solution 2】:

Here is a slightly different option. Make sure both "timestamp" columns are converted to class POSIXct before running the code.

e$prev_comment_count <- sapply(seq_len(nrow(e)), function(i) {
  nrow(c[c$member_id == e$member_id[i] & c$timestamp < e$timestamp[i], ])
})

e
#  row. member_id entry_id           timestamp prev_comment_count
#1    1         1        a 2008-06-09 12:41:00                  2
#2    2         1        b 2008-07-14 18:41:00                  2
#3    3         1        c 2010-07-17 15:40:00                  3
#4    4         2        d 2008-06-09 12:41:00                  1
#5    5         2        e 2008-09-18 10:22:00                  1
#6    6         3        f 2008-10-03 13:36:00                  0
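The POSIXct conversion this answer asks for can be sketched as follows. The data frames entries and comments and the helper to_time are hypothetical stand-ins for illustration (the question uses "e" and "c"), the GMT time zone is an assumption borrowed from Solution 1, and stringsAsFactors = TRUE is only there to mimic timestamps that were read in as factors, which is what the "Ops.factor" message in the question suggests.

```r
# Hypothetical two-row stand-ins for the question's data frames.
entries <- data.frame(
  member_id = c(1, 2),
  timestamp = c("2008-06-09 12:41:00", "2008-09-18 10:22:00"),
  stringsAsFactors = TRUE  # mimic timestamps stored as factors
)
comments <- data.frame(
  member_id = c(1, 2),
  timestamp = c("2007-06-09 12:41:00", "2009-09-18 10:22:00"),
  stringsAsFactors = TRUE
)

# Convert factor or character timestamps to POSIXct (hypothetical helper).
# Going through as.character() first makes the factor case explicit.
to_time <- function(x) {
  as.POSIXct(as.character(x), format = "%Y-%m-%d %H:%M:%S", tz = "GMT")
}

entries$timestamp  <- to_time(entries$timestamp)
comments$timestamp <- to_time(comments$timestamp)

class(entries$timestamp)
# [1] "POSIXct" "POSIXt"

# Element-wise comparison now works on real points in time.
comments$timestamp < entries$timestamp
# [1]  TRUE FALSE
```

Once both columns are POSIXct, the comparison c$timestamp < e$timestamp[i] in the sapply() call compares actual times rather than factor levels or strings.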

【Discussion】: