Tracking a cohort over time in R

【Question title】: Tracking a cohort over time in R
【Posted】: 2018-02-01 12:16:17
【Question description】:

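I have a sample dataset containing user IDs and the month of each transaction. My goal is to count, month by month, how many of each month's original (new) users transacted again later. In other words: how many of January's new users also transacted in February, March, and April; how many of February's new users transacted in March and April; and so on.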

> data
       date user_id
1  Jan 2017       1
2  Jan 2017       2
3  Jan 2017       3
4  Jan 2017       4
5  Jan 2017       5
6  Feb 2017       1
7  Feb 2017       3
8  Feb 2017       5
9  Feb 2017       7
10 Feb 2017       9
11 Mar 2017       2
12 Mar 2017       4
13 Mar 2017       6
14 Mar 2017       8
15 Mar 2017      10
16 Apr 2017       1
17 Apr 2017       3
18 Apr 2017       6
19 Apr 2017       9
20 Apr 2017      12

The output for this dataset would look like this:

> output
    Jan Feb Mar Apr
Jan   5   3   2   2
Feb  NA   2   0   1
Mar  NA  NA   3   1
Apr  NA  NA  NA   1

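So far, the only approach I have come up with is to split the dataset by month and then, for each month, compute the unique IDs that did not appear in any earlier month. This is verbose and does not scale to a large dataset that spans many months.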

subsets <-split(data, data$date, drop=TRUE)

for (i in 1:length(subsets)) {
  assign(paste0("M", i), as.data.frame(subsets[[i]]))
}

M1_ids <- unique(M1$user_id)
M2_ids <- unique(M2$user_id)
M3_ids <- unique(M3$user_id)
M4_ids <- unique(M4$user_id)


M2_ids <- unique(setdiff(M2_ids, unique(M1_ids)))
M3_ids <- unique(setdiff(M3_ids, unique(c(M2_ids, M1_ids))))
M4_ids <- unique(setdiff(M4_ids, unique(c(M3_ids, M2_ids, M1_ids))))

Is there a shorter way in R, using dplyr or even base R, to produce the output above? The real dataset spans many years and months.

The column classes are as follows:

> sapply(data, class)
     date   user_id 
"yearmon" "integer"

And the sample data:

> dput(data)
structure(list(date = structure(c(2017, 2017, 2017, 2017, 2017, 
2017.08333333333, 2017.08333333333, 2017.08333333333, 2017.08333333333, 
2017.08333333333, 2017.16666666667, 2017.16666666667, 2017.16666666667, 
2017.16666666667, 2017.16666666667, 2017.25, 2017.25, 2017.25, 
2017.25, 2017.25), class = "yearmon"), user_id = c(1L, 2L, 3L, 
4L, 5L, 1L, 3L, 5L, 7L, 9L, 2L, 4L, 6L, 8L, 10L, 1L, 3L, 6L, 
9L, 12L)), .Names = c("date", "user_id"), row.names = c(NA, -20L
), class = "data.frame")

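【Comments】:

What about something like library(data.table);setDT(data);dcast(data[,cohort:=min(date),by=user_id],cohort~date)?
However, if a user transacts more than once in a month (for example, if user_id 1 makes two transactions in January), the code above counts 6 for January. Hope that makes sense.
Yep, that makes sense. If I understand you correctly, you can wrap the data frame in unique. See my answer.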

Tags: r dplyr time-series zoo

【Solution 1】:

Here is an example:

library(data.table)
library(zoo)
data <- structure(list(date = structure(c(2017, 2017, 2017, 2017, 2017, 
2017.08333333333, 2017.08333333333, 2017.08333333333, 2017.08333333333, 
2017.08333333333, 2017.16666666667, 2017.16666666667, 2017.16666666667, 
2017.16666666667, 2017.16666666667, 2017.25, 2017.25, 2017.25, 
2017.25, 2017.25), class = "yearmon"), user_id = c(1L, 2L, 3L, 
4L, 5L, 1L, 3L, 5L, 7L, 9L, 2L, 4L, 6L, 8L, 10L, 1L, 3L, 6L, 
9L, 12L)), .Names = c("date", "user_id"), row.names = c(NA, -20L
), class = "data.frame")
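# Duplicate the first row to simulate a user transacting twice in one month
data <- data[c(1,1:nrow(data)),]
setDT(data)
# unique() drops the repeated row; cohort is the month of each user's first transaction
(cohorts <- dcast(unique(data)[,cohort:=min(date),by=user_id],cohort~date))
# Month labels are locale-dependent: "Mrz" below is March in a German locale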
#      cohort Jan 2017 Feb 2017 Mrz 2017 Apr 2017
# 1: Jan 2017        5        3        2        2
# 2: Feb 2017        0        2        0        1
# 3: Mrz 2017        0        0        3        1
# 4: Apr 2017        0        0        0        1

m <- as.matrix(cohorts[,-1])
rownames(m) <- cohorts[[1]]
m[lower.tri(m)] <- NA
names(dimnames(m)) <- c("cohort", "yearmon") 
m
#           yearmon
# cohort     Jan 2017 Feb 2017 Mrz 2017 Apr 2017
#   Jan 2017        5        3        2        2
#   Feb 2017       NA        2        0        1
#   Mrz 2017       NA       NA        3        1
#   Apr 2017       NA       NA       NA        1
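
Since the question also asks about base R, here is a minimal sketch of the same idea using only `ave()` and `table()`. It assumes the months are stored as first-of-month `Date` values rather than `yearmon` (an assumption made so that the sketch needs no packages); the variable names `month_starts`, `first_seen`, and `month_levels` are illustrative only.

```r
# Sample data from the question; months are stored as first-of-month Dates
# (an assumption made here so the sketch needs no packages).
month_starts <- as.Date(c("2017-01-01", "2017-02-01", "2017-03-01", "2017-04-01"))
data <- data.frame(
  date    = rep(month_starts, each = 5),
  user_id = c(1, 2, 3, 4, 5, 1, 3, 5, 7, 9, 2, 4, 6, 8, 10, 1, 3, 6, 9, 12)
)

# Count each user at most once per month.
data <- unique(data)

# Cohort = month of each user's first transaction.
first_seen <- ave(as.numeric(data$date), data$user_id, FUN = min)
data$cohort <- as.Date(first_seen, origin = "1970-01-01")

# Use the same month levels on both axes so the table stays square
# even if some month has no new users.
month_levels <- format(sort(unique(data$date)), "%Y-%m")
m <- unclass(table(
  cohort = factor(format(data$cohort, "%Y-%m"), levels = month_levels),
  month  = factor(format(data$date, "%Y-%m"), levels = month_levels)
))

# A cohort cannot transact before it exists, so blank out the lower triangle.
m[lower.tri(m)] <- NA
m
```

On the question's sample data this should reproduce the counts in the desired output, with the January cohort row reading 5, 3, 2, 2. Sharing the factor levels between rows and columns is a design choice that keeps `lower.tri()` aligned with "month before the cohort started".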

【Comments】:

【Solution 2】:

This is also possible with tidyverse functions:

library(tidyverse)
library(lubridate)

transactions <- tibble(
  month=ymd(c("2017-01-01", "2017-01-01", "2017-02-01", "2017-02-01", "2017-03-01")),
  user_id=c(1, 2, 1, 3, 3)
)
#  Jan  1
#  Jan  2
#  Feb  1
#  Feb  3
#  Mar  3

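# mark the cohort of each user: the month of their first transaction
# (summarise() returns one row per user even if that first month
# contains repeated transactions, so the join below does not duplicate rows)
users <- transactions %>%
  group_by(user_id) %>%
  summarise(cohort = min(month))
users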

transactions %>%
  group_by(month, user_id) %>%
  distinct() %>%
  left_join(users, by = 'user_id') %>%
  xtabs(~ cohort + month, data = .)
#            month
# cohort     2017-01-01 2017-02-01 2017-03-01
# 2017-01-01          2          1          0
# 2017-02-01          0          1          1

【Comments】:

If you want a tibble as output, you can use:
```
transactions %>%
  group_by(month, user_id) %>%
  distinct() %>%
  left_join(users, by = 'user_id') %>%
  group_by(cohort, month) %>%
  count(cohort, month) %>%
  arrange(cohort, month) %>%
  pivot_wider(names_from = month, values_from = n)
```
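
To tie the tibble-returning variant back to the question, here is a hedged sketch that runs the same steps on the question's sample data. It assumes the months are stored as first-of-month `Date` values instead of `yearmon`, and the name `cohort_counts` is introduced here for illustration. `values_fill = 0` is used so that cohort/month pairs with no transactions show 0 rather than `NA`; note that this also fills the months before a cohort started with 0.

```r
library(dplyr)
library(tidyr)

# Sample data from the question, with months as first-of-month Dates (assumption).
month_starts <- as.Date(c("2017-01-01", "2017-02-01", "2017-03-01", "2017-04-01"))
transactions <- tibble(
  month   = rep(month_starts, each = 5),
  user_id = c(1, 2, 3, 4, 5, 1, 3, 5, 7, 9, 2, 4, 6, 8, 10, 1, 3, 6, 9, 12)
)

cohort_counts <- transactions %>%
  distinct(month, user_id) %>%          # one row per user per month
  group_by(user_id) %>%
  mutate(cohort = min(month)) %>%       # month of the user's first transaction
  ungroup() %>%
  count(cohort, month) %>%              # users per cohort and month
  pivot_wider(names_from = month, values_from = n, values_fill = 0)

cohort_counts
```

The result has one row per cohort and one column per transaction month, so it can be joined or plotted like any other tibble.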