Matching on a variable, according to dates, and calculating ratios - answers

【Question title】: Matching on a variable, according to dates, and calculating ratios
【Posted】: 2018-09-13 21:02:59
【Question】:

I have a data frame, let's call it df1, that looks like this:

month            product_key          price
201408           00020e32-a64715      75
201408           00020e32-a64715      75
201408           000340b8-bacac8      20
201408           000458f1-fdb6ae      45
201408           00083ebb-e9c17f      250
201408           00207e67-15a59f      480
201408           002777d7-50bec1      12
201408           002777d7-50bec1      12
201409           00020e32-a64715      75
201409           000340b8-bacac8      20
201409           00083ebb-e9c17f      250
201409           00207e67-15a59f      480
201409           00207e67-15a59f      480
201409           00207e67-15a59f      480
201410           00083ebb-e9c17f      250
201410           00207e67-15a59f      480
201410           00207e67-15a59f      480
201410           0020baff-9730f0      39.99
201411           00083ebb-e9c17f      250
201411           00207e67-15a59f      480
201412           00083ebb-e9c17f      250
201501           00083ebb-e9c17f      200
201501           0020baff-9730f0      29.99

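There are other variables in the data set, but we don't need them. My data set is monthly, ranging from mid-2014 to late 2015. There are hundreds of products each month, and the same product can appear several times within a month.

What I want to do is identify the products that appear in both August and September, and drop the products that do not appear in both months. Then I want to calculate the average price of the remaining products for each month, and divide the September average price by the August average price. In my data frame, this calculated number is the index for September (August defaults to 1, because that is where the data set starts).

Then I want to do this for all following months: identify the products that appear in both September and October, drop the products that do not appear in both months, and calculate the average price (of the remaining products) for each month. Then I want to divide the October average price by the September average price (this September average will differ from the one calculated earlier, because the products that appear in both September and October are likely to differ from those that appear in both August and September). This calculated number is the index for October. I want to do the same for all following months (October and November, November and December, December and January, January and February, and so on).

My resulting data frame would ideally look like this (using arbitrary numbers for the index):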

month        index
201408       1
201409       1.0005      
201410       1.0152
201411       0.9997
201412       0.9551
201501       0.8985
201502       0.9754
201503       1.0045
201504       1.1520
201505       1.0148
201506       1.0452
201507       0.9945
201508       0.9751
201509       1.0004
201510       1.0415

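When I try to do this, I end up matching products over the whole data set rather than over two consecutive months. I could do it by breaking the data set up into many data sets, one per month, but that seems long and tedious. I'm sure there is a quicker way to do this?

You can use the code below to create a test data set: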

month <- c("201408", "201408", "201408", "201408", "201408", "201408", "201408", "201408", "201409", "201409", "201409", "201409", "201409", "201409", "201410", "201410", "201410", "201410", "201411", "201411", "201412", "201501", "201501")
product_key <- c("00020e32-a64715", "00020e32-a64715", "000340b8-bacac8", "000458f1-fdb6ae", "00083ebb-e9c17f", "00083ebb-e9c17f", "002777d7-50bec1", "002777d7-50bec1", "00020e32-a64715", "000340b8-bacac8", "00083ebb-e9c17f", "00207e67-15a59f", "00207e67-15a59f", "00207e67-15a59f", "00083ebb-e9c17f", "00207e67-15a59f", "00207e67-15a59f", "0020baff-9730f0", "00083ebb-e9c17f", "00207e67-15a59f", "00083ebb-e9c17f", "00083ebb-e9c17f", "0020baff-9730f0")
price <- c("75", "75", "20", "45", "250", "480", "12", "12", "75", "20", "250", "480", "480", "480", "250", "480", "480", "39.99", "250", "480", "250", "200", "29.99")
df1 <- data.frame(month, product_key, price)

As an example of how I want it to work, here is what I did to create the index for August and September.

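library(dplyr)

# Average price per product in August (price is stored as text, so convert it first)
DF1Aug <- df1 %>%
  filter(month %in% "201408") %>%
  group_by(product_key) %>%
  summarize(aveprice = mean(as.numeric(as.character(price))))

# Average price per product in September
DF1Sept <- df1 %>%
  filter(month %in% "201409") %>%
  group_by(product_key) %>%
  summarize(aveprice = mean(as.numeric(as.character(price))))

# Keep only products present in both months, then take the ratio of the two averages
SeptPriceIndex <- merge(DF1Aug, DF1Sept, by = "product_key", suffixes = c("_Aug", "_Sept")) %>%
  mutate(AugAvgPrice = mean(aveprice_Aug)) %>%
  mutate(SeptAvgPrice = mean(aveprice_Sept)) %>%
  mutate(priceIndex = SeptAvgPrice / AugAvgPrice)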

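However, this is obviously a tedious process to repeat for the roughly 20 months in my data frame (and I need to do it on several data frames), so I would like to find a way to automate it.

【Question comments】:

Tags: r date dataframe dplyr data-manipulation

【Solution 1】:

Something like the following can be done with dplyr and tidyr (updated):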

library(dplyr)
library(tidyr)  # complete() comes from tidyr

df1 %>% 
  # ensure data is sorted so that months are sequential by product key:
  arrange(product_key, month) %>% 
  # ensure every product month combo exists:
  complete(product_key, month) %>%  
  # create a sequential id within each product:
  group_by(product_key) %>% 
  mutate(grp_seq = row_number()) %>% 
  # remove product / month pairs without a price:
  filter(!is.na(price)) %>%
  # remove product keys that appear in only one month:
  filter(n_distinct(month) > 1) %>% 
  # remove non-consecutive product / month pairs:
  filter(lead(grp_seq) - 1 == grp_seq | lag(grp_seq) + 1 == grp_seq) %>% 
  # summarize the average price by month:
  group_by(month) %>% 
  summarize(avg_price = mean(as.numeric(as.character(price)))) %>%
  # calculate the price index:
  mutate(index_price = avg_price / lag(avg_price)) 

# A tibble: 6 x 3
  month  avg_price index_price
  <chr>      <dbl>       <dbl>
1 201408      180.      NA    
2 201409      298.       1.65 
3 201410      403.       1.36 
4 201411      365.       0.905
5 201412      250.       0.685
6 201501      200.       0.800

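【Comments】:

Thank you - so in your first example I end up with a price index for each product, but I want a price index for each month. I understand that is your second example - but your second example does not match products from the previous month to the current month. The months do need to be in chronological order, so I only want to match products for August and September, September and October, October and November, etc. I would like my resulting data frame to look like the second example you gave, but to include the first element. So, matching products chronologically. Does this make sense? @sbha
Thanks - this is closer to what I need, but still not quite right. In the example, because the same product appears twice in August, the code you provided includes this product in the August average price, but it should not, because this product does not appear in September. So where a product appears more than once in a month, I think the code picks it up and includes it in the average price even if it does not appear in the following month, when it should not. @sbha
Sorry - I just realised another thing I may not have made completely clear. Two averages need to be calculated for each month (except for the first and last months in the data set), because the products that appear in two consecutive months, i.e. in both August and September, differ from the products that appear in the next two consecutive months, i.e. in both September and October. So September will have one average for the products that appear in August and September, and another September average for the products that appear in September and October.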

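【Solution 2】:

The OP wants to obtain a price index for two subsequent months by computing the average of all recorded prices of all recurring products and then dividing the monthly average prices.

This may be what the OP intended, but I am not convinced it is the right approach:

According to the OP, the same product can appear several times within a month. So, if a product has more recorded prices than other products, it will have a greater influence on the monthly average price and therefore on the price index.
Products with higher prices will dominate the monthly average price. Price changes of cheaper products will therefore be less visible in the price index.

Example

Here is a made-up example to explain what I mean. Suppose we have two products. Product A is expensive and has two recorded prices in April, but its average price does not change in May. Product B is cheap, but its price has doubled in May. So my expectation is that the price index will reflect this increase.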

library(data.table)
example <- fread(
  "month   product_key price
  201704   A           90
  201704   A           110
  201704   B           1
  201705   A           100
  201705   B           2")

# OP's approach
example[, .(avg_price = mean(price)), by = month][
  , price_index := avg_price / shift(avg_price)][]

    month avg_price price_index
1: 201704        67          NA
2: 201705        51    0.761194

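So, according to the OP's approach, the price index has fallen.

I think the right approach is to

compute the average monthly price for each product
compute the price index for each product across subsequent months
compute the average price index of the products for each month

(Apologies for using data.table syntax, which I am more familiar with. I tried dplyr syntax, but it took me too much time.)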

# compute average monthly price for each product
tmp1 <- example[, .(avg_price = mean(price)), keyby = .(product_key, month)]
tmp1

   product_key  month avg_price
1:           A 201704       100
2:           A 201705       100
3:           B 201704         1
4:           B 201705         2

# compute price index for each product
tmp2 <- tmp1[, price_index := avg_price / shift(avg_price), by = product_key][]
tmp2

   product_key  month avg_price price_index
1:           A 201704       100          NA
2:           A 201705       100           1
3:           B 201704         1          NA
4:           B 201705         2           2

# compute average price index
tmp2[, .(avg_price_index = mean(price_index, na.rm = TRUE)), by = month]

    month avg_price_index
1: 201704             NaN
2: 201705             1.5

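Now, in line with my expectation (which may not be the OP's), the price index shows an increase.

Computing the price index over several months

The OP has asked to compute the price index over several months, but only for products that appear in subsequent months. This can be solved by a self join with shifted months.

Note that a simple lag() or shift() is dangerous here, because it relies on row order and will fail if months are missing. Therefore, date arithmetic is used to find the correct next month.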
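To illustrate the row-order hazard, here is a minimal base R sketch using the one product in the question's test data that was recorded in 201410 and 201501 only. The data frame p and the helper prev_month() are made-up names for this illustration, not part of the answer's code.

```r
# Average prices of product 0020baff-9730f0 from the question's test data.
# It was recorded in 201410 and 201501 only; the months in between are missing.
p <- data.frame(month = c("201410", "201501"),
                avg_price = c(39.99, 29.99),
                stringsAsFactors = FALSE)

# Row-order lag (what a plain lag()/shift() does): take the previous row,
# whatever month that row belongs to.
p$prev_by_row <- c(NA, head(p$avg_price, -1))

# Hypothetical helper: the calendar month before a "YYYYMM" string.
prev_month <- function(m) {
  y  <- as.integer(substr(m, 1, 4))
  mo <- as.integer(substr(m, 5, 6))
  sprintf("%04d%02d", y - (mo == 1L), (mo + 10L) %% 12L + 1L)
}

# Calendar-based lookup: find the price recorded in the preceding month, if any.
p$prev_by_date <- p$avg_price[match(prev_month(p$month), p$month)]

print(p)
```

The row-order version pairs January 2015 with October 2014, three months apart, while the calendar-based lookup finds no price for the preceding month in either row and leaves both as NA.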

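The self join approach has the additional benefit that only recurring products are taken into account. If a product_key has no match in the following month, price will be NA. These entries are dropped when the average price index is computed.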

library(data.table)
library(magrittr)
DF2 <- setDT(DF1)[
  # convert price from factor to numeric
  , price := price %>% as.character() %>% as.numeric()][
    # convert character month to Date
    , month := month %>% lubridate::ymd(truncated = 1L)][
      # compute average monthly price for each product
      , .(avg_price = mean(price)), keyby = .(product_key, month)]

# self join with subsequent month 
DF2[DF2[, .(product_key, month = month + months(1), avg_price)],
    on = .(product_key, month)][
      # compute price index for each product
      , price_index := avg_price / i.avg_price][
        # compute average price index
        , .(avg_price_index = mean(price_index, na.rm = TRUE)), by = month]

        month avg_price_index
1: 2014-09-01       0.8949772
2: 2014-10-01       1.0000000
3: 2014-11-01       1.0000000
4: 2014-12-01       1.0000000
5: 2015-01-01       0.8000000
6: 2015-02-01             NaN

Data

As provided by the OP

month <- c("201408", "201408", "201408", "201408", "201408", "201408", "201408", "201408", "201409", "201409", "201409", "201409", "201409", "201409", "201410", "201410", "201410", "201410", "201411", "201411", "201412", "201501", "201501")
product_key <- c("00020e32-a64715", "00020e32-a64715", "000340b8-bacac8", "000458f1-fdb6ae", "00083ebb-e9c17f", "00083ebb-e9c17f", "002777d7-50bec1", "002777d7-50bec1", "00020e32-a64715", "000340b8-bacac8", "00083ebb-e9c17f", "00207e67-15a59f", "00207e67-15a59f", "00207e67-15a59f", "00083ebb-e9c17f", "00207e67-15a59f", "00207e67-15a59f", "0020baff-9730f0", "00083ebb-e9c17f", "00207e67-15a59f", "00083ebb-e9c17f", "00083ebb-e9c17f", "0020baff-9730f0")
price <- c("75", "75", "20", "45", "250", "480", "12", "12", "75", "20", "250", "480", "480", "480", "250", "480", "480", "39.99", "250", "480", "250", "200", "29.99")
DF1 <- data.frame(month, product_key, price)

Note that all columns are factors (this is the data.frame() default before R 4.0.0).
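This is why the code above goes through as.character() before as.numeric(). A small sketch of the difference, using three prices from the data; the vector name f is made up for this illustration:

```r
# A factor built from price strings, as data.frame() produced before R 4.0.0.
f <- factor(c("75", "20", "39.99"))

# Converting the factor directly returns the internal level codes
# (levels are sorted as text: "20", "39.99", "75"), not the prices.
codes <- as.numeric(f)

# Converting via character returns the actual prices.
prices <- as.numeric(as.character(f))

print(codes)
print(prices)
```

The first conversion yields 3, 1, 2 and the second yields 75, 20, 39.99.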

【Comments】: