Fill in missing values in dbplyr: answers

【Question title】: Fill in missing values in dbplyr
【Posted】: 2020-01-14 22:48:26
【Question】:

I have the following data in my database:

ID    month_year   value
1     01/06/2014   10
1     01/07/2014   100
1     01/10/2014   25

I want to fill in the missing months:

ID    month_year   value
1     01/06/2014   10
1     01/07/2014   100
1     01/08/2014   NA
1     01/09/2014   NA
1     01/10/2014   25

I am using the bigrquery package with dbplyr. I know this is possible in BigQuery itself with UNNEST(GENERATE_DATE_ARRAY(... but I can't get it to work through dbplyr. It may be related to this GitHub issue.

【Comments】:

Tags: r dbplyr bigrquery

【Solution 1】:

You can do this with an outer join.

list_of_dates = data_with_missing_dates %>%
  select(month_year) %>%
  distinct()

data_with_filled_dates = data_with_missing_dates %>%
  right_join(list_of_dates, by = "month_year")

These are all standard dplyr verbs, so dbplyr can translate them into BigQuery SQL.
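One way to check that translation without touching a real database is to render the SQL against a simulated connection. The sketch below assumes dbplyr's lazy_frame() as a stand-in for the remote table; the column values are placeholders, and the SQL it prints is generic rather than BigQuery-specific.

```r
library(dplyr)
library(dbplyr)

# Hypothetical stand-in for the remote table. lazy_frame() builds a lazy table
# on a simulated connection, so no query is ever sent anywhere.
data_with_missing_dates <- lazy_frame(
  ID = 1, month_year = "2014-06-01", value = 10,
  .name = "data_with_missing_dates"
)

list_of_dates <- data_with_missing_dates %>%
  select(month_year) %>%
  distinct()

query <- data_with_missing_dates %>%
  right_join(list_of_dates, by = "month_year")

# Render the query as SQL text instead of executing it.
generated_sql <- as.character(sql_render(query))
cat(generated_sql, "\n")
```

On a real bigrquery connection, calling show_query() on the same pipeline should print the SQL that would actually be sent.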

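The above assumes that your existing data already contains every date you want in the final output (possibly spread across different ID values), so that list_of_dates can be built from your initial dataset.

If a date that you want in the final output does not appear for any ID in the initial data, you will need to build list_of_dates some other way. In that case even complete() on its own would not be enough.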
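As a minimal sketch of building list_of_dates "some other way", the calendar can be generated locally with seq(). This assumes month_year is stored as a date (the question does not say), and the two bounds below are illustrative values taken from the sample data.

```r
library(dplyr)

# Assumed bounds: the first and last month wanted in the output.
first_month <- as.Date("2014-06-01")
last_month  <- as.Date("2014-10-01")

# One row per calendar month, including months absent from the data.
list_of_dates <- tibble(
  month_year = seq(first_month, last_month, by = "month")
)

print(list_of_dates)
```

This table lives in local R memory, so it would still have to be uploaded to the database (for example with copy_to(), if the backend connection supports it) before it can be joined to a remote table.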

Edit, so that each ID has its own start and end:

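# Every distinct date, plus a constant column to join on
list_of_dates = data_with_missing_dates %>%
  select(month_year) %>%
  distinct() %>%
  mutate(placeholder = 1)

# First and last date per ID, with the same constant column
date_limits = data_with_missing_dates %>%
  group_by(ID) %>%
  summarise(min_date = min(month_year),
            max_date = max(month_year)) %>%
  mutate(placeholder = 1)

# dplyr has no outer_join(); full_join() is its full outer join.
# Joining on the constant placeholder pairs every ID with every date.
data_with_filled_dates = date_limits %>%
  full_join(list_of_dates, by = "placeholder") %>%
  # keep only the dates inside each ID's own range
  filter(min_date <= month_year,
         max_date >= month_year) %>%
  select(ID, month_year) %>%
  # bring the values back; missing months get NA
  left_join(data_with_missing_dates, by = c("ID", "month_year"))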
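To see what the per-ID approach produces, here is a sketch that runs it on a local tibble holding the question's sample data. It makes two assumptions: month_year is a Date column, and list_of_dates is generated with seq() so that it contains the months missing from the data (deriving it from the data alone would not add August or September here). It uses full_join(), dplyr's full outer join.

```r
library(dplyr)

# The sample data from the question, as a local tibble.
data_with_missing_dates <- tibble(
  ID = c(1, 1, 1),
  month_year = as.Date(c("2014-06-01", "2014-07-01", "2014-10-01")),
  value = c(10, 100, 25)
)

# Assumption: a complete month calendar, generated rather than read from the data.
list_of_dates <- tibble(
  month_year = seq(as.Date("2014-06-01"), as.Date("2014-10-01"), by = "month"),
  placeholder = 1
)

# First and last month per ID.
date_limits <- data_with_missing_dates %>%
  group_by(ID) %>%
  summarise(min_date = min(month_year),
            max_date = max(month_year)) %>%
  mutate(placeholder = 1)

data_with_filled_dates <- date_limits %>%
  full_join(list_of_dates, by = "placeholder") %>%  # every ID x every month
  filter(min_date <= month_year,
         max_date >= month_year) %>%                # trim to each ID's range
  select(ID, month_year) %>%
  left_join(data_with_missing_dates, by = c("ID", "month_year")) %>%
  arrange(ID, month_year)

print(data_with_filled_dates)
```

The result has five rows for ID 1, with NA in value for August and September 2014. With several IDs, recent dplyr versions may warn about a many-to-many join at the full_join() step; that pairing of every ID with every date is intended here.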

【Comments】: