Pandas Groupby nunique count based on grouping of 2 date lists: answers

【Question title】: Pandas Groupby nunique count based on grouping of 2 date lists
【Posted】: 2020-05-21 21:49:40
【Question description】:

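Similar to this question, but with one extra step: Rolling groupby nunique count based on start and end dates

I have a DataFrame with a unique ID, start date, end date, start year, and end year. An ID can start, stop, and restart during this time frame.

I would like to get a groupby nunique count of IDs for each month of each year. Currently I can count the unique IDs across the start and end dates, but how do I correctly incorporate the year as well?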

import pandas as pd

fun = pd.DataFrame({'ZIP_KEY': ['A', 'B', 'A'],
                    'start_month': [1, 2, 2],
                    'end_month': [4, 3, 7],
                    'start_year': [2016, 2016, 2017],
                    'end_year': [2016, 2017, 2018]})

# Build the list of months and the list of years each row covers (independently of each other).
fun["month_list"] = fun.apply(lambda x: list(range(x["start_month"], x["end_month"]+1)), axis=1)
fun["year_list"] = fun.apply(lambda x: list(range(x["start_year"], x["end_year"]+1)), axis=1)

# Expand to one row per (month, year) combination.
fun = fun.explode("month_list")
fun = fun.explode("year_list")

fun.groupby(["year_list", "month_list"])["ZIP_KEY"].nunique()


year_list  month_list
2016       1             1
           2             2
           3             2
           4             1
2017       2             2
           3             2
           4             1
           5             1
           6             1
           7             1
2018       2             1
           3             1
           4             1
           5             1
           6             1
           7             1

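If a Zip Key spans multiple years, my current approach does not account for the full years in between. For example, a key that starts in January 2018 and ends in February 2020 yields months [1, 2] and years [2018, 2019, 2020], rather than complete years for 2018 and 2019. I should instead get counts for months [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12] in both 2018 and 2019, and for months [1, 2] in 2020.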
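The limitation described above can be reproduced in isolation. The sketch below uses a single hypothetical row (key 'C', not part of the original data) whose month and year lists are what the approach would build for January 2018 through February 2020. Exploding the two lists one after the other pairs every month in the list with every year, so only six (year, month) pairs appear instead of the 26 months the key actually covers.

```python
import pandas as pd

# Hypothetical single row: starts in January 2018, ends in February 2020.
one = pd.DataFrame({'ZIP_KEY': ['C'],
                    'month_list': [[1, 2]],
                    'year_list': [[2018, 2019, 2020]]})

# Two successive explodes produce the cross product of the two lists.
one = one.explode('month_list').explode('year_list')

# Collect the (year, month) pairs that end up in the frame.
pairs = sorted(zip(one['year_list'], one['month_list']))
print(pairs)
```

Months 3 through 12 of 2018 and 2019 never show up, which is why the counts for multi-year keys come out too low.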

【Question discussion】:

Tags: python pandas

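【Solution 1】:

Similar to my other answer, but this time we use pd.date_range with the 'MS' (month start) frequency instead of range. It helps to first create datetime columns holding the first day of the month for each provided year-month combination.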

import pandas as pd

# Create start and end datetime column.
for per in ['start', 'end']:
    fun[per] = pd.to_datetime(fun[[f'{per}_year', f'{per}_month']]
                                  .rename(columns={f'{per}_year': 'year', f'{per}_month': 'month'})
                                  .assign(day=1))

# One row per month start between start and end (inclusive), for every key.
df = pd.concat([pd.DataFrame({'date': pd.date_range(st, en, freq='MS'), 'key': k})
                for k, st, en in zip(fun['ZIP_KEY'], fun['start'], fun['end'])])
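As a quick check of why the 'MS' frequency handles the multi-year case from the question, the sketch below expands the January 2018 to February 2020 example into month starts and counts how many fall in each year. The dates are taken from the question's example; the variable names are only illustrative.

```python
import pandas as pd

# Month starts from January 2018 through February 2020, both endpoints included.
months = pd.date_range('2018-01-01', '2020-02-01', freq='MS')

# Count how many month starts fall in each calendar year.
per_year = pd.Series(months.year).value_counts().sort_index()

print(len(months))
print(per_year.to_dict())
```

This gives 26 months in total: twelve each for 2018 and 2019, and two for 2020, which is the expansion the question asks for.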

Now group to get the output. If you want year and month as separate index levels:

df.groupby([df.date.dt.year.rename('year'), df.date.dt.month.rename('month')]).key.nunique()

year  month
2016  1        1 # <━┓
      2        2 # <━╋━━┓ 
      3        2 #   A  ┃
      4        2 # <━┛  ┃
      5        1 #      ┃
      6        1 #      ┃
      7        1 #      ┃
      8        1 #      B
      9        1 #      ┃
      10       1 #      ┃
      11       1 #      ┃
      12       1 #      ┃
2017  1        1 #      ┃
      2        2 # <━━━━╋━┓    
      3        2 # <━━━━┛ ┃
      4        1 #        ┃
      5        1 #        ┃
      6        1 #        ┃
      7        1 #        ┃
      8        1 #        ┃
      9        1 #        ┃
      10       1 #        A
      11       1 #        ┃
      12       1 #        ┃
2018  1        1 #        ┃
      2        1 #        ┃
      3        1 #        ┃
      4        1 #        ┃
      5        1 #        ┃
      6        1 #        ┃
      7        1 # <━━━━━━┛

Sometimes I prefer to group by period instead:

df.groupby(df.date.dt.to_period('M')).key.nunique()
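For reference, here is a minimal end-to-end sketch of the period-based variant, starting from the sample data in the question. It follows the same steps as above; the helper names `parts`, `by_period`, and `counts` are illustrative and not part of the original answer.

```python
import pandas as pd

fun = pd.DataFrame({'ZIP_KEY': ['A', 'B', 'A'],
                    'start_month': [1, 2, 2],
                    'end_month': [4, 3, 7],
                    'start_year': [2016, 2016, 2017],
                    'end_year': [2016, 2017, 2018]})

# Build start/end datetimes on the first day of each given month.
for per in ['start', 'end']:
    parts = pd.DataFrame({'year': fun[f'{per}_year'],
                          'month': fun[f'{per}_month'],
                          'day': 1})
    fun[per] = pd.to_datetime(parts)

# One row per covered month for every key.
df = pd.concat([pd.DataFrame({'date': pd.date_range(st, en, freq='MS'), 'key': k})
                for k, st, en in zip(fun['ZIP_KEY'], fun['start'], fun['end'])],
               ignore_index=True)

# Unique keys per monthly period; the result is indexed by a PeriodIndex.
by_period = df.groupby(df['date'].dt.to_period('M'))['key'].nunique()

# Plain dict keyed by 'YYYY-MM' strings, convenient for lookups.
counts = {str(p): int(n) for p, n in by_period.items()}
print(by_period.head())
```

Grouping by period keeps a single index level such as 2016-01, which can be easier to sort, slice, or plot than a two-level (year, month) index.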

【Discussion】: