【Question title】: How to reindex a datetime-based MultiIndex in pandas
【Posted】: 2023-01-13 16:06:21
【Question】:

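I have a DataFrame that counts the number of events per user per day. A user may have 0 events on a given day, and (because the table is an aggregate of a raw event log) rows with 0 events are missing from the DataFrame. I would like to add these missing rows and group the data by week, so that each user has one entry per week (including 0 where applicable).

Here is an example of my input: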

import numpy as np
import pandas as pd

np.random.seed(42)

df = pd.DataFrame({
    "person_id": np.arange(3).repeat(5),
    "date": pd.date_range("2022-01-01", "2022-01-15", freq="D"),
    "event_count": np.random.randint(1, 7, 15),
})

# end of each week
# Note: week 2022-01-23 is not in df, but should be part of the result
desired_index = pd.to_datetime(["2022-01-02", "2022-01-09", "2022-01-16", "2022-01-23"])

df
|    |   person_id | date                |   event_count |
|---:|------------:|:--------------------|--------------:|
|  0 |           0 | 2022-01-01 00:00:00 |             4 |
|  1 |           0 | 2022-01-02 00:00:00 |             5 |
|  2 |           0 | 2022-01-03 00:00:00 |             3 |
|  3 |           0 | 2022-01-04 00:00:00 |             5 |
|  4 |           0 | 2022-01-05 00:00:00 |             5 |
|  5 |           1 | 2022-01-06 00:00:00 |             2 |
|  6 |           1 | 2022-01-07 00:00:00 |             3 |
|  7 |           1 | 2022-01-08 00:00:00 |             3 |
|  8 |           1 | 2022-01-09 00:00:00 |             3 |
|  9 |           1 | 2022-01-10 00:00:00 |             5 |
| 10 |           2 | 2022-01-11 00:00:00 |             4 |
| 11 |           2 | 2022-01-12 00:00:00 |             3 |
| 12 |           2 | 2022-01-13 00:00:00 |             6 |
| 13 |           2 | 2022-01-14 00:00:00 |             5 |
| 14 |           2 | 2022-01-15 00:00:00 |             2 |

This is the result I want:

|    |   person_id | level_1             |   event_count |
|---:|------------:|:--------------------|--------------:|
|  0 |           0 | 2022-01-02 00:00:00 |             9 |
|  1 |           0 | 2022-01-09 00:00:00 |            13 |
|  2 |           0 | 2022-01-16 00:00:00 |             0 |
|  3 |           0 | 2022-01-23 00:00:00 |             0 |
|  4 |           1 | 2022-01-02 00:00:00 |             0 |
|  5 |           1 | 2022-01-09 00:00:00 |            11 |
|  6 |           1 | 2022-01-16 00:00:00 |             5 |
|  7 |           1 | 2022-01-23 00:00:00 |             0 |
|  8 |           2 | 2022-01-02 00:00:00 |             0 |
|  9 |           2 | 2022-01-09 00:00:00 |             0 |
| 10 |           2 | 2022-01-16 00:00:00 |            20 |
| 11 |           2 | 2022-01-23 00:00:00 |             0 |

I can produce it with the following:

(
    df
    .groupby(["person_id", pd.Grouper(key="date", freq="W")]).sum()
    .groupby("person_id").apply(
        lambda df: (
            df
            .reset_index(drop=True, level=0)
            .reindex(desired_index, fill_value=0))
        )
    .reset_index()
)

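However, according to the documentation for reindex, I should be able to pass level=1 as a kwarg and use it directly, without a second groupby. When I do this, though, I get an "inner join" of the two indexes instead of an "outer join":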

result = (
    df
    .groupby(["person_id", pd.Grouper(key="date", freq="W")]).sum()
    .reindex(desired_index, level=1)
    .reset_index()
)
|    |   person_id | date                |   event_count |
|---:|------------:|:--------------------|--------------:|
|  0 |           0 | 2022-01-02 00:00:00 |             9 |
|  1 |           0 | 2022-01-09 00:00:00 |            13 |
|  2 |           1 | 2022-01-09 00:00:00 |            11 |
|  3 |           1 | 2022-01-16 00:00:00 |             5 |
|  4 |           2 | 2022-01-16 00:00:00 |            20 |

Why does this happen, and how should I use df.reindex correctly?


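I found a similar SO question about reindexing a level of a MultiIndex, but the accepted answer there uses df.unstack, which does not work for me, because not every value of my desired index occurs in my current index (and vice versa).

【Comments】:

    Tags: python pandas multi-index datetimeindex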


    【Solution 1】:

    Use:

    mux = pd.MultiIndex.from_product([df['person_id'].unique(), desired_index], 
                                     names=['person_id','date'])
    result = (
        df
        .groupby(["person_id", pd.Grouper(key="date", freq="W")]).sum()
        .reindex(mux, fill_value=0)
        .reset_index()
    )
    print (result)
        person_id       date  event_count
    0           0 2022-01-02            9
    1           0 2022-01-09           13
    2           0 2022-01-16            0
    3           0 2022-01-23            0
    4           1 2022-01-02            0
    5           1 2022-01-09           11
    6           1 2022-01-16            5
    7           1 2022-01-23            0
    8           2 2022-01-02            0
    9           2 2022-01-09            0
    10          2 2022-01-16           20
    11          2 2022-01-23            0
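
As a hedged illustration of why the full product index is needed: judging by the output in the question, `reindex(..., level=1)` appears to broadcast the new labels only across (person_id, date) pairs that already exist, so it cannot create missing combinations. The sketch below hard-codes the `event_count` values from the question's table (instead of relying on the random seed) and compares both approaches; the names `weekly`, `by_level` and `by_product` are made up for this example.

```python
import pandas as pd

# Same data as in the question, with event_count written out explicitly.
df = pd.DataFrame({
    "person_id": [0] * 5 + [1] * 5 + [2] * 5,
    "date": pd.date_range("2022-01-01", "2022-01-15", freq="D"),
    "event_count": [4, 5, 3, 5, 5, 2, 3, 3, 3, 5, 4, 3, 6, 5, 2],
})
desired_index = pd.to_datetime(
    ["2022-01-02", "2022-01-09", "2022-01-16", "2022-01-23"]
)

# Weekly sums per person (weeks end on Sunday).
weekly = df.groupby(
    ["person_id", pd.Grouper(key="date", freq="W")]
)["event_count"].sum()

# Reindexing a single level: only pairs already present survive.
by_level = weekly.reindex(desired_index, level=1)

# Reindexing against every (person_id, week) pair: missing pairs are
# created and filled with 0.
mux = pd.MultiIndex.from_product(
    [df["person_id"].unique(), desired_index],
    names=["person_id", "date"],
)
by_product = weekly.reindex(mux, fill_value=0)

print(len(by_level), len(by_product))
print(by_product.reset_index())
```

Here `by_level` has the same five rows as the attempt in the question, while `by_product` has one row for each of the 3 x 4 combinations.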
    

    【Comments】:
