【Question title】: Filling continuous pandas dataframe from sparse dataframe
【Posted】: 2012-11-02 11:01:25
【Question description】:

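I have a dictionary named date_dict, keyed by datetime dates, with values corresponding to integer counts of observations. I convert this to a sparse Series/DataFrame with censored observations, which I would like to join or convert to a Series/DataFrame with continuous dates. The nasty list comprehension is my hack to get around the fact that pandas apparently won't automatically convert datetime date objects to an appropriate DateTime index.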

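import datetime
import pandas as pd

df1 = pd.DataFrame(data=list(date_dict.values()),
                   # promote each datetime.date to a datetime at midnight
                   index=[datetime.datetime.combine(i, datetime.time())
                          for i in date_dict.keys()],
                   columns=['Name'])
df1 = df1.sort_index()  # DataFrame.sort was removed; sort_index orders by date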
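
As a possible alternative to the list comprehension, `pd.to_datetime` accepts a list of `datetime.date` objects and returns a `DatetimeIndex`. The sketch below assumes a small stand-in `date_dict`; the sample values are hypothetical and only mirror the shape of the data described above.

```python
import datetime
import pandas as pd

# Hypothetical stand-in for the question's date_dict (date -> count).
date_dict = {
    datetime.date(2003, 8, 13): 1,
    datetime.date(2003, 6, 24): 2,
    datetime.date(2003, 8, 19): 2,
}

# pd.to_datetime converts the date keys into a DatetimeIndex in one call.
index = pd.to_datetime(list(date_dict.keys()))

# Keys and values of a dict iterate in the same order, so they stay aligned.
df1 = pd.DataFrame({'Name': list(date_dict.values())}, index=index)
df1 = df1.sort_index()  # order rows chronologically
```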

This example has 1258 observations, and the DateTime index runs from 2003-06-24 to 2012-11-07.

df1.head()
             Name
Date
2003-06-24   2
2003-08-13   1
2003-08-19   2
2003-08-22   1
2003-08-24   5

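I can create an empty DataFrame with a continuous DateTime index, but this introduces an unneeded column and seems clunky. I feel as though I'm missing a more elegant solution involving a join.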

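# pd.DateRange was removed from pandas; pd.date_range with freq='B'
# (business days) matches the weekday-only index shown below
df2 = pd.DataFrame(data=None, columns=['Empty'],
                   index=pd.date_range(min(date_dict.keys()),
                                       max(date_dict.keys()), freq='B'))
df3 = df1.join(df2, how='right')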
df3.head()
            Name    Empty
2003-06-24   2   NaN
2003-06-25  NaN  NaN
2003-06-26  NaN  NaN
2003-06-27  NaN  NaN
2003-06-30  NaN  NaN

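Is there a simpler or more elegant way to fill a continuous DataFrame from a sparse DataFrame so that (1) the index is continuous, (2) the NaNs are 0s, and (3) there is no left-over empty column in the DataFrame?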

            Name
2003-06-24   2
2003-06-25   0
2003-06-26   0
2003-06-27   0
2003-06-30   0

【Question comments】:

    Tags: python python-2.7 pandas


    【Solution 1】:

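    You can reindex a time series using your date range. You are also probably better off using a TimeSeries rather than a DataFrame (see the documentation), although reindexing is also the correct way to add missing index values to a DataFrame.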

    For example, starting with:

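    import pandas as pd

    # pd.datetime was removed from pandas; pd.Timestamp takes the same (year, month, day)
    date_index = pd.DatetimeIndex([pd.Timestamp(2003,6,24), pd.Timestamp(2003,8,13),
            pd.Timestamp(2003,8,19), pd.Timestamp(2003,8,22), pd.Timestamp(2003,8,24)])
    
    ts = pd.Series([2,1,2,1,5], index=date_index)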
    

    gives you a time series like the head of your example DataFrame:

    2003-06-24    2
    2003-08-13    1
    2003-08-19    2
    2003-08-22    1
    2003-08-24    5
    

    Simply doing

    ts.reindex(pd.date_range(min(date_index), max(date_index)))
    

    then gives you a complete index, with NaN for your missing values (you can use fillna if you want to fill the missing values with some other value, see here):

    2003-06-24     2
    2003-06-25   NaN
    2003-06-26   NaN
    2003-06-27   NaN
    2003-06-28   NaN
    2003-06-29   NaN
    2003-06-30   NaN
    2003-07-01   NaN
    2003-07-02   NaN
    2003-07-03   NaN
    2003-07-04   NaN
    2003-07-05   NaN
    2003-07-06   NaN
    2003-07-07   NaN
    2003-07-08   NaN
    2003-07-09   NaN
    2003-07-10   NaN
    2003-07-11   NaN
    2003-07-12   NaN
    2003-07-13   NaN
    2003-07-14   NaN
    2003-07-15   NaN
    2003-07-16   NaN
    2003-07-17   NaN
    2003-07-18   NaN
    2003-07-19   NaN
    2003-07-20   NaN
    2003-07-21   NaN
    2003-07-22   NaN
    2003-07-23   NaN
    2003-07-24   NaN
    2003-07-25   NaN
    2003-07-26   NaN
    2003-07-27   NaN
    2003-07-28   NaN
    2003-07-29   NaN
    2003-07-30   NaN
    2003-07-31   NaN
    2003-08-01   NaN
    2003-08-02   NaN
    2003-08-03   NaN
    2003-08-04   NaN
    2003-08-05   NaN
    2003-08-06   NaN
    2003-08-07   NaN
    2003-08-08   NaN
    2003-08-09   NaN
    2003-08-10   NaN
    2003-08-11   NaN
    2003-08-12   NaN
    2003-08-13     1
    2003-08-14   NaN
    2003-08-15   NaN
    2003-08-16   NaN
    2003-08-17   NaN
    2003-08-18   NaN
    2003-08-19     2
    2003-08-20   NaN
    2003-08-21   NaN
    2003-08-22     1
    2003-08-23   NaN
    2003-08-24     5
    Freq: D, Length: 62
    
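
The `fillna` step mentioned above could look roughly like the following sketch. It rebuilds the same five-point series, so it runs on its own; the names `full` and `filled` are illustrative and not from the original answer. Note that `reindex` introduces NaN, which turns an integer series into floats, so the sketch casts back to integers after filling.

```python
import pandas as pd

# Same five observations as in the answer's example.
date_index = pd.to_datetime(['2003-06-24', '2003-08-13', '2003-08-19',
                             '2003-08-22', '2003-08-24'])
ts = pd.Series([2, 1, 2, 1, 5], index=date_index)

# Reindex onto every calendar day between the first and last observation.
full = ts.reindex(pd.date_range(date_index.min(), date_index.max()))

# Missing days are NaN after reindexing; replace them with 0 and
# restore an integer dtype (NaN forced the series to float).
filled = full.fillna(0).astype(int)
```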

    【Comments】:

    • Thanks! I used ts.reindex(pd.date_range(min(date_index), max(date_index)), fill_value=0)
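
The same `fill_value=0` approach appears to carry over to the question's DataFrame, giving a continuous index, zeros instead of NaN, and no extra column. This is a sketch under assumptions: the five-row `df1` is a hypothetical sample mirroring the head of the question's data, and it uses calendar days. Passing `freq='B'` to `pd.date_range` would restrict the index to business days, but reindexing would then drop any observation that falls on a weekend (2003-08-24 in this sample is a Sunday).

```python
import pandas as pd

# Hypothetical sample mirroring the head of the question's df1.
df1 = pd.DataFrame(
    {'Name': [2, 1, 2, 1, 5]},
    index=pd.to_datetime(['2003-06-24', '2003-08-13', '2003-08-19',
                          '2003-08-22', '2003-08-24']),
)

# Continuous daily index spanning the first to the last observation.
full_index = pd.date_range(df1.index.min(), df1.index.max())

# fill_value=0 fills the newly added dates directly, so no NaN
# appears and no helper DataFrame or extra column is needed.
df3 = df1.reindex(full_index, fill_value=0)
```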