Answers: Subset Pandas DataFrame based on annual returning period covering multiple months

【Question title】: Subset Pandas DataFrame based on annual returning period covering multiple months
【Posted】: 2018-02-10 07:00:33
【Question】:

This question is similar to "Selecting Pandas DataFrame records for many years based on month & day range", but neither that question nor its answers seem to cover my case.

import pandas as pd
import numpy as np

rng = pd.date_range('2010-1-1', periods=1000, freq='D')
df = pd.DataFrame(np.random.randn(len(rng)), index=rng, columns=['A'])
df.head()

                   A
2010-01-01  1.098302
2010-01-02 -1.384821
2010-01-03 -0.426329
2010-01-04 -0.587967
2010-01-05 -0.853374

Now I want to subset my DataFrame based on a period that recurs every year. For example, a period could be defined as running from February 15 to October 3:

startMM, startdd = (2,15)
endMM, enddd = (10,3)

I then tried to slice my multi-year DataFrame based on this period:

subset = df[((df.index.month == startMM) & (startdd <= df.index.day) 
             | (df.index.month == endMM) & (df.index.day <= enddd))]

But this only returns rows from the months given by startMM and endMM, not the whole period between the two dates. Any help would be much appreciated.

subset.index.month.unique()

Int64Index([2, 10], dtype='int64')
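
The mask above only ever tests rows whose month equals startMM or endMM, so the months strictly in between (March to September here) can never match. A minimal sketch of one way to repair it is to add a third clause for those months. It assumes startMM < endMM, that is, a period that stays inside one calendar year and spans at least two different months; the names in_start_month, in_between and in_end_month are only illustrative.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2010-1-1', periods=1000, freq='D')
df = pd.DataFrame(np.random.randn(len(rng)), index=rng, columns=['A'])

startMM, startdd = (2, 15)
endMM, enddd = (10, 3)

month, day = df.index.month, df.index.day

# Days on or after the start day within the start month.
in_start_month = (month == startMM) & (day >= startdd)
# Days on or before the end day within the end month.
in_end_month = (month == endMM) & (day <= enddd)
# Every day of the months strictly between start and end
# (assumption: startMM < endMM, so the period does not wrap over New Year).
in_between = (month > startMM) & (month < endMM)

subset = df[in_start_month | in_between | in_end_month]
```

With the sample data, subset.index.month.unique() should now list every month from 2 to 10.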

【Question comments】:

Tags: python pandas datetime slice

【Solution 1】:

I would build a Series of (month, day) tuples:

month_day = pd.concat([
                df.index.to_series().dt.month, 
                df.index.to_series().dt.day
            ], axis=1).apply(tuple, axis=1)

You can then compare them directly, because tuples compare lexicographically (month first, then day):

df[(month_day >= (startMM, startdd)) & (month_day <= (endMM, enddd))]
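
The same ordering idea can be sketched without building Python tuples, by encoding each date as the integer month * 100 + day (Feb 15 becomes 215, Oct 3 becomes 1003). This is a variation on the answer, not the answerer's own code; the helper name select_annual_period is hypothetical, and the wrap-around branch is an added assumption for periods that cross New Year, which the question does not ask about.

```python
import numpy as np
import pandas as pd


def select_annual_period(frame, start, end):
    """Hypothetical helper: keep rows whose (month, day) lies in [start, end].

    start and end are (month, day) tuples; frame must have a DatetimeIndex.
    """
    # Encode each date as month * 100 + day so that integer order
    # matches calendar order within a year.
    key = frame.index.month * 100 + frame.index.day
    lo = start[0] * 100 + start[1]
    hi = end[0] * 100 + end[1]
    if lo <= hi:
        mask = (key >= lo) & (key <= hi)
    else:
        # Assumption: start later than end means the period wraps
        # over New Year (e.g. Nov 15 to Feb 10).
        mask = (key >= lo) | (key <= hi)
    return frame[mask]


rng = pd.date_range('2010-1-1', periods=1000, freq='D')
df = pd.DataFrame(np.random.randn(len(rng)), index=rng, columns=['A'])

subset = select_annual_period(df, (2, 15), (10, 3))
winter = select_annual_period(df, (11, 15), (2, 10))
```

The integer key stays vectorized, which may matter on large frames where apply(tuple, axis=1) becomes slow.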

【Comments】:

【Solution 2】:

An alternative solution:

In [79]: x = df.assign(x=df.index.strftime('%m-%d')) \
               .query("'02-15' <= x <= '10-03'").drop(columns='x')

In [80]: x
Out[80]:
                   A
2010-02-15 -1.004663
2010-02-16  0.683352
2010-02-17  0.158518
2010-02-18 -0.447414
2010-02-19  0.078998
...              ...
2012-09-22  1.378253
2012-09-23  1.215885
2012-09-24  0.203096
2012-09-25 -1.666974
2012-09-26  0.231987

[687 rows x 1 columns]

In [81]: x.index.month.unique()
Out[81]: Int64Index([2, 3, 4, 5, 6, 7, 8, 9, 10], dtype='int64')
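
This works because zero-padded 'MM-DD' strings sort in calendar order. As a hedged variation on the answer, the same comparison can be sketched as a plain boolean mask, which avoids adding and then dropping a temporary column; the name mmdd is only illustrative.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2010-1-1', periods=1000, freq='D')
df = pd.DataFrame(np.random.randn(len(rng)), index=rng, columns=['A'])

# Zero-padded 'MM-DD' labels, one per row, aligned with df's index.
mmdd = df.index.strftime('%m-%d')

# String comparison is lexicographic, which matches calendar order here.
subset = df[(mmdd >= '02-15') & (mmdd <= '10-03')]
```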

【Comments】: