Slicing a pandas dataframe by custom months and days -- is there a way to avoid for loops?

【Question title】: Slicing pandas dataframe by custom months and days -- is there a way to avoid for loops?
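【Posted】: 2020-06-22 15:09:06
【Question description】:

Problem

Suppose I have a time series dataframe df (a pandas dataframe), and I want to slice out certain days from it. Those days are listed in another dataframe called sample_days: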

>>> df

                          foo       bar
2020-01-01 00:00:00  0.360049  0.897839
2020-01-01 01:00:00  0.285667  0.409544
2020-01-01 02:00:00  0.323871  0.240926
2020-01-01 03:00:00  0.921623  0.766624
2020-01-01 04:00:00  0.087618  0.142409
...                       ...       ...
2020-12-31 19:00:00  0.145111  0.993822
2020-12-31 20:00:00  0.331223  0.021287
2020-12-31 21:00:00  0.531099  0.859035
2020-12-31 22:00:00  0.759594  0.790265
2020-12-31 23:00:00  0.103651  0.074029

[8784 rows x 2 columns]

>>> sample_days

   month  day
0      3   16
1      7   26
2      8   15
3      9   26
4     11   25

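I want to slice df using the days specified in sample_days. I can do this with a for loop (see below). However, is there a way to avoid the for loop, since that would presumably be more efficient? The result should be a dataframe called sample that looks like this: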

>>> sample

                          foo       bar
2020-03-16 00:00:00  0.707276  0.592614
2020-03-16 01:00:00  0.136679  0.357872
2020-03-16 02:00:00  0.612331  0.290126
2020-03-16 03:00:00  0.276389  0.576996
2020-03-16 04:00:00  0.612977  0.781527
...                       ...       ...
2020-11-25 19:00:00  0.904266  0.825501
2020-11-25 20:00:00  0.269589  0.050304
2020-11-25 21:00:00  0.271814  0.418235
2020-11-25 22:00:00  0.595005  0.973198
2020-11-25 23:00:00  0.151149  0.024057

[120 rows x 2 columns]

This is just df sliced on the correct days.

My (slow) solution

I have managed to do this using a for loop and pd.concat:

sample = pd.concat([df.loc[df.index.month.isin([sample_day.month]) &
                           df.index.day.isin([sample_day.day])] 
                    for sample_day in sample_days.itertuples()])

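It works by concatenating multiple days, each sliced using the method indicated here. This gives the desired result, but it is rather slow. For example, using this method to get the first day of each month takes 0.2 seconds on average, whereas simply calling df.loc[df.index.day == 1] (which presumably avoids Python for loops under the hood) is about 300 times faster. However, that slices on the day only, whereas I need to slice on both month and day.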
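To make the loop-based approach reproducible, here is a minimal sketch that builds stand-ins for df and sample_days and runs the same kind of slice. The random values and the seed are assumptions; only the hourly 2020 index and the five sample days come from the question.

```python
import numpy as np
import pandas as pd

# Assumed stand-in for df: hourly random values covering all of 2020.
index = pd.date_range("2020-01-01 00:00", "2020-12-31 23:00",
                      freq=pd.Timedelta(hours=1))
rng = np.random.default_rng(0)  # arbitrary seed, not from the question
df = pd.DataFrame(rng.random((len(index), 2)),
                  index=index, columns=["foo", "bar"])

# The five sample days shown in the question.
sample_days = pd.DataFrame({"month": [3, 7, 8, 9, 11],
                            "day": [16, 26, 15, 26, 25]})

# One boolean mask per sample day, concatenated in sample_days order.
sample = pd.concat(
    [df.loc[(df.index.month == d.month) & (df.index.day == d.day)]
     for d in sample_days.itertuples()]
)

print(df.shape)      # (8784, 2)
print(sample.shape)  # (120, 2)
```

Each iteration scans the full index once, so the cost grows with the number of sample days, which is consistent with the slowdown described above.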

Apologies if this has been answered elsewhere. I have searched for quite a while, but perhaps not with the right keywords.

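【Comments】:

"A way other than a for loop" is not always more efficient. Often the alternative is just a loop in disguise.
@RobertHarvey Point taken. My understanding is that many Python packages still run for loops under the hood, but do so much faster, e.g. by executing the loop in a lower-level language such as C. For instance, on my machine, calling np.sin(np.arange(100)) is about 50 times faster than [np.sin(i) for i in range(100)]. And, in the example I gave, slicing with pandas is about 300 times faster.

Tags: python pandas datetime time-series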

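【Solution 1】:

You can do a string comparison on the month and day at the same time.

You need the space to distinguish, for example, 11 2 from 1 12; without it, both would become the same string.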

df.loc[(df.index.month.astype(str) +' '+ df.index.day.astype(str)).isin(sample_days['month'].astype(str)+' '+sample_days['day'].astype(str))]
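A self-contained sketch of the same string-key idea, spread over several lines. The random df is an assumed stand-in for the question's data, and the names df_key and wanted_key are hypothetical.

```python
import numpy as np
import pandas as pd

# Assumed stand-in data: hourly random values for 2020 and five sample days.
index = pd.date_range("2020-01-01 00:00", "2020-12-31 23:00",
                      freq=pd.Timedelta(hours=1))
df = pd.DataFrame(np.random.default_rng(0).random((len(index), 2)),
                  index=index, columns=["foo", "bar"])
sample_days = pd.DataFrame({"month": [3, 7, 8, 9, 11],
                            "day": [16, 26, 15, 26, 25]})

# Without a separator, different (month, day) pairs can collide.
print("11" + "2" == "1" + "12")    # True
print("11 " + "2" == "1 " + "12")  # False

# Build a "month day" string key for every row and for every wanted day.
df_key = df.index.month.astype(str) + " " + df.index.day.astype(str)
wanted_key = (sample_days["month"].astype(str) + " "
              + sample_days["day"].astype(str))

# A single vectorized membership test replaces the per-day loop.
sample = df.loc[df_key.isin(wanted_key)]
print(sample.shape)  # (120, 2)
```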

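【Comments】:

【Solution 2】:

After taking some inspiration from @Ben Pap's solution (thanks!), I found a solution that is both fast and avoids any "hacks" (such as converting datetimes to strings). It combines the month and day into a MultiIndex, as shown below (you could make it a one-liner, but I have spread it over several lines to make the concept clear).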

full_index = pd.MultiIndex.from_arrays([df.index.month, df.index.day],
                                       names=['month', 'day'])
sample_index = pd.MultiIndex.from_frame(sample_days)
sample = df.loc[full_index.isin(sample_index)]
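As a hedged illustration, the snippet below runs the MultiIndex lookup on assumed stand-in data and also shows one possible one-line form. The random df and the name sample_one_line are not from the answer.

```python
import numpy as np
import pandas as pd

# Assumed stand-in data: hourly random values for 2020 and five sample days.
index = pd.date_range("2020-01-01 00:00", "2020-12-31 23:00",
                      freq=pd.Timedelta(hours=1))
df = pd.DataFrame(np.random.default_rng(0).random((len(index), 2)),
                  index=index, columns=["foo", "bar"])
sample_days = pd.DataFrame({"month": [3, 7, 8, 9, 11],
                            "day": [16, 26, 15, 26, 25]})

# Expanded form: one (month, day) pair per row of df ...
full_index = pd.MultiIndex.from_arrays([df.index.month, df.index.day],
                                       names=["month", "day"])
# ... and one (month, day) pair per wanted day.
sample_index = pd.MultiIndex.from_frame(sample_days)
sample = df.loc[full_index.isin(sample_index)]

# The same lookup collapsed into a single expression.
sample_one_line = df.loc[
    pd.MultiIndex.from_arrays([df.index.month, df.index.day])
    .isin(pd.MultiIndex.from_frame(sample_days))
]

print(sample.shape)                    # (120, 2)
print(sample.equals(sample_one_line))  # True
```

Because the pairs are compared as tuples of integers, no separator trick is needed to keep (11, 2) and (1, 12) apart.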

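If I run this code alongside my original for loop and @Ben Pap's answer, sampling 100 days from a one-year time series for 2020 (8784 hours, including the leap day), I get the following run times:

Original for loop: 0.16s
@Ben Pap's solution, combining month and day into a single string: 0.019s
The MultiIndex solution above: 0.006s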

So I think using a MultiIndex is the way to go.
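A comparison along these lines can be sketched with timeit. This is not the benchmark script used for the numbers above: the random data, the seed, the way the 100 days are drawn, and the function names are all assumptions, and absolute timings will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Assumed stand-in data: hourly random values covering 2020.
index = pd.date_range("2020-01-01 00:00", "2020-12-31 23:00",
                      freq=pd.Timedelta(hours=1))
df = pd.DataFrame(np.random.default_rng(0).random((len(index), 2)),
                  index=index, columns=["foo", "bar"])

# 100 distinct calendar days drawn at random from 2020 (arbitrary seed).
all_days = pd.Series(pd.date_range("2020-01-01", "2020-12-31"))
picked = all_days.sample(100, random_state=0)
sample_days = pd.DataFrame({"month": picked.dt.month.to_numpy(),
                            "day": picked.dt.day.to_numpy()})


def loop_concat(df, sample_days):
    """Slice one day per iteration and concatenate the pieces."""
    return pd.concat(
        [df.loc[(df.index.month == d.month) & (df.index.day == d.day)]
         for d in sample_days.itertuples()]
    )


def string_key(df, sample_days):
    """Compare 'month day' strings in one vectorized isin call."""
    df_key = df.index.month.astype(str) + " " + df.index.day.astype(str)
    wanted = (sample_days["month"].astype(str) + " "
              + sample_days["day"].astype(str))
    return df.loc[df_key.isin(wanted)]


def multi_index(df, sample_days):
    """Compare (month, day) tuples through a MultiIndex."""
    full_index = pd.MultiIndex.from_arrays([df.index.month, df.index.day],
                                           names=["month", "day"])
    return df.loc[full_index.isin(pd.MultiIndex.from_frame(sample_days))]


# Average wall-clock time per call over a few repetitions.
results = {}
for func in (loop_concat, string_key, multi_index):
    seconds = timeit.timeit(lambda: func(df, sample_days), number=5) / 5
    results[func.__name__] = seconds
    print(f"{func.__name__}: {seconds:.4f} s per call")
```

The loop version returns rows in sample_days order rather than chronological order, so sort its index before comparing it with the other two results.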

【Comments】: