How to subset pandas time series by time of day

【Question title】: How to subset pandas time series by time of day
【Posted】: 2014-02-16 04:37:01
【Question description】:

I am trying to subset a pandas time series that spans multiple days by time of day. For example, I only want the times between 12:00 and 13:00.

I know how to do this for a specific date, e.g.,

In [44]: type(test)
Out[44]: pandas.core.frame.DataFrame

In [23]: test
Out[23]:
                           col1
timestamp
2012-01-14 11:59:56+00:00     3
2012-01-14 11:59:57+00:00     3
2012-01-14 11:59:58+00:00     3
2012-01-14 11:59:59+00:00     3
2012-01-14 12:00:00+00:00     3
2012-01-14 12:00:01+00:00     3
2012-01-14 12:00:02+00:00     3

In [30]: test['2012-01-14 12:00:00' : '2012-01-14 13:00']
Out[30]:
                           col1
timestamp 
2012-01-14 12:00:00+00:00     3
2012-01-14 12:00:01+00:00     3
2012-01-14 12:00:02+00:00     3

But I have not managed to do this for any date using test.index.hour or test.index.indexer_between_time(), both of which were suggested as answers to similar questions. I tried the following:

In [44]: type(test)
Out[44]: pandas.core.frame.DataFrame

In [34]: test[(test.index.hour >= 12) & (test.index.hour < 13)]
Out[34]:
Empty DataFrame
Columns: [col1]
Index: []

In [36]: import datetime as dt
In [37]: test.index.indexer_between_time(dt.time(12),dt.time(13))
Out[37]: array([], dtype=int64)

For the first approach, I have no idea what test.index.hour or test.index.minute are actually returning:

In [41]: test.index
Out[41]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2012-01-14 11:59:56, ..., 2012-01-14 12:00:02]
Length: 7, Freq: None, Timezone: tzlocal()

In [42]: test.index.hour
Out[42]: array([11, 23,  0,  0,  0,  0,  0], dtype=int32)

In [43]: test.index.minute
Out[43]: array([59, 50,  0,  0, 50, 50,  0], dtype=int32)

What are they returning? How can I do the desired subsetting? Ideally, how can I get both of the approaches above to work?

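Edit: The problem turned out to be an invalid index, as shown by the Timezone: tzlocal() above, since tzlocal() should not be allowed as a timezone. When I changed the way the index is generated to pd.to_datetime(), as described in the last part of the accepted answer, everything worked as expected.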
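The fix described in this edit can be sketched roughly as follows. This is only an illustration under assumptions: the sample rows and the explicit format string are made up for the example (the thread only mentions strings like Sat Jan 14 11:01:38 GMT 2012), and the format treats "GMT" as fixed text rather than parsing it as a zone name.

```python
import pandas as pd

# Hypothetical raw data using the string layout mentioned in the comments.
raw = pd.DataFrame(
    {
        "timestamp": [
            "Sat Jan 14 11:59:59 GMT 2012",
            "Sat Jan 14 12:00:00 GMT 2012",
            "Sun Jan 15 12:30:00 GMT 2012",
            "Sun Jan 15 13:00:01 GMT 2012",
        ],
        "col1": [3, 3, 3, 3],
    }
)

# Assumptions: every string carries the literal text "GMT" and uses English
# day/month names. "GMT" is matched as fixed text, and utc=True localizes
# the parsed values to UTC.
df = raw[["col1"]].copy()
df.index = pd.to_datetime(
    raw["timestamp"], format="%a %b %d %H:%M:%S GMT %Y", utc=True
)

# With a proper DatetimeIndex, the hour accessor reflects each row's hour.
hours = df.index.hour.tolist()

# Rows between 12:00 and 13:00 on any day.
noon = df.between_time("12:00", "13:00")
print(hours)
print(noon)
```

With an index built this way, both the hour-based mask and between_time see the real hour of each row, which is what the question was missing.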

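【Question comments】:

Related: stackoverflow.com/questions/21512042/…?
Your timezone is all loco; tzlocal() is not a timezone. How did you construct this index?
@Jeff: Thanks. I just noticed that the only obvious differences between my index and a working example are freq and Timezone. My index was created with df.index = df['timestamp'].apply(dateutil.parser.parse) on strings like Sat Jan 14 11:01:38 GMT 2012. I guess that doesn't work. Is the fix obvious?
Try df.index = pd.to_datetime(df['timestamp'])
@David: Thanks, that works. Without an explanation of my mistake, my question seems a bit useless. If you want to post that comment as an answer (for example at the start of your existing answer), I would be happy to accept it.

Tags: python pandas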

【Solution 1】:

Assuming the index holds valid pandas timestamps, the following will work:

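test.index.hour returns an array containing the hour of each row in the DataFrame. The other accessors, such as year, work the same way. For example, with

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100000, 1), columns=['A'], index=pd.date_range('20130101', periods=100000, freq='min'))

df.index.year returns array([2013, 2013, 2013, ..., 2013, 2013, 2013])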

To get all rows whose time is between 12:00 and 13:00, use

df.between_time('12:00','13:00')

This selects that time range on every day, across days, years, and so on. If the index does not hold valid timestamps, use pd.to_datetime() to convert it first.
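As a rough illustration of the approaches discussed in this thread, the sketch below applies between_time, an hour-based boolean mask, and indexer_between_time to the same multi-day frame. The data is hypothetical (one row per minute for three days, UTC index) and is not taken from the question.

```python
import datetime as dt

import numpy as np
import pandas as pd

# Hypothetical data: one reading per minute over three days, UTC index.
idx = pd.date_range("2012-01-14", periods=3 * 24 * 60, freq="min", tz="UTC")
df = pd.DataFrame({"col1": np.arange(len(idx))}, index=idx)

# 1) between_time: both endpoints are included by default.
by_between = df.between_time("12:00", "13:00")

# 2) Boolean mask on the hour: keeps 12:00 through 12:59, drops 13:00.
by_hour = df[(df.index.hour >= 12) & (df.index.hour < 13)]

# 3) indexer_between_time returns integer positions, so pair it with iloc.
pos = df.index.indexer_between_time(dt.time(12), dt.time(13))
by_indexer = df.iloc[pos]

print(len(by_between), len(by_hour), len(by_indexer))
```

The results differ only at the boundary: between_time and indexer_between_time keep the row at exactly 13:00:00 on each day, while the hour mask excludes it.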

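【Comments】:

That was my understanding before I tried it. between_time('12:00','13:00') returns an empty DataFrame, and test.index.year returns array([2012, 2189, 1970, 1970, 1970, 1970, 2189], dtype=int32) rather than array([2012, 2012, 2012, 2012, 2012, 2012, 2012]). The example in the question shows that test.index.hour and test.index.minute do not return the hour and minute of each row. Essentially, my question is: why does the behavior differ from what you (and the documentation) claim? Clearly I am missing something.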