python pandas .filter() method using a boolean mask: answers

【Question title】: python pandas .filter() method using boolean mask
【Posted】: 2015-06-23 12:40:42
【Question description】:

I have a dataframe (z) that looks like this:

timestamp                   source  price
2004-01-05 14:55:09+00:00   Bank1   420.975
2004-01-05 14:55:10+00:00   Bank2   421.0
2004-01-05 14:55:22+00:00   Bank1   421.075
2004-01-05 14:55:34+00:00   Bank1   420.975
2004-01-05 14:55:39+00:00   Bank1   421.175
2004-01-05 14:55:45+00:00   Bank1   421.075
2004-01-05 14:55:52+00:00   Bank1   421.175
2004-01-05 14:56:12+00:00   Bank2   421.1
2004-01-05 14:56:33+00:00   Bank1   421.275

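Sometimes, within certain time windows, Bank2 submits only one quote. I need to throw out every such day, because I need two or more quotes from a bank. If Bank2 appears once or not at all on a given day, that day is discarded.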

I did this by creating a boolean mask, which I plan to use to select all the days that satisfy the criteria:

r = z.groupby([z.index.date, z['source']]).size() > 1 
    # return a boolean for each day/source: True if it appears more than once
r = r.groupby(level=0).all() == True 
    # ie. if the datetime 0th-level index contains all True, return True, otherwise False (meaning one source failed the criteria)

This produces:

2004-01-05  True
2004-01-06  True
2004-01-07  True
2004-01-08  False
2004-01-09  True

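Perfect. Now I just need to filter the original dataframe z with it while keeping the original structure (i.e. second-level frequency, not day by day). That means using the df.filter() method.

My original dataframe has the same structure (and their .shape[0] values are the same):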

2004-01-05  94
2004-01-06  24
2004-01-07  62
2004-01-08  30
2004-01-09  36

Great.

This is where I get confused. I run:

t = y.groupby(y.index.date).filter(lambda x: [x for x in r])

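and receive TypeError: filter function returned a list, but expected a scalar bool.

Basically, I need the lambda function to simply return each x (a boolean) in r.

I worked around this with a very convoluted approach: instead of storing everything I had worked out earlier in the variable r, I made it part of the lambda function.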
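The error arises because filter() calls the function once per group and expects a single True or False for that group. A minimal sketch of one way to satisfy that, assuming the mask r is indexed by the same dates used as group keys: look up the group's own key in r. The sample data here is hypothetical, and the frame is called z throughout.

```python
import pandas as pd

# Hypothetical sample: on 2004-01-08, Bank2 submits only one quote.
idx = pd.to_datetime([
    "2004-01-05 14:55:09", "2004-01-05 14:55:10",
    "2004-01-05 14:55:22", "2004-01-05 14:56:12",
    "2004-01-08 14:55:09", "2004-01-08 14:55:10",
    "2004-01-08 14:55:22",
])
z = pd.DataFrame(
    {"source": ["Bank1", "Bank2", "Bank1", "Bank2", "Bank1", "Bank2", "Bank1"],
     "price": [420.975, 421.0, 421.075, 421.1, 420.975, 421.0, 421.075]},
    index=idx,
)

# Same mask as in the question: True for days on which every source has > 1 quote.
r = (z.groupby([z.index.date, z["source"]]).size() > 1).groupby(level=0).all()

# Each group passed to the function carries its key in g.name (here a date),
# so r.loc[g.name] is one bool per group instead of a list.
t = z.groupby(z.index.date).filter(lambda g: bool(r.loc[g.name]))
print(t)
```

The rows of t keep their original per-second timestamps; only the days whose mask value is False are dropped.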

t = y.groupby(y.index.date).filter(lambda x: (x.groupby([x.index.date, x['source']]).size() > 1).groupby(level=0).all() == True) # ie. the datetime 0th-level index

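This is too messy. There must be a basic way to say: here is my dataframe z, then groupby(z.index.date), then .filter() based on the boolean mask r.

Edit: this is what I found in the pandas tutorial, but for some reason the .between_time() part does not work. It filters on the length

t = y.groupby([y.index.date, y['source']]).filter(lambda x: len(x.between_time('14:00','15:00')) > 1)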

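【Question discussion】:

Tags: python pandas filter time-series

【Solution 1】:

The original approach you proposed is correct, although you have to use transform rather than apply on the groups (date and source). transform returns the group information with the same structure as the original dataframe.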

grp = z.groupby([z.index.date, z.source])
counts = grp['price'].transform('count')  # number of records in each group, as a Series indexed like z

filtered_z = z[counts > 1]  # final filtering: keep rows whose (date, source) group has more than one record
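As a runnable sketch of the transform idea, assuming hypothetical sample data: the per-row counts can be used in two ways. Filtering on the counts directly drops only the rows of an under-represented (date, source) group, while broadcasting the smallest count of each day back to its rows drops the whole day, which is what the question asks for. The names row_level, day_min and day_level are illustrative.

```python
import pandas as pd

# Hypothetical sample: on 2004-01-08, Bank2 submits only one quote.
idx = pd.to_datetime([
    "2004-01-05 14:55:09", "2004-01-05 14:55:10",
    "2004-01-05 14:55:22", "2004-01-05 14:56:12",
    "2004-01-08 14:55:09", "2004-01-08 14:55:10",
    "2004-01-08 14:55:22",
])
z = pd.DataFrame(
    {"source": ["Bank1", "Bank2", "Bank1", "Bank2", "Bank1", "Bank2", "Bank1"],
     "price": [420.975, 421.0, 421.075, 421.1, 420.975, 421.0, 421.075]},
    index=idx,
)

# Size of each (date, source) group, repeated on every row of that group.
counts = z.groupby([z.index.date, z["source"]])["price"].transform("count")

# Row-level filter: removes only the single Bank2 quote on 2004-01-08.
row_level = z[counts > 1]

# Day-level filter: a day survives only if its smallest per-source count exceeds 1.
day_min = counts.groupby(z.index.date).transform("min")
day_level = z[day_min > 1]
```

Note that neither variant detects a source that is entirely absent on a given day, since an absent source forms no group at all.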

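【Discussion】:

I'm not sure how to use 'count' on a boolean Series like r. I only need z to reference the dates given in r.

【Solution 2】:

I think I figured this out using the dates:

Create a new column in dataframe z that holds only the date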

z['date'] = z.index.date

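Then keep only the rows whose date is marked True in the boolean Series r (r.index alone contains every date, including the False ones, so select the True entries first)

z[z['date'].isin(r[r].index)]

【Discussion】: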
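A self-contained sketch of the isin idea, assuming hypothetical sample data and that only the days marked True in r should survive. It builds the per-row dates as a temporary Series instead of adding a column to z; good_days and on_good_day are illustrative names.

```python
import pandas as pd

# Hypothetical sample: on 2004-01-08, Bank2 submits only one quote.
idx = pd.to_datetime([
    "2004-01-05 14:55:09", "2004-01-05 14:55:10",
    "2004-01-05 14:55:22", "2004-01-05 14:56:12",
    "2004-01-08 14:55:09", "2004-01-08 14:55:10",
    "2004-01-08 14:55:22",
])
z = pd.DataFrame(
    {"source": ["Bank1", "Bank2", "Bank1", "Bank2", "Bank1", "Bank2", "Bank1"],
     "price": [420.975, 421.0, 421.075, 421.1, 420.975, 421.0, 421.075]},
    index=idx,
)

# Mask from the question: one bool per day.
r = (z.groupby([z.index.date, z["source"]]).size() > 1).groupby(level=0).all()

# r[r] keeps only the True entries, so its index lists the days to retain.
good_days = r[r].index

# Per-row date, aligned with z, tested for membership in the retained days.
on_good_day = pd.Series(z.index.date, index=z.index).isin(good_days)
filtered = z[on_good_day]
```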