Consecutive NaN larger than threshold in Pandas DataFrame: answers

【Question title】: Consecutive NaN larger than threshold in Pandas DataFrame
【Posted】: 2019-09-10 06:42:40
【Question description】:

I would like to find the indexes of consecutive NaN runs in a Pandas DataFrame and, for runs of more than 3 consecutive NaNs, return their size. That is:

58234         NaN
58235         NaN
58236    0.424323
58237    0.424323
58238         NaN
58239         NaN
58240         NaN
58241         NaN
58242         NaN
58245         NaN
58246    1.483380
58247    1.483380

This should return something like (58238, 6). The actual format of the return value does not matter much. I found the following:

df.a.isnull().astype(int).groupby(df.a.notnull().astype(int).cumsum()).sum()

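But it does not return the correct value for each index. This question may be very similar to "Identifying consecutive NaN's with pandas", but any help would be much appreciated, as I am a total Pandas noob.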
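To make the problem concrete, here is a minimal sketch that rebuilds the sample as a DataFrame with a column `a` (the construction and the names `labels` and `sizes` are assumptions for illustration) and runs the one-liner on it. The run sizes come out right, but they are keyed by the cumulative-count label rather than by the original index, and the non-NaN rows show up as zero-sized groups.

```python
import numpy as np
import pandas as pd

# Sample data from the question; the column name `a` follows the snippet above.
index = [58234, 58235, 58236, 58237, 58238, 58239,
         58240, 58241, 58242, 58245, 58246, 58247]
values = [np.nan, np.nan, 0.424323, 0.424323, np.nan, np.nan,
          np.nan, np.nan, np.nan, np.nan, 1.483380, 1.483380]
df = pd.DataFrame({"a": values}, index=index)

# The counter goes up at every non-NaN value, so each NaN run shares one label.
labels = df.a.notnull().astype(int).cumsum()
sizes = df.a.isnull().astype(int).groupby(labels).sum()

print(sizes.index.tolist())  # group labels: [0, 1, 2, 3, 4]
print(sizes.tolist())        # NaN count per label: [2, 0, 6, 0, 0]
```

The 6 is there, but nothing in the result says that the run starts at 58238.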

【Question comments】:

Tags: python pandas dataframe

【Solution 1】:

I broke it down into steps:

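# Label the runs: the counter increases at every non-NaN value, so all NaNs of one run share a label
df['Group']=df.a.notnull().astype(int).cumsum()
# Keep only the NaN rows
df=df[df.a.isnull()]
# Keep only the groups with more than 3 NaNs (.copy() avoids SettingWithCopyWarning on the next assignment)
df=df[df.Group.isin(df.Group.value_counts()[df.Group.value_counts()>3].index)].copy()
df['count']=df.groupby('Group')['Group'].transform('size')
# One row per run: the first NaN of each group
df.drop_duplicates(['Group'],keep='first')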
Out[734]: 
        a  Group  count
ID                     
58238 NaN      2      6
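
The same cumsum-labelling idea can be wrapped into a helper that does not modify the DataFrame. This is only a sketch: the function name `nan_runs`, its parameter `min_len`, and the Series `s` are made up for illustration and are not part of the answer above.

```python
import numpy as np
import pandas as pd

def nan_runs(series, min_len):
    """Sketch: (first index label, length) of each NaN run with length >= min_len."""
    # The counter increases at every non-NaN value, so a NaN run shares one label.
    labels = series.notna().cumsum()
    nan_labels = labels[series.isna()]
    # Group the index labels of the NaN rows by their run label.
    idx = nan_labels.index.to_series()
    summary = idx.groupby(nan_labels.to_numpy()).agg(["first", "count"])
    summary = summary[summary["count"] >= min_len]
    return list(zip(summary["first"].tolist(), summary["count"].tolist()))

# Sample data from the question.
index = [58234, 58235, 58236, 58237, 58238, 58239,
         58240, 58241, 58242, 58245, 58246, 58247]
values = [np.nan, np.nan, 0.424323, 0.424323, np.nan, np.nan,
          np.nan, np.nan, np.nan, np.nan, 1.483380, 1.483380]
s = pd.Series(values, index=index)

print(nan_runs(s, 4))  # [(58238, 6)]
print(nan_runs(s, 2))  # [(58234, 2), (58238, 6)]
```

Note that `min_len=4` corresponds to "more than 3" in the question.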

【Comments】:

【Solution 2】:

Assuming df has those as two columns named A and B, here is one vectorized approach:

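import numpy as np

thresh = 3

a = df.A.values
b = df.B.values

# Run boundaries: positions where the NaN mask changes, plus both ends
idx0 = np.flatnonzero(np.r_[True, np.diff(np.isnan(b))!=0,True])
count = np.diff(idx0)  # length of each run
idx = idx0[:-1]        # start position of each run
# Keep the runs that are long enough and that consist of NaNs
valid_mask = (count>=thresh) & np.isnan(b[idx])
out_idx = idx[valid_mask]
out_num = a[out_idx]
out_count = count[valid_mask]
out = list(zip(out_num, out_count))  # list() so that Python 3 shows the pairs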

Sample input and output:

In [285]: df
Out[285]: 
        A         B
0   58234       NaN
1   58235       NaN
2   58236  0.424323
3   58237  0.424323
4   58238       NaN
5   58239       NaN
6   58240       NaN
7   58241       NaN
8   58242       NaN
9   58245       NaN
10  58246  1.483380
11  58247  1.483380

In [286]: out
Out[286]: [(58238, 6)]

With thresh = 2, we get:

In [288]: out
Out[288]: [(58234, 2), (58238, 6)]
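
To see what the intermediate arrays hold, here is a small trace of the same boundary trick on a hypothetical 8-element array. The array `b` and the names `change`, `starts`, and `keep` are made up for this sketch, and the mask comparison is written out explicitly instead of using `np.diff` on the boolean mask.

```python
import numpy as np

# Hypothetical data: a run of 2 NaNs, 2 values, a run of 3 NaNs, 1 value.
b = np.array([np.nan, np.nan, 1.0, 1.0, np.nan, np.nan, np.nan, 2.0])
thresh = 3

isnan = np.isnan(b)
# True wherever a new run starts, plus a sentinel marking the end of the array.
change = np.r_[True, isnan[1:] != isnan[:-1], True]
idx0 = np.flatnonzero(change)  # run boundaries: [0, 2, 4, 7, 8]
count = np.diff(idx0)          # run lengths:    [2, 2, 3, 1]
starts = idx0[:-1]             # run starts:     [0, 2, 4, 7]

# A run qualifies if it is long enough and its first element is NaN.
keep = (count >= thresh) & isnan[starts]
print(list(zip(starts[keep].tolist(), count[keep].tolist())))  # [(4, 3)]
```

Every run, NaN or not, gets a start and a length; the final mask is what restricts the result to NaN runs.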

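【Comments】:

【Solution 3】:

So this will be a bit slow, but I am also a newbie learning pandas and Python. It is super ugly, but without knowing your dataset, this is how I would do it.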

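import pandas as pd

current_consec = 0
threeormore = 0

# `dataset` is your DataFrame; replace 'a' with whatever column you need
for i in dataset['a']:
    if pd.isnull(i):
        if current_consec == 3:
            # three NaNs already seen, so this is the fourth in a row
            current_consec = 0
            threeormore += 1
        else:
            current_consec += 1
    else:
        current_consec = 0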

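Because it walks through the index in numerical order, it will find every run that occurs in sequence. The only catch: if you do not want it to count again each time another batch completes within the same run (say you see 6 in a row), you will have to modify the code a little so that it does not reset current_consec to 0, and add a pass statement instead.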
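
One possible reading of that modification, sketched below, is to count each run exactly once and to remember where it started. The function name `nan_runs_loop`, the parameter `min_len`, and the sample Series `s` are assumptions for illustration; this is still a plain Python loop, so it will be slow on large frames.

```python
import numpy as np
import pandas as pd

def nan_runs_loop(series, min_len):
    """Sketch: list of (start label, length) for NaN runs of at least min_len."""
    runs = []
    start, length = None, 0
    for label, value in series.items():
        if pd.isnull(value):
            if length == 0:
                start = label  # first NaN of a new run
            length += 1
        else:
            if length >= min_len:
                runs.append((start, length))
            length = 0
    # A run that reaches the end of the series is never closed by a value.
    if length >= min_len:
        runs.append((start, length))
    return runs

# Hypothetical data: runs of 2, 6 and 4 NaNs.
s = pd.Series([np.nan] * 2 + [1.0] + [np.nan] * 6 + [2.0] + [np.nan] * 4)
print(nan_runs_loop(s, 4))  # [(3, 6), (10, 4)]
```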

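Sorry that this is a newbie answer, but it might work. If you find something faster, please let me know, as I would love to add it to my knowledge base.

Good luck,

Andy M

【Comments】:

Thanks Andy, but this is indeed too slow, since my dataframe has more than 500,000 rows.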

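【Solution 4】:

Unfortunately, groupby does not work with NaN values (by default it drops NaN keys), so here is a somewhat dirty way to do what you want (dirty in the sense that I create a fake column >_>).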

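By the way, itertools.groupby works by grouping consecutive items that have the same key-function value. enumerate yields an index and a value from nanindices (e.g., if nanindices is [0,1,4,5,6], enumerate yields [(0,0), (1,1), (2,4), (3,5), (4,6)]). The key function is the index minus the value. Note that when the value and the index both go up by 1 (i.e., the values are consecutive), this difference stays the same. So this groups consecutive numbers together.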
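
As a quick illustration of that key function, here is a standalone sketch using the list from the explanation above (the names `runs`, `key`, and `group` are arbitrary):

```python
from itertools import groupby

nanindices = [0, 1, 4, 5, 6]

runs = []
# pair is (position in the list, value); the difference is constant within a run.
for key, group in groupby(enumerate(nanindices), key=lambda pair: pair[0] - pair[1]):
    runs.append((key, [value for _, value in group]))

print(runs)  # [(0, [0, 1]), (-2, [4, 5, 6])]
```

The key itself (0, then -2) carries no meaning; it only has to change whenever the values stop being consecutive.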

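itemgetter(n) is just a callable that you can apply to an item to get its n-th element via its __getitem__ method. I map it over the result of groupby only because you cannot call len directly on the iterable g that it returns. If you do not need the actual consecutive values, you can simply convert g to a list and call len on it.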
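
A small sketch of what itemgetter does here, using made-up (index, value) pairs like the ones enumerate produces. In Python 3, map returns a lazy iterator, so it has to be wrapped in list() before len can be used.

```python
from operator import itemgetter

# Hypothetical (index, value) pairs, as one group from groupby would yield them.
pairs = [(2, 4), (3, 5), (4, 6)]

second = itemgetter(1)        # callable equivalent to: lambda item: item[1]
print(second(pairs[0]))       # 4

consec = list(map(second, pairs))  # materialize the lazy map object
print(consec, len(consec))    # [4, 5, 6] 3
```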

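import itertools
from operator import itemgetter

import numpy as np
import pandas as pd

locations = []
df = pd.DataFrame([np.nan]*2+[5]*3+[np.nan]*3+[4]*3+[3]*2+[np.nan]*4, columns=['A'])
# Fake column: replace NaN with -1 so that groupby can use it as a key
# (this assumes -1 never occurs as a real value in A)
df['B'] = df['A'].fillna(-1)
# Positions of all NaN rows, taken from the group whose key is -1
nanindices = df.reset_index().groupby('B')['index'].apply(np.array).loc[-1]
# index minus value is constant within a run of consecutive positions
for k, g in itertools.groupby(enumerate(nanindices), lambda t: t[0] - t[1]):
    consec = list(map(itemgetter(1), g))
    num_consec = len(consec)
    if num_consec >= 3:
        locations.append((int(consec[0]), num_consec))

print(locations)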

For the example DataFrame I used, the sample data looks like this:

     A
0   NaN
1   NaN
2   5.0
3   5.0
4   5.0
5   NaN
6   NaN
7   NaN
8   4.0
9   4.0
10  4.0
11  3.0
12  3.0
13  NaN
14  NaN
15  NaN
16  NaN

The program then prints:

[(5, 3), (13, 4)]

【Comments】: