How can I filter the months with less than 15 entries from a pandas df? Answers

【Question title】: How can I filter the months with less than 15 entries from a pandas df?
【Posted】: 2019-04-27 00:07:07
【Question description】:

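I have a multi-index DataFrame organized by year, month, and day, covering 1960 to 2017, and I want to be able to check whether a month contains more than 15 NaNs.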

Can someone help me figure out how to do this efficiently?

Thank you in advance. Data frame:

                       A         B   C         D   E         F   G         H
Year Month Day
1960 6     1    0.053142  0.632151 NaN -0.740130 NaN -1.273792 NaN -0.287078
           2    0.827514 -0.487477 NaN -0.246897 NaN -0.310194 NaN  2.150300
           3   -1.403216  0.350322 NaN  2.134335 NaN  0.023102 NaN  0.343759
           4    0.305884  0.663174 NaN -2.073908 NaN  0.400311 NaN  0.149292
           5    0.720521 -2.081981 NaN  0.672169 NaN -0.172794 NaN -0.549559
           6   -0.987216 -1.190550 NaN  0.318706 NaN  0.863885 NaN -0.995961
           7    1.781080  0.636422 NaN -0.382552 NaN -0.109566 NaN  0.410586
           8   -0.654413 -0.094920 NaN -1.763118 NaN  0.075046 NaN -1.130280
           9   -0.634353 -1.514066 NaN -0.003556 NaN -1.560351 NaN  1.001637
           10  -1.742696  1.173806 NaN  0.909725 NaN -1.428291 NaN -1.369954

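【Comments】:

Please post the DF in a code block rather than as an image... an image makes it very hard for anyone here to help you... Also, are you looking for months with 15 NaNs across all entries, or only in certain columns, etc.?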

Tags: python pandas filter timestamp conditional

【Solution 1】:

Something like this might work. Here is an example df:

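# create a test dataframe similar to yours
import numpy as np
import pandas as pd

# first month (June 1960): columns C, E and G are entirely NaN
df = pd.DataFrame(np.random.randn(10,8), columns=list('ABCDEFGH'))
df[['C', 'E', 'G']] = np.nan
df['Year'] = 1960
df['Month'] = 6
df['Day'] = range(1,11)

# second month (July 1960): only column B is NaN
df2 = pd.DataFrame(np.random.randn(10,8), columns=list('ABCDEFGH'))
df2[['B']] = np.nan
df2['Year'] = 1960
df2['Month'] = 7
df2['Day'] = range(1,11)

# combine both months and build the (Year, Month, Day) multi-index
new_df = pd.concat([df,df2])
new_df.set_index(['Year', 'Month', 'Day'], inplace=True)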

Then you can do this:

# flag every NaN, stack the columns into the index, then group to count the True values per group
# this groups on year and month (levels 0 and 1); change the level(s) to group differently
stackdf = pd.isna(new_df).stack().groupby(level=[0,1]).transform('sum')

# filter the original df to the rows whose index appears in the stacked df
# where the per-group NaN count is greater than 15
new_df[new_df.index.isin(stackdf[stackdf>15].unstack().index)]

                       A         B   C         D   E         F   G         H
Year Month Day
1960 6     1    0.053142  0.632151 NaN -0.740130 NaN -1.273792 NaN -0.287078
           2    0.827514 -0.487477 NaN -0.246897 NaN -0.310194 NaN  2.150300
           3   -1.403216  0.350322 NaN  2.134335 NaN  0.023102 NaN  0.343759
           4    0.305884  0.663174 NaN -2.073908 NaN  0.400311 NaN  0.149292
           5    0.720521 -2.081981 NaN  0.672169 NaN -0.172794 NaN -0.549559
           6   -0.987216 -1.190550 NaN  0.318706 NaN  0.863885 NaN -0.995961
           7    1.781080  0.636422 NaN -0.382552 NaN -0.109566 NaN  0.410586
           8   -0.654413 -0.094920 NaN -1.763118 NaN  0.075046 NaN -1.130280
           9   -0.634353 -1.514066 NaN -0.003556 NaN -1.560351 NaN  1.001637
           10  -1.742696  1.173806 NaN  0.909725 NaN -1.428291 NaN -1.369954
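
As a possible alternative to the stack/isin route, the sketch below counts NaNs per row, sums those counts within each (Year, Month) group with transform, and uses the result directly as a boolean mask. The helper name build_example and the constant placeholder values are assumptions made for illustration; only the column layout mirrors the test frame above.

```python
import numpy as np
import pandas as pd

def build_example():
    """Hypothetical data: June has 3 all-NaN columns, July has 1 (10 days each)."""
    frames = []
    for month, nan_cols in [(6, ['C', 'E', 'G']), (7, ['B'])]:
        part = pd.DataFrame(np.ones((10, 8)), columns=list('ABCDEFGH'))
        part[nan_cols] = np.nan
        part['Year'] = 1960
        part['Month'] = month
        part['Day'] = range(1, 11)
        frames.append(part)
    return pd.concat(frames).set_index(['Year', 'Month', 'Day'])

new_df = build_example()

# NaNs per row, summed within each (Year, Month) and broadcast back to every row
nan_per_month = (
    new_df.isna().sum(axis=1)
    .groupby(level=['Year', 'Month'])
    .transform('sum')
)

# Rows belonging to months with more than 15 NaNs in total
many_nans = new_df[nan_per_month > 15]
```

Because nan_per_month shares its index with new_df, no unstack or isin step is needed. With this placeholder data, June holds 30 NaNs and July holds 10, so only the June rows are selected.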

You can also view the months with fewer than 15 NaNs with new_df[new_df.index.isin(stackdf[stackdf<15].unstack().index)]:

                       A   B         C         D         E         F         G         H
Year Month Day
1960 7     1    0.994542 NaN  0.488464  0.809915  0.144305 -1.092597  0.555626  0.012135
           2   -0.682796 NaN -0.781031 -0.847972  0.238397  0.364584 -0.271764  0.930113
           3    0.254320 NaN -0.474764  0.154370 -1.497867 -1.454383  0.191503  0.494441
           4    0.994579 NaN  0.362073 -0.537878 -0.512388 -0.501573  0.315398  1.377701
           5    0.623287 NaN  1.286725 -0.770290 -0.614005  0.552683  0.225974 -0.564017
           6   -0.252969 NaN -1.127418 -0.357725 -1.069318  0.218666  1.296458 -0.319678
           7    0.202788 NaN  0.385931 -0.169915  0.167754  0.821923  0.181937 -0.198668
           8   -0.272891 NaN  0.963414  0.887208 -1.903742 -2.026687  0.897575  1.148448
           9    1.398781 NaN -0.298804 -1.081953 -1.346193  0.926548  0.147855 -1.632059
           10   0.489751 NaN  0.433767  0.752071 -0.714030 -1.776365  0.247908  0.919387
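
One thing to watch: the two filters above use > 15 and < 15, so a month with exactly 15 NaNs falls into neither result. The sketch below illustrates this on a small hypothetical frame (the data, the variable names, and the row-wise counting are assumptions for illustration, not the answer's code) and shows negating a single mask to get an exact complement.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 2 months x 5 days, 4 columns (values are placeholders)
index = pd.MultiIndex.from_product(
    [[1960], [1, 2], range(1, 6)], names=['Year', 'Month', 'Day']
)
df = pd.DataFrame(np.ones((10, 4)), index=index, columns=list('ABCD'))
month = df.index.get_level_values('Month')
df.loc[month == 1, ['A', 'B', 'C']] = np.nan  # 5 days x 3 columns = exactly 15 NaNs
df.loc[month == 2, ['A']] = np.nan            # 5 NaNs

# Total NaNs per (Year, Month), broadcast back to every row of that month
nan_per_month = (
    df.isna().sum(axis=1)
    .groupby(level=['Year', 'Month'])
    .transform('sum')
)

over = nan_per_month > 15
more_than_15 = df[over]                 # empty here: 15 is not > 15
fewer_than_15 = df[nan_per_month < 15]  # month 2 only; month 1 is skipped
at_most_15 = df[~over]                  # exact complement: months 1 and 2
```

Whether the boundary month belongs with the "many NaNs" or the "few NaNs" group depends on the intended rule; using >= or <= on one side, or ~over as here, keeps the two results from silently dropping it.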

Because I am using stack, this counts all NaN values in a group, not just those in one specific column.
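
If the count should instead be taken per column, one option is to skip stack and sum the NaN flags within each (Year, Month) group, which yields one count per column per month. The following is a minimal sketch under assumed data; the frame, the choice of column C, and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 2 months x 20 days, 3 columns (values are placeholders)
index = pd.MultiIndex.from_product(
    [[1960], [6, 7], range(1, 21)], names=['Year', 'Month', 'Day']
)
df = pd.DataFrame(np.ones((40, 3)), index=index, columns=list('ABC'))
month = df.index.get_level_values('Month')
day = df.index.get_level_values('Day')
df.loc[month == 6, 'C'] = np.nan                # 20 NaNs in C for June
df.loc[(month == 7) & (day <= 5), 'B'] = np.nan  # 5 NaNs in B for July

# NaN count per column for every (Year, Month)
nan_counts = df.isna().groupby(level=['Year', 'Month']).sum()

# Months in which column C alone has more than 15 NaNs
bad_months = nan_counts[nan_counts['C'] > 15].index

# Rows of the original frame that fall in those months
row_months = df.index.droplevel('Day')
rows_in_bad_months = df[row_months.isin(bad_months)]
```

Here nan_counts has one row per month and one column per original column, so the threshold can be applied to a single column, to any column with (nan_counts > 15).any(axis=1), or to the row total with nan_counts.sum(axis=1).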

【Comments】: