Filtering specific data points from a dataframe based on given conditions: answers

【Question title】: Filtering specific data points from a dataframe based on given conditions
【Posted】: 2019-08-23 08:34:08
【Question description】:

I have a dataframe as shown below

+----------+-------+-------+-------+-------+-------+
|   Date   | Loc 1 | Loc 2 | Loc 3 | Loc 4 | Loc 5 |
+----------+-------+-------+-------+-------+-------+
| 1-Jan-19 |    50 |     0 |    40 |    80 |    60 |
| 2-Jan-19 |    60 |    80 |    60 |    80 |    90 |
| 3-Jan-19 |    80 |    20 |     0 |    50 |    30 |
| 4-Jan-19 |    90 |    20 |    10 |    90 |    20 |
| 5-Jan-19 |    80 |     0 |    10 |    10 |     0 |
| 6-Jan-19 |   100 |    90 |   100 |     0 |    10 |
| 7-Jan-19 |    20 |    10 |    30 |    20 |     0 |
+----------+-------+-------+-------+-------+-------+

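Wherever a value is zero, I want to extract that data point (its row label and column label) and collect all of them in a new dataframe.

My desired output is as follows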

+--------------+----------------+
| Missing Date | Missing column |
+--------------+----------------+
| 1-Jan-19     | Loc 2          |
| 3-Jan-19     | Loc 3          |
| 5-Jan-19     | Loc 2          |
| 5-Jan-19     | Loc 5          |
| 6-Jan-19     | Loc 4          |
| 7-Jan-19     | Loc 5          |
+--------------+----------------+

Note that 5-Jan-19 has two entries, Loc 2 and Loc 5.

I know how to do this in Excel VBA. However, I am looking for a more scalable solution with python-pandas.

So far I have tried the following code

import pandas as pd

df = pd.read_csv('data.csv')

new_df = pd.DataFrame(columns=['Missing Date','Missing Column'])

for c in df.columns:
    if c != 'Date':
        if df[df[c] == 0]:
            new_df.append(df[c].index, c)

I am new to pandas, so please guide me on how to solve this.

【Question comments】:

What have you tried?
Updated with my code.

Tags: python pandas

【Solution 1】:

`melt` + `query`

(df.melt(id_vars='Date', var_name='Missing column')
   .query('value == 0')
   .drop(columns='value')
)

        Date Missing column
7   1-Jan-19          Loc 2
11  5-Jan-19          Loc 2
16  3-Jan-19          Loc 3
26  6-Jan-19          Loc 4
32  5-Jan-19          Loc 5
34  7-Jan-19          Loc 5
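
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same `melt` + `query` idea. It rebuilds the table from the question by hand (dates kept as plain strings), and the extra `rename` and `reset_index` steps are my own additions to match the asker's desired headers, not part of the answer above.

```python
import pandas as pd

# Sample data copied from the question; dates are kept as plain strings.
df = pd.DataFrame({
    "Date": ["1-Jan-19", "2-Jan-19", "3-Jan-19", "4-Jan-19",
             "5-Jan-19", "6-Jan-19", "7-Jan-19"],
    "Loc 1": [50, 60, 80, 90, 80, 100, 20],
    "Loc 2": [0, 80, 20, 20, 0, 90, 10],
    "Loc 3": [40, 60, 0, 10, 10, 100, 30],
    "Loc 4": [80, 80, 50, 90, 10, 0, 20],
    "Loc 5": [60, 90, 30, 20, 0, 10, 0],
})

# Wide -> long: one row per (Date, location) pair, reading stored in "value".
long_df = df.melt(id_vars="Date", var_name="Missing column")

# Keep only the zero readings, then tidy up labels and index.
zeros = (
    long_df
    .query("value == 0")
    .drop(columns="value")
    .rename(columns={"Date": "Missing Date"})
    .reset_index(drop=True)
)
print(zeros)
```

The rows come out grouped by location (all of Loc 2 first, then Loc 3, and so on) because `melt` stacks the columns one after another. To order by date, the dates would first have to be parsed, since strings like `1-Jan-19` do not sort chronologically.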

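【Comments】:

Although I managed to solve it myself (see my own answer), I had to accept your code. It is elegant and I love it!

【Solution 2】:

Melt the dataframe using the Date column as id_vars, then filter for the rows where the value is zero (e.g. using .loc[lambda x: x['value'] == 0]). After that it is just cleanup:

Sort the values on Date and Missing column
Drop the value column (it contains only zeros)
Rename Date to Missing Date
Reset the index, dropping the original index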

df = pd.DataFrame({
    'Date': pd.date_range('2019-1-1', '2019-1-7'),
    'Loc 1': [50, 60, 80, 90, 80, 100, 20],
    'Loc 2': [0, 80, 20, 20, 0, 90, 10],
    'Loc 3': [40, 60, 0, 10, 10, 100, 30],
    'Loc 4': [80, 80, 50, 90, 10, 0, 20],
    'Loc 5': [60, 90, 30, 20, 0, 10, 0],
})

df2 = (
    df
    .melt(id_vars='Date', var_name='Missing column')
    .loc[lambda x: x['value'] == 0]
    .sort_values(['Date', 'Missing column'])
    .drop('value', axis='columns')
    # columns= is required; a bare mapping renames index labels instead
    .rename(columns={'Date': 'Missing Date'})
    .reset_index(drop=True)
)
>>> df2
  Missing Date Missing column
0   2019-01-01          Loc 2
1   2019-01-03          Loc 3
2   2019-01-05          Loc 2
3   2019-01-05          Loc 5
4   2019-01-06          Loc 4
5   2019-01-07          Loc 5

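【Comments】:

Thanks, I upvoted your answer. Your code is similar to @ALollz's answer, but I like it. I learned a new method, melt. By the way, I also answered this myself. Take a look at my code and give me your comments so I can improve it further.

【Solution 3】:

I managed to solve this with iterrows().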

import pandas as pd
df = pd.read_csv('data.csv')

cols = ['Missing Date','Missing Column']
data_points = []

for index, row in df.iterrows():
    for c in df.columns:
        if row[c] == 0:
            data_points.append([row['Date'],c])

df_final = pd.DataFrame(data_points, columns=cols)

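【Comments】:

Great question, but your answer itself is not very "pythonic". Check ALollz's answer. The tools he uses are the right ones for this job.
Yes, I accepted his answer. But as a newbie in python-pandas, I am proud that I solved it myself.
iterrows will be very slow on larger datasets and should be avoided.
@Alexander OK. Good to know.
@Alexander @adhg. Yes, you are right. Another piece of my code that uses iterrows() is very slow, so I have asked a new question. I need your help again.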
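
To make the point about iterrows concrete, here is one possible loop-free version that returns the rows in the same row-by-row order as the loop. It is only a sketch: it uses `np.nonzero` on a boolean mask, which is a technique of my choosing rather than anything posted in this thread, and it assumes the data from the question with `Date` as an ordinary column.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question; dates are kept as plain strings.
df = pd.DataFrame({
    "Date": ["1-Jan-19", "2-Jan-19", "3-Jan-19", "4-Jan-19",
             "5-Jan-19", "6-Jan-19", "7-Jan-19"],
    "Loc 1": [50, 60, 80, 90, 80, 100, 20],
    "Loc 2": [0, 80, 20, 20, 0, 90, 10],
    "Loc 3": [40, 60, 0, 10, 10, 100, 30],
    "Loc 4": [80, 80, 50, 90, 10, 0, 20],
    "Loc 5": [60, 90, 30, 20, 0, 10, 0],
})

# Work only on the measurement columns.
values = df.drop(columns="Date")

# Positions (row number, column number) of every zero cell, in row-major order.
rows, cols = np.nonzero(values.eq(0).to_numpy())

# Translate the positions back into date and column labels.
df_final = pd.DataFrame({
    "Missing Date": df["Date"].to_numpy()[rows],
    "Missing Column": values.columns.to_numpy()[cols],
})
print(df_final)
```

The whole table is compared against zero in one call, so no Python-level loop runs over the rows. That is where iterrows loses time on larger frames.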

【Solution 4】:

Here is my crazy answer.

For the dates, you can use:

import numpy as np  # pd.np was removed in pandas 2.0, so use numpy directly
new_dates = np.repeat(df.index, df.eq(0).sum(axis=1).values)

If necessary, replace df.index with df['Date'].

For the values:

cols = np.where(df.eq(0), df.columns, np.nan)  # np is numpy (import numpy as np)
new_cols = cols[pd.notnull(cols)]

Finally,

new_df = pd.DataFrame(new_cols, index=new_dates, columns =['Missing column'])

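Alternatively, you can store the dates in a new column instead of the index.

Now, how does it work?

new_dates takes the index and repeats each date as many times as there are True values in its row. I sum the True values per row because each True counts as 1, and a cell is True exactly where df.eq(0) holds.

Next, I apply a filter that gives the column name where the value is 0 and NaN everywhere else.

Finally, we keep only the non-NaN values and put them in an array, which we then use to build your answer.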
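
The three steps described above can be traced on a tiny frame. This is just an illustrative sketch: the two columns `A` and `B` and the three dates are made up for the walkthrough, and `mask` and `counts` are helper names I introduced to expose the intermediate results.

```python
import numpy as np
import pandas as pd

# Hypothetical three-day example, indexed by date like the answer's toy data.
df = pd.DataFrame(
    {"A": [0, 1, 0], "B": [2, 0, 0]},
    index=pd.date_range("2019-01-01", periods=3, freq="D"),
)

mask = df.eq(0)                       # True where a cell is zero
counts = mask.sum(axis=1).to_numpy()  # zeros per row: [1, 1, 2]

# Step 1: repeat each date once per zero found in its row.
new_dates = np.repeat(df.index, counts)

# Step 2: column name where the cell is zero, NaN elsewhere.
cols = np.where(mask, df.columns, np.nan)

# Step 3: drop the NaNs; boolean indexing flattens row by row,
# so the names line up with the repeated dates.
new_cols = cols[pd.notnull(cols)]

new_df = pd.DataFrame(new_cols, index=new_dates, columns=["Missing column"])
print(new_df)
```

The last date appears twice in `new_df` because both of its cells are zero, which mirrors the two 5-Jan-19 entries in the question.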

Note: I used toy data as my example:

# requires: import numpy as np (pd.np was removed in pandas 2.0)
df = pd.DataFrame(
    {
        "A": np.random.randint(0, 3, 20),
        "B": np.random.randint(0, 3, 20),
        "C": np.random.randint(0, 3, 20),
        "D": np.random.randint(0, 3, 20),
    },
    index=pd.date_range("2019-01-01", periods=20, freq="D"),
)

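【Comments】:

Thank you, upvoted. Yes, your answer is unique too. I learned a lot today. I managed to solve it myself as well, so check my code too (it is lame, but I am proud of it :) ... thanks for the detailed answer).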
