Get rows that have the same value across its columns in pandas: answers

【Question title】: Get rows that have the same value across its columns in pandas
【Posted】: 2014-01-20 10:26:59
【Question description】:

In pandas, given a DataFrame D:

+-----+--------+--------+--------+   
|     |    1   |    2   |    3   |
+-----+--------+--------+--------+
|  0  | apple  | banana | banana |
|  1  | orange | orange | orange |
|  2  | banana | apple  | orange |
|  3  | NaN    | NaN    | NaN    |
|  4  | apple  | apple  | apple  |
+-----+--------+--------+--------+

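When there are three or more columns, how do I return only the rows whose content is the same in every column, so that the result is: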

+-----+--------+--------+--------+   
|     |    1   |    2   |    3   |
+-----+--------+--------+--------+
|  1  | orange | orange | orange |
|  4  | apple  | apple  | apple  |
+-----+--------+--------+--------+

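Note that rows in which all values are NaN are skipped.

If there were only two columns, I would usually use D[D[1]==D[2]], but I don't know how to generalize this to a DataFrame with more than 2 columns.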

【Question discussion】:

Tags: python pandas dataframe

【Solution 1】:

Similar to Andy Hayden's answer, check whether min equals max (in which case all elements of the row are identical):

df[df.apply(lambda x: min(x) == max(x), 1)]
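To try the min/max check end to end, here is a minimal sketch that rebuilds the question's frame (the construction below is my own, not the asker's code). The all-NaN row drops out because NaN never compares equal to itself. As far as I can tell, a row that mixes NaN with strings would make min() and max() raise a TypeError, so this is only a sketch for rows that are either fully populated or entirely NaN.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question (columns labelled 1, 2, 3).
df = pd.DataFrame(
    [
        ["apple", "banana", "banana"],
        ["orange", "orange", "orange"],
        ["banana", "apple", "orange"],
        [np.nan, np.nan, np.nan],
        ["apple", "apple", "apple"],
    ],
    columns=[1, 2, 3],
)

# A row is uniform when its smallest and largest values coincide.
# For the all-NaN row, min and max are both NaN, and NaN == NaN is False.
mask = df.apply(lambda row: min(row) == max(row), axis=1)
result = df[mask]
print(result)
```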

【Discussion】:

【Solution 2】:

My entry:

>>> df
        0       1       2
0   apple  banana  banana
1  orange  orange  orange
2  banana   apple  orange
3     NaN     NaN     NaN
4   apple   apple   apple

[5 rows x 3 columns]
>>> df[df.apply(pd.Series.nunique, axis=1) == 1]
        0       1       2
1  orange  orange  orange
4   apple   apple   apple

[2 rows x 3 columns]

This works because calling pd.Series.nunique on each row gives:

>>> df.apply(pd.Series.nunique, axis=1)
0    2
1    1
2    3
3    0
4    1
dtype: int64

Note: this will, however, keep rows that look like [nan, nan, apple] or [nan, apple, apple]. Usually that is what I want, but it might be the wrong answer for your use case.
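If partially missing rows should not count as uniform, one option is to combine the nunique test with a check that the row has no NaN at all. The sketch below assumes a pandas version that provides DataFrame.nunique, and it adds a hypothetical sixth row, [nan, apple, apple], to show the difference between the two masks.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        ["apple", "banana", "banana"],
        ["orange", "orange", "orange"],
        ["banana", "apple", "orange"],
        [np.nan, np.nan, np.nan],
        ["apple", "apple", "apple"],
        [np.nan, "apple", "apple"],  # hypothetical extra row, not in the question
    ]
)

# nunique ignores NaN by default, so the partially missing row counts as uniform.
lenient = df[df.nunique(axis=1) == 1]

# Additionally require every cell in the row to be present.
strict = df[(df.nunique(axis=1) == 1) & df.notna().all(axis=1)]

print(lenient.index.tolist())
print(strict.index.tolist())
```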

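【Discussion】:

Regarding the note: one could dropna() the NaN values first. Then it should work fine, shouldn't it?
Is there an easy way to modify this so that it also keeps rows like the first one ("apple", "banana", "banana")? I need to do something similar, but keep the rows that have "at least" two equal values.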
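For the "at least two equal values" variant asked about above, one possible sketch (my own suggestion, not something given in the thread) is to compare the number of distinct values in a row with the number of non-missing values. If there are fewer distinct values than non-missing cells, some value must repeat.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        ["apple", "banana", "banana"],
        ["orange", "orange", "orange"],
        ["banana", "apple", "orange"],
        [np.nan, np.nan, np.nan],
        ["apple", "apple", "apple"],
    ]
)

# count(axis=1) gives the non-missing cells per row and nunique(axis=1)
# gives the distinct non-missing values. A repeat exists when the
# second number is smaller than the first.
has_repeat = df.nunique(axis=1) < df.count(axis=1)
result = df[has_repeat]
print(result)
```

The all-NaN row is excluded because both counts are zero for it.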

【Solution 3】:

I would check whether each row is equal to its first element:

In [11]: df.eq(df[1], axis='index')  # Note: funky broadcasting with df == df[1]
Out[11]: 
      1      2      3
0  True  False  False
1  True   True   True
2  True  False  False
3  True   True   True
4  True   True   True

[5 rows x 3 columns]

If every entry in a row is True, then all elements of that row are the same:

In [12]: df.eq(df[1], axis='index').all(1)
Out[12]: 
0    False
1     True
2    False
3     True
4     True
dtype: bool

Restrict the DataFrame to those rows and, optionally, dropna:

In [13]: df[df.eq(df[1], axis='index').all(1)]
Out[13]: 
        1       2       3
1  orange  orange  orange
3     NaN     NaN     NaN
4   apple   apple   apple

[3 rows x 3 columns]

In [14]: df[df.eq(df[1], axis='index').all(1)].dropna()
Out[14]: 
        1       2       3
1  orange  orange  orange
4   apple   apple   apple

[2 rows x 3 columns]
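The same steps can be put together as a self-contained sketch. The frame construction and variable names are mine. As far as I know, how NaN compares under eq has not been the same across pandas and NumPy releases, so the all-NaN row may or may not pass the mask on a given installation. Ending with dropna() gives the same final result either way.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        ["apple", "banana", "banana"],
        ["orange", "orange", "orange"],
        ["banana", "apple", "orange"],
        [np.nan, np.nan, np.nan],
        ["apple", "apple", "apple"],
    ],
    columns=[1, 2, 3],
)

# Compare every column against the first one, aligning on the row index.
same_as_first = df.eq(df[1], axis="index")

# A row is uniform when every comparison in it is True.
mask = same_as_first.all(axis=1)

# dropna() removes the all-NaN row if the comparison let it through.
result = df[mask].dropna()
print(result)
```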

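【Discussion】:

【Solution 4】:

You can use set to build a list of the index positions that satisfy your rule, and then use that list to slice the DataFrame. For example: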

import pandas as pd
import numpy as np

D = {0  : ['apple' , 'banana', 'banana'], 1 : ['orange', 'orange', 'orange'], 2: ['banana', 'apple', 'orange'], 3: [np.nan, np.nan, np.nan], 4 : ['apple', 'apple', 'apple']} 
DF = pd.DataFrame(D).T

Equal = [row for row in DF.index if len(set(DF.iloc[row])) == 1]

DF.iloc[Equal]

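Note that this excludes the all-missing row without having to filter it out explicitly. This follows from how missing values behave in a Series: NaN does not compare equal to itself, so the set built from that row does not collapse to a single element.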
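Relying on NaN inside a set can be fragile, because Python sets check identity before equality, so a row that happens to hold the very same NaN object several times could collapse to one element. A slightly more defensive sketch (my own variation, with hypothetical names) tests for missing values explicitly and selects by label with loc, so that it does not depend on the index being 0..n-1.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        ["apple", "banana", "banana"],
        ["orange", "orange", "orange"],
        ["banana", "apple", "orange"],
        [np.nan, np.nan, np.nan],
        ["apple", "apple", "apple"],
    ]
)

# Keep the labels of rows that have no missing values and exactly one
# distinct value.
equal_labels = [
    label
    for label, row in df.iterrows()
    if row.notna().all() and len(set(row)) == 1
]

result = df.loc[equal_labels]
print(result)
```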

【Discussion】:

【Solution 5】:

Based on DSM's answer, you may want this function:

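import pandas as pd

def filter_data(df):
    # dropna() returns a new frame; with inplace=True it would return None.
    df = df.dropna()
    # Keep only the rows that contain exactly one distinct value.
    df = df[df.apply(pd.Series.nunique, axis=1) == 1]
    return df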

【Discussion】:

【Solution 6】:

In newer versions of pandas, you can call nunique directly on the DataFrame:

In [815]: df[df.nunique(1).eq(1)]
Out[815]:
        0       1       2
1  orange  orange  orange
4   apple   apple   apple

Details

In [816]: df
Out[816]:
        0       1       2
0   apple  banana  banana
1  orange  orange  orange
2  banana   apple  orange
3     NaN     NaN     NaN
4   apple   apple   apple

In [817]: df.nunique(1)
Out[817]:
0    2
1    1
2    3
3    0
4    1
dtype: int64

In [818]: df.nunique(1).eq(1)
Out[818]:
0    False
1     True
2    False
3    False
4     True
dtype: bool

【Discussion】: