Python Pandas: list columns with the same values for each row

【Question title】: Python Pandas: list columns with the same values for each row
【Posted】: 2017-03-14 16:12:47
【Question】:

I have a DataFrame that looks like this:

import pandas as pd
import numpy as np

raw_data = {'col1': ['a', 'b', 'c', 'd', 'e'],
        'col2': [1, 2, 3, 4, np.nan],
        'col3': ['aa','b','cc','d','ff'],
        'col4': [4, 6, 3, 4, np.nan]
        }
df = pd.DataFrame(raw_data, columns = ['col1','col2','col3','col4']) 

  col1  col2 col3  col4
0    a   1.0   aa   4.0
1    b   2.0    b   6.0
2    c   3.0   cc   3.0
3    d   4.0    d   4.0
4    e   NaN   ff   NaN

For each row, I want to find all columns that hold the same value. So the result should look like this:

Row 1: col1 eq col3;
Row 2: col2 eq col4;
Row 3: col1 eq col3; col2 eq col4

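The DataFrame has both string and numeric columns, so it may be worth converting everything to str. NaN values should be ignored, because there are a lot of missing values =)

Thanks a lot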
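A small sketch of why the "convert everything to str" idea needs some care, assuming the sample frame from the question. A direct `==` already treats NaN as unequal to NaN, whereas turning each cell into text makes both NaNs the string 'nan', which then compares equal. `map(str)` is used here as one way to do the conversion; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': ['a', 'b', 'c', 'd', 'e'],
    'col2': [1, 2, 3, 4, np.nan],
    'col3': ['aa', 'b', 'cc', 'd', 'ff'],
    'col4': [4, 6, 3, 4, np.nan],
})

# NaN never equals NaN, so a direct comparison already skips missing values.
direct = df['col2'] == df['col4']

# Applying Python's str() to each cell turns NaN into the text 'nan',
# so the two missing cells in the last row now compare equal.
as_text = df['col2'].map(str) == df['col4'].map(str)

# One way to keep the str conversion and still ignore missing values:
# additionally require both cells to be present.
both_present = df['col2'].notna() & df['col4'].notna()
as_text_masked = as_text & both_present

print(direct.tolist())
print(as_text.tolist())
print(as_text_masked.tolist())
```

If the conversion to str is only meant to let string and numeric columns be compared, masking with `notna()` as above is one option; comparing the raw values without any cast is another.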

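【Comments】:

I second @not_a_robot's comment. I can't make sense of your requirements and the stated result.
Reposting my deleted comment: could you clarify how col1 equals col3 in row 3 (index 2)? I only see c for col1 and cc for col3, which technically are not equal (although c is a proper subset of cc). The indices shown in your desired output seem to be off...
I counted rows starting from 0. My bad, I should have said "index" instead. The values 'c' and 'cc' should not be considered equal.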

Tags: python pandas dataframe

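【Solution 1】:

Here is another answer I came up with. I wasn't sure what to output for rows where no columns share a value, so I simply skip those rows in the output. I also added a row in which many columns have the same value, to show what happens there.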

import pandas as pd
import numpy as np

raw_data = {'col1': ['a', 'b', 'c', 'd', 'e', 1],
        'col2': [1, 2, 3, 4, np.nan, 1],
        'col3': ['aa','b','cc','d','ff', 1],
        'col4': [4, 6, 3, 4, np.nan, 1],
        }
df = pd.DataFrame(raw_data, columns = ['col1','col2','col3','col4']) 

for row in df.itertuples():
    cells = row[1:]  # Drop the leading Index field so it is never compared
    values = list(set(cells))  # Get the unique values in the row
    equal_columns = []  # Keep track of column names that are the same
    for v in values:
        # Column names that have this value
        columns = [df.columns[i] for i, x in enumerate(cells) if x == v]
        if len(columns) > 1:
            # If more than 1 column with this value, append to the list
            equal_columns.append(' eq '.join(columns))
    if len(equal_columns) > 0:
        # We have at least 1 set of equal columns
        equal_columns.sort()  # So we always start printing in lexicographic order
        print('Row {0}: {1};'.format(row.Index, '; '.join(equal_columns)))

This gives me the output:

Row 1: col1 eq col3;
Row 2: col2 eq col4;
Row 3: col1 eq col3; col2 eq col4;
Row 5: col1 eq col2 eq col3 eq col4;
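The same idea can be sketched as a reusable function that returns the groups instead of printing them. This is only an illustration: `equal_column_groups` is a hypothetical helper name, and it groups cells with a dict, which assumes the cell values are hashable and that dict-key equality matches `==` for them (true for the strings and numbers in this example). Missing values are skipped explicitly.

```python
from collections import defaultdict

import numpy as np
import pandas as pd


def equal_column_groups(frame):
    """Map each row label to the groups of columns sharing a value (hypothetical helper)."""
    result = {}
    for label, row in frame.iterrows():
        by_value = defaultdict(list)
        for col, value in row.items():
            if pd.isna(value):
                continue  # Missing values are never treated as equal
            by_value[value].append(col)
        # Keep only values that occur in more than one column
        groups = [cols for cols in by_value.values() if len(cols) > 1]
        if groups:
            result[label] = groups
    return result


df = pd.DataFrame({
    'col1': ['a', 'b', 'c', 'd', 'e', 1],
    'col2': [1, 2, 3, 4, np.nan, 1],
    'col3': ['aa', 'b', 'cc', 'd', 'ff', 1],
    'col4': [4, 6, 3, 4, np.nan, 1],
})

groups = equal_column_groups(df)
for label, cols in groups.items():
    print('Row {0}: {1};'.format(label, '; '.join(' eq '.join(g) for g in cols)))
```

Returning a dict keeps the formatting separate from the comparison, which may be convenient if the result has to be written somewhere other than stdout.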

【Comments】:

【Solution 2】:

Here is a for-loop solution you can use... maybe piRSquared can come up with a better pure-pandas solution. This should work in a pinch.

row_eqs = {}

# For each row
for idx in df.index:
    # Make a set of all "column equivalencies" for each row
    row_eqs[idx] = set()
    for col in df.columns:
        # Look at all of the other columns that aren't `col`        
        other_cols = [c for c in df.columns if c != col]
        # Column value
        col_row_value = df.loc[idx, col]
        for c in other_cols:
            # Other column row value
            c_row_value = df.loc[idx, c]
            if c_row_value == col_row_value:
                # Just make your strings here since lists and sets aren't hashable
                eq = ' eq '.join(sorted((c, col)))
                row_eqs[idx].add(eq)

Printing the results:

for idx in row_eqs:
    if row_eqs[idx]:
        # Sort the pairs so the output order does not depend on set ordering
        print('Row %d: %s' % (idx, '; '.join(sorted(row_eqs[idx]))))

Row 1: col1 eq col3
Row 2: col2 eq col4
Row 3: col1 eq col3; col2 eq col4

Edit: a slightly faster approach that precomputes every pair of columns up front:

import itertools

column_combos = list(itertools.combinations(df.columns, 2))

for idx in df.index:
    row_eqs[idx] = set()
    for col1, col2 in column_combos:
        col1_value = df.loc[idx, col1]
        col2_value = df.loc[idx, col2]
        if col1_value == col2_value:
            eq = ' eq '.join(sorted((col1, col2)))
            row_eqs[idx].add(eq)

I don't know how large your data is, but the latter solution is about 25% faster than the former.
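One plausible reason for the difference, sketched below under the assumption of the four columns from the question: the first version checks each column against every other column, so every pair is visited twice, while the `itertools.combinations` version visits each unordered pair once. The 25% figure is the answerer's own measurement; the count below only shows how many comparisons each version makes per row.

```python
from itertools import combinations, permutations

columns = ['col1', 'col2', 'col3', 'col4']

# First version: each column against every other column (ordered pairs).
ordered_pairs = list(permutations(columns, 2))

# Second version: each unordered pair exactly once.
unordered_pairs = list(combinations(columns, 2))

print(len(ordered_pairs), len(unordered_pairs))
```

Halving the comparisons does not halve the runtime, presumably because the per-cell `df.loc` lookups and the loop overhead remain in both versions.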

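【Comments】:

I'd probably try converting things to sets and using <= to determine whether one is a subset of the other. But I won't spend time on it without clarification.
Thanks! The dataset isn't large (~700Mb). However, the number of columns that share a value in each row is unknown. @piRSquared thank you for your time.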

【Solution 3】:

Assume we have the following DF:

In [1]: from numpy import nan
   ...: from itertools import combinations
   ...: import pandas as pd
   ...: 
   ...: df = pd.DataFrame(
   ...: {'col1': {0: 'a', 1: 'b', 2: 'c', 3: 'd', 4: 'e'},
   ...:  'col2': {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0, 4: nan},
   ...:  'col3': {0: 'aa', 1: 'b', 2: 'cc', 3: 'd', 4: 'ff'},
   ...:  'col4': {0: 4.0, 1: 6.0, 2: 3.0, 3: 4.0, 4: nan},
   ...:  'col5': {0: nan, 1: 'b', 2: 'c', 3: nan, 4: 'e'}})
   ...:

In [2]: df
Out[2]:
  col1  col2 col3  col4 col5
0    a   1.0   aa   4.0  NaN
1    b   2.0    b   6.0    b
2    c   3.0   cc   3.0    c
3    d   4.0    d   4.0  NaN
4    e   NaN   ff   NaN    e

Let's generate a query from all combinations of columns that share the same data type:

In [3]: qry = \
   ...: (df.dtypes
   ...:    .reset_index(name='type')
   ...:    .groupby('type')['index']
   ...:    .apply(lambda x:
   ...:             '\n'.join(['{0[0]}_{0[1]} = ({0[0]} == {0[1]})'.format(tup, tup)
   ...:                          for tup in combinations(x, 2)]))
   ...:    .str.cat(sep='\n')
   ...: )

In [5]: print(qry)
col2_col4 = (col2 == col4)
col1_col3 = (col1 == col3)
col1_col5 = (col1 == col5)
col3_col5 = (col3 == col5)

Now we can do this:

In [6]: cols = df.columns.tolist()

In [7]: (df.eval(qry, inplace=False)
   ...:    .drop(columns=cols)
   ...:    .apply(lambda r: ';'.join(r.index[r].tolist()).replace('_',' == '), axis=1)
   ...: )
Out[7]:
0
1    col1 == col3;col1 == col5;col3 == col5
2                 col2 == col4;col1 == col5
3                 col2 == col4;col1 == col3
4                              col1 == col5
dtype: object

Explanation:

In [9]: df.eval(qry, inplace=False).drop(columns=cols)
Out[9]:
  col2_col4 col1_col3 col1_col5 col3_col5
0     False     False     False     False
1     False      True      True      True
2      True     False      True     False
3      True      True     False     False
4     False     False      True     False
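For comparison, the same boolean table can be built without generating a query string. This is a sketch of an alternative, not the answerer's method: it groups column names by dtype with a plain dict and compares each pair within a group directly. The names `by_dtype`, `flags` and `summary` are illustrative.

```python
from itertools import combinations

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': ['a', 'b', 'c', 'd', 'e'],
    'col2': [1.0, 2.0, 3.0, 4.0, np.nan],
    'col3': ['aa', 'b', 'cc', 'd', 'ff'],
    'col4': [4.0, 6.0, 3.0, 4.0, np.nan],
    'col5': [np.nan, 'b', 'c', np.nan, 'e'],
})

# Group column names by the text form of their dtype.
by_dtype = {}
for name, dtype in df.dtypes.items():
    by_dtype.setdefault(str(dtype), []).append(name)

# Compare every pair of columns within each dtype group.
flags = {}
for names in by_dtype.values():
    for left, right in combinations(names, 2):
        flags['{}_{}'.format(left, right)] = df[left] == df[right]
flags = pd.DataFrame(flags)

# For each row, list the pair names whose comparison was True.
summary = {label: [name for name, hit in row.items() if hit]
           for label, row in flags.iterrows()}
print(summary)
```

The column order of `flags` may differ from the `eval` output above, since it follows the order in which the dtype groups are encountered.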

【Comments】:

【Solution 4】:

Another efficient approach:

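import numpy as np

a = df.values
# equality[r, i, j] is True when row r holds the same value in columns i and j
equality = (a[:, np.newaxis, :] == a[:, :, np.newaxis])
# Keep only the strict upper triangle so each pair is reported once
match = row, col1, col2 = np.triu(equality, 1).nonzero()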

match is now:

(array([1, 2, 3, 3], dtype=int64),
 array([0, 1, 0, 1], dtype=int64),
 array([2, 3, 2, 3], dtype=int64))

Then pretty-print it:

dfc=df.columns    
for i,r in enumerate(row):
    print( str(r),' : ',str(dfc[col1[i]]),'=',str(dfc[col2[i]]))

Which gives:

1  :  col1 = col3
2  :  col2 = col4
3  :  col1 = col3
3  :  col2 = col4
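To get from these index triples to the one-line-per-row format asked for in the question, the pairs can be collected per row first. A minimal sketch, assuming the four-column frame from the question; `pairs_by_row` is an illustrative name, and the explicit `astype(bool)` is only a precaution for the object-dtype comparison.

```python
from collections import defaultdict

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': ['a', 'b', 'c', 'd', 'e'],
    'col2': [1, 2, 3, 4, np.nan],
    'col3': ['aa', 'b', 'cc', 'd', 'ff'],
    'col4': [4, 6, 3, 4, np.nan],
})

a = df.to_numpy(dtype=object)
# equality[r, i, j] is True when row r has the same value in columns i and j.
equality = (a[:, np.newaxis, :] == a[:, :, np.newaxis]).astype(bool)
# The strict upper triangle keeps each pair once and drops the i == j diagonal.
rows, left, right = np.triu(equality, 1).nonzero()

# Collect the matching column pairs under their row label.
pairs_by_row = defaultdict(list)
for r, i, j in zip(rows, left, right):
    pairs_by_row[df.index[r]].append('{} eq {}'.format(df.columns[i], df.columns[j]))

for label, pairs in pairs_by_row.items():
    print('Row {}: {}'.format(label, '; '.join(pairs)))
```

Note that the intermediate `equality` array has shape (rows, columns, columns), so memory use grows with the square of the column count.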

【Comments】: