Compare 2 Pandas dataframes, row by row, cell by cell (answers)

【Question title】: Compare 2 Pandas dataframes, row by row, cell by cell
【Posted】: 2017-01-20 08:50:27
【Question description】:

I have 2 dataframes, df1 and df2, and want to do the following, storing the results in df3:

for each row in df1:
    for each row in df2:
        create a new row in df3 (called "df1-1, df2-1" or whatever) to store results
        for each cell (column) in df1:
            for the cell in df2 whose column name is the same as for the cell in df1:
                compare the cells (using some comparing function func(a, b)) and,
                depending on the result of the comparison, write the result into the
                appropriate column of the "df1-1, df2-1" row of df3

For example:

df1
A   B    C      D
foo bar  foobar 7
gee whiz herp   10

df2
A   B   C      D
zoo car foobar 8

df3
df1-df2 A             B              C                   D
foo-zoo func(foo,zoo) func(bar,car)  func(foobar,foobar) func(7,8)
gee-zoo func(gee,zoo) func(whiz,car) func(herp,foobar)   func(10,8)

I started with this:

for r1 in df1.iterrows():
    for r2 in df2.iterrows():
        for c1 in r1:
            for c2 in r2:

But I'm not sure where to go from here, and would appreciate some help.

【Comments】:

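Since you're applying func to columns with the same name, couldn't you just iterate over the columns and use vectorization, e.g. df3['A'] = func(df1['A'], df2['A']), and so on?
@StarFox Interesting, so I could do: for column in df3: df3[column] = func(df1[column], df2[column])?
Sure! That's the power of pandas/numpy (vectorization, generally). I'll provide some examples below and we can go from there.
I think you could build a solution on the Cartesian product of your 2 dataframes; see here for a starting point: stackoverflow.com/questions/13269890/…
@Svend That looks like a promising idea, thanks! I'll try StarFox's solution first, though.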
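
The Cartesian-product idea from the last comment could be sketched roughly as follows. This is a minimal sketch rather than anyone's posted solution: it assumes pandas 1.2 or later (for how="cross"), uses an equality test as a hypothetical stand-in for func, and the "_1"/"_2" suffixes are illustrative choices.

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["foo", "gee"], "B": ["bar", "whiz"],
                    "C": ["foobar", "herp"], "D": [7, 10]})
df2 = pd.DataFrame({"A": ["zoo"], "B": ["car"], "C": ["foobar"], "D": [8]})

def func(a, b):
    # hypothetical comparison: element-wise equality of two aligned Series
    return a == b

# pair every row of df1 with every row of df2 (requires pandas >= 1.2)
pairs = df1.merge(df2, how="cross", suffixes=("_1", "_2"))

# label each result row, then compare same-named columns pair by pair
df3 = pd.DataFrame({"df1-df2": pairs["A_1"] + "-" + pairs["A_2"]})
for col in df1.columns:
    df3[col] = func(pairs[f"{col}_1"], pairs[f"{col}_2"])

print(df3)
```

Each row of df3 then corresponds to one (df1 row, df2 row) pair, which matches the layout asked for in the question.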

Tags: python pandas dataframe iterator iteration

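【Solution 1】:

So, to continue the discussion from the comments: you can use vectorization, which is one of the selling points of a library like pandas or numpy. Ideally, you should never need to call iterrows(). To make my suggestion a bit more explicit: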

# with df1 and df2 provided as above, an example
df3 = df1['A'] * 3 + df2['A']

# recall that df2 has only one row, so index alignment leaves a NaN for row 1
df3
0    foofoofoozoo
1             NaN
Name: A, dtype: object

# more generally

# we know that df1 and df2 share column names, so we can initialize df3 with those names
df3 = pd.DataFrame(columns=df1.columns) 
for colName in df1:
    df3[colName] = func(df1[colName], df2[colName])
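
With the sample data, df2 has a single row, so index-aligned operations produce NaN for every df1 row after the first. One way around that, sketched here under the assumption that func is an equality test (a hypothetical stand-in), is to pull df2's only value out as a scalar so it broadcasts over all rows of df1:

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["foo", "gee"], "B": ["bar", "whiz"],
                    "C": ["foobar", "herp"], "D": [7, 10]})
df2 = pd.DataFrame({"A": ["zoo"], "B": ["car"], "C": ["foobar"], "D": [8]})

def func(a, b):
    # hypothetical comparison: True where the values are equal
    return a == b

df3 = pd.DataFrame(index=df1.index, columns=df1.columns)
for col_name in df1:
    # .iloc[0] takes df2's single value as a scalar, which broadcasts over df1's rows
    df3[col_name] = func(df1[col_name], df2[col_name].iloc[0])

print(df3)
```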

Now you can even apply different functions to different columns by, say, creating lambda functions and then zipping them with the column names:

# some example functions
colAFunc = lambda x, y: x + y
colBFunc = lambda x, y: x - y
....
columnFunctions = [colAFunc, colBFunc, ...]

# initialize df3 as above
df3 = pd.DataFrame(columns=df1.columns)
for func, colName in zip(columnFunctions, df1.columns):
    df3[colName] = func(df1[colName], df2[colName])
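
A variation on the same idea, as a sketch only: keying the functions by column name in a dict avoids relying on the order of a list. The specific functions below are hypothetical examples chosen to suit the sample data, and df2's single row is used as a set of scalars.

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["foo", "gee"], "B": ["bar", "whiz"],
                    "C": ["foobar", "herp"], "D": [7, 10]})
df2 = pd.DataFrame({"A": ["zoo"], "B": ["car"], "C": ["foobar"], "D": [8]})

# hypothetical per-column comparisons, keyed by column name
column_functions = {
    "A": lambda x, y: x + "-" + y,  # join the two strings
    "B": lambda x, y: x == y,       # equality test
    "C": lambda x, y: x == y,       # equality test
    "D": lambda x, y: x - y,        # numeric difference
}

other = df2.iloc[0]  # df2's only row, used as scalars
df3 = pd.DataFrame(index=df1.index)
for col_name, fn in column_functions.items():
    df3[col_name] = fn(df1[col_name], other[col_name])

print(df3)
```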

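The only "gotcha" that comes to mind is that you need to make sure your function is applicable to the data in your columns. For instance, if you were to do something like df1['A'] - df2['A'] (with df1, df2 as you have provided), that would raise a TypeError, since subtraction of two strings is undefined. Just something to be aware of.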
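
If some columns may hold data the function cannot handle, one option is to guard the call. The helper below, safe_compare, is a hypothetical name introduced for illustration; the sketch assumes object-dtype string columns, where the unsupported subtraction surfaces as a TypeError, and falls back to NaN in that case.

```python
import numpy as np
import pandas as pd

def safe_compare(func, left, right):
    """Apply func to two Series; return an all-NaN Series if func is undefined for the data."""
    try:
        return func(left, right)
    except TypeError:
        return pd.Series(np.nan, index=left.index)

strings_1 = pd.Series(["foo", "gee"], dtype=object)
strings_2 = pd.Series(["zoo", "car"], dtype=object)
numbers_1 = pd.Series([7, 10])
numbers_2 = pd.Series([8, 8])

subtract = lambda x, y: x - y
text_result = safe_compare(subtract, strings_1, strings_2)    # subtraction undefined for strings
number_result = safe_compare(subtract, numbers_1, numbers_2)  # ordinary numeric difference
```

Whether silently returning NaN is preferable to letting the error propagate depends on the use case; failing loudly is often the safer default.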

Edit, re: your comment: that is doable as well. Iterate over whichever dfX.columns is larger, so you don't run into a KeyError, and throw an if statement in there:

# all the other jazz
# let's say df1 is [['A', 'B', 'C']] and df2 is [['A', 'B', 'C', 'D']]
# so iterate over df2 columns
for colName in df2:
    if colName not in df1:
        df3[colName] = np.nan # be sure to import numpy as np
    else:
        df3[colName] = func(df1[colName], df2[colName])
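
A self-contained sketch of this idea, assuming an equality test as a hypothetical stand-in for func and sample frames invented for illustration (df2 has an extra column D). It iterates over the union of both column sets, so a column missing from either frame ends up as NaN:

```python
import numpy as np
import pandas as pd

# illustrative data: df2 has a column "D" that df1 lacks
df1 = pd.DataFrame({"A": ["foo", "gee"], "B": ["bar", "whiz"],
                    "C": ["foobar", "herp"]})
df2 = pd.DataFrame({"A": ["zoo", "gee"], "B": ["car", "whiz"],
                    "C": ["foobar", "x"], "D": [8, 9]})

def func(a, b):
    # hypothetical comparison: element-wise equality
    return a == b

all_columns = df1.columns.union(df2.columns)
df3 = pd.DataFrame(index=df1.index)
for col_name in all_columns:
    if col_name in df1 and col_name in df2:
        df3[col_name] = func(df1[col_name], df2[col_name])
    else:
        df3[col_name] = np.nan  # column missing from one of the frames

print(df3)
```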

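【Discussion】:

Yes, this is very helpful, and I've accepted it as the answer. Thank you very much for your time! Could it be modified to work when the number of columns is not equal? That is, df1 might have columns that don't exist in df2; the comparison function should just output something like N/A.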