How to highlight differences between two data frames in pandas: answers

【Question title】: How to highlight differences between the two data frames in pandas
【Posted】: 2022-01-03 22:29:55
【Question description】:

I have two data frames in pandas with the same columns; column A serves as the index.

Actual:

  A      B       C 
  1     apple   red
  2     berry   blue 
  3    grapes   green

The second data frame

Expected:

  A    B     C
  1   apple  green
  2   guava  blue
  3   grapes  green

Now I need to compare the two data frames, highlight the cells that do not match, and then export the output to Excel.

My code:

import pandas as pd

pd.concat([pd.concat([actual,expected,expected]).drop_duplicates(keep=False)]).to_excel(.......)

Output:

  A   B   C
  1  apple red
  2  berry blue

I need the cells red and berry to be highlighted.
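For reference, a minimal sketch of why the concat/drop_duplicates approach returns whole rows rather than individual cells. The frames are reconstructed from the tables in the question, with A set as the index as described there; the names row_diff and cell_mask are introduced here only for illustration.

```python
import pandas as pd

# Frames reconstructed from the tables in the question (A is the index).
actual = pd.DataFrame(
    {"A": [1, 2, 3], "B": ["apple", "berry", "grapes"], "C": ["red", "blue", "green"]}
).set_index("A")
expected = pd.DataFrame(
    {"A": [1, 2, 3], "B": ["apple", "guava", "grapes"], "C": ["green", "blue", "green"]}
).set_index("A")

# Every row of `expected` appears twice, so keep=False drops all of them,
# together with any row of `actual` that has an identical twin in `expected`.
# What remains are the rows of `actual` without an exact match.
row_diff = pd.concat([actual, expected, expected]).drop_duplicates(keep=False)
print(row_diff)

# A cell-level view needs an element-wise comparison instead:
# True marks a cell whose value differs between the two frames.
cell_mask = actual.ne(expected)
print(cell_mask)
```

The row-level result says which rows changed but not which cells, which is why the answers below work with an element-wise comparison.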

【Question discussion】:

What do you mean by "highlight"?

Tags: pandas

【Solution 1】:

There is a function, compare (DataFrame.compare, available since pandas 1.1), that can help you compare two datasets: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.compare.html

import pandas as pd

df = pd.DataFrame({
        "A": [1, 2, 3],
        "B": ["apple", "berry", "grapes"],
        "C": ["red", "blue", "green"]
    },
    columns=["A", "B", "C"])

df2 = df.copy()
df2.loc[0, 'C'] = 'green'
df2.loc[2, 'B'] = 'guava'

With it, you can compare the two datasets:

df.compare(df2)

which gives you:

    B               C
    self    other   self    other
0   NaN     NaN     red     green
2   grapes  guava   NaN     NaN
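As a small sketch of two other documented options of DataFrame.compare, using the same df and df2 as in this answer: align_axis=0 stacks the self/other values as rows instead of columns, and keep_equal=True shows equal values instead of NaN. The variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3],
    "B": ["apple", "berry", "grapes"],
    "C": ["red", "blue", "green"],
})
df2 = df.copy()
df2.loc[0, "C"] = "green"
df2.loc[2, "B"] = "guava"

# Default layout: one (self, other) column pair per column that differs.
diff = df.compare(df2)
print(diff)

# Same information, with self/other stacked as rows under each original index.
stacked = df.compare(df2, align_axis=0)
print(stacked)

# Keep the values of equal cells instead of replacing them with NaN.
with_equal = df.compare(df2, keep_equal=True)
print(with_equal)
```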

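By filtering out the rows (the identical ones) and columns you do not need, you get a data frame that contains only the values that differ from the original data frame: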

compare = df.compare(df2, keep_shape=True).drop('other', level=1, axis=1)
compare = compare.droplevel(1, axis=1).dropna(how='all')

    A   B       C
0   NaN NaN     red
2   NaN grapes  NaN

Next, we select from the original dataset the rows that appear in compare:

filtered = df.loc[compare.index]

Now we can "highlight" the differing data in some way:

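def color_cells(s):
    # Style only the cells that compare kept, i.e. the ones that differ
    if pd.notna(s):
        return 'color:{0}; font-weight:bold'.format('red')
    else:
        return ''

# axis=None hands the whole frame to the function, which returns a frame of CSS strings.
# On pandas >= 2.1, DataFrame.map is the preferred name for applymap.
filtered.style.apply(lambda x: compare.applymap(color_cells), axis=None)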

This should render the differing cells (red and grapes) in bold red text.
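The question also asked for an Excel export, which the snippet above does not cover. A minimal sketch of one way to do it: build a frame of CSS strings from an element-wise comparison, apply it with Styler.apply(axis=None), and write the result with Styler.to_excel. Unlike the answer above, this keeps all rows rather than only the changed ones. The helper names diff_styles and highlight_diff and the file name diff.xlsx are hypothetical, and the export line assumes jinja2 and openpyxl are installed.

```python
import numpy as np
import pandas as pd


def diff_styles(left, right, css="color: red; font-weight: bold"):
    """Return a frame of CSS strings: `css` where cells differ, '' elsewhere.

    Assumes both frames share the same index and columns.
    """
    mask = left.ne(right)
    return pd.DataFrame(np.where(mask, css, ""), index=left.index, columns=left.columns)


def highlight_diff(left, right):
    """Return a Styler over `left` with the differing cells highlighted."""
    styles = diff_styles(left, right)
    # With axis=None the function receives the whole frame and must
    # return a frame of CSS strings with the same index and columns.
    return left.style.apply(lambda _: styles, axis=None)


df = pd.DataFrame({
    "A": [1, 2, 3],
    "B": ["apple", "berry", "grapes"],
    "C": ["red", "blue", "green"],
})
df2 = df.copy()
df2.loc[0, "C"] = "green"
df2.loc[2, "B"] = "guava"

styles = diff_styles(df, df2)
print(styles)

# Writing the workbook needs jinja2 and openpyxl; "diff.xlsx" is a placeholder name.
# highlight_diff(df, df2).to_excel("diff.xlsx", engine="openpyxl")
```

Note that ne treats two NaN cells as different, so frames with missing values may need the mask adjusted before styling.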

【Discussion】:

This is what I would do as well.

【Solution 2】:

This can be done easily with StyleFrame, as shown in this post: color specific cells in excel python.

First, run pip install styleframe. Then follow the steps below:

from styleframe import StyleFrame, Styler
import pandas as pd

# true and expected data
d_true = {'A':[1,2,3], 'B':['apple', 'berry', 'grapes'], 'C':['red', 'blue', 'green']}
df_true = pd.DataFrame(d_true)

d_exp = {'A':[1,2,3], 'B':['apple', 'guava', 'grapes'], 'C':['green', 'blue', 'green']}
df_exp = pd.DataFrame(d_exp)

# wrap each DataFrame in a StyleFrame
sf1 = StyleFrame(df_true)
sf2 = StyleFrame(df_exp)

# Set color for differences 
sf1_diff = Styler(bg_color='#FFCCCC') # red
sf2_diff = Styler(bg_color='#DAF6FF') # blue

# Difference matrix 
ne = sf1.data_df != sf2.data_df

# apply the color styles above wherever the difference matrix is True
for col in ne.columns:
    sf1.apply_style_by_indexes(indexes_to_style=ne[ne[col]].index,
                               styler_obj=sf1_diff,
                               cols_to_style=col)
    sf2.apply_style_by_indexes(indexes_to_style=ne[ne[col]].index,
                               styler_obj=sf2_diff,
                               cols_to_style=col)

# save the Excel files
sf1.to_excel('df1_diff_in_color.xlsx').save()
sf2.to_excel('df2_diff_in_color.xlsx').save()
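To see which cells the loop above ends up styling, the difference matrix can be inspected with plain pandas, without StyleFrame. A minimal sketch with the same data; the dictionary changed is introduced here only for illustration.

```python
import pandas as pd

df_true = pd.DataFrame(
    {"A": [1, 2, 3], "B": ["apple", "berry", "grapes"], "C": ["red", "blue", "green"]}
)
df_exp = pd.DataFrame(
    {"A": [1, 2, 3], "B": ["apple", "guava", "grapes"], "C": ["green", "blue", "green"]}
)

# Element-wise comparison: True where the two frames disagree.
ne = df_true != df_exp

# For each column, the row labels that the loop would pass as indexes_to_style.
changed = {col: ne[ne[col]].index.tolist() for col in ne.columns}
print(changed)  # {'A': [], 'B': [1], 'C': [0]}
```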

Output Excel files:

【Discussion】: