Overlaying multiple histograms using pandas

【Question title】: Overlaying multiple histograms using pandas
【Posted】: 2013-10-17 07:15:21
【Question description】:

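I have two or three csv files with the same header, and I would like to draw the histograms of each column overlaid on one another in the same plot.

The following code gives me two separate figures, each containing all the histograms for one of the files. Is there a compact way to plot them together on the same figure using pandas/matplotlib? I imagine something close to this, but using dataframes.

Code: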

import pandas as pd
import matplotlib.pyplot as plt

df =  pd.read_csv('input1.csv')
df2 = pd.read_csv('input2.csv')
df.hist(bins=20)
df2.hist(bins=20)

plt.show()

【Question discussion】:

Tags: python matplotlib statistics pandas

【Solution 1】:

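Phillip Cloud's answer already solves the main problem of overlaying, in a single figure, the histograms of two (or more) dataframes that contain the same variables.

This answer addresses the follow-up issue raised by the author of the question (in the comments on the accepted answer): how to enforce the same number of bins and the same range for the variables common to both dataframes. This can be done by creating a list of bin edges shared by all variables of both dataframes. In fact, this answer goes a little further by adjusting the plots for cases where the variables in each dataframe cover slightly different ranges (but still within the same order of magnitude), as illustrated in the following example: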

import numpy as np                   # v 1.19.2
import pandas as pd                  # v 1.1.3
import matplotlib.pyplot as plt      # v 3.3.2
from matplotlib.lines import Line2D

# Set seed for random data
rng = np.random.default_rng(seed=1)

# Create two similar dataframes each containing two random variables,
# with df2 twice the size of df1
df1_size = 1000
df1 = pd.DataFrame(dict(var1 = rng.exponential(scale=1.0, size=df1_size),
                        var2 = rng.normal(loc=40, scale=5, size=df1_size)))
df2_size = 2*df1_size
df2 = pd.DataFrame(dict(var1 = rng.exponential(scale=2.0, size=df2_size),
                        var2 = rng.normal(loc=50, scale=10, size=df2_size)))

# Combine the dataframes to extract the min/max values of each variable
df_combined = pd.concat([df1, df2])
vars_min = [df_combined[var].min() for var in df_combined]
vars_max = [df_combined[var].max() for var in df_combined]

# Create custom bins based on the min/max of all values from both
# dataframes to ensure that in each histogram the bins are aligned
# making them easily comparable
nbins = 30
bin_edges, step = np.linspace(min(vars_min), max(vars_max), nbins+1, retstep=True)

# Create figure by combining the outputs of two pandas df.hist() function
# calls using the 'step' type of histogram to improve plot readability
htype = 'step'
alpha = 0.7
lw = 2
axs = df1.hist(figsize=(10,4), bins=bin_edges, histtype=htype,
               linewidth=lw, alpha=alpha, label='df1')
df2.hist(ax=axs.flatten(), grid=False, bins=bin_edges, histtype=htype,
         linewidth=lw, alpha=alpha, label='df2')

# Adjust x-axes limits based on min/max values and step between bins, and
# remove top/right spines: if, contrary to this example dataset, var1 and
# var2 cover the same range, setting the x-axes limits with this loop is
# not necessary
for ax, v_min, v_max in zip(axs.flatten(), vars_min, vars_max):
    ax.set_xlim(v_min-2*step, v_max+2*step)
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)

# Edit legend to get lines as legend keys instead of the default polygons:
# use legend handles and labels from any of the axes in the axs object
# (here taken from first one) seeing as the legend box is by default only
# shown in the last subplot when using the plt.legend() function.
handles, labels = axs.flatten()[0].get_legend_handles_labels()
lines = [Line2D([0], [0], lw=lw, color=h.get_facecolor()[:-1], alpha=alpha)
         for h in handles]
plt.legend(lines, labels, frameon=False)

plt.suptitle('Pandas', x=0.5, y=1.1, fontsize=14)
plt.show()

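It is worth noting that the seaborn package provides a more convenient way to create this kind of plot: unlike with pandas, the bins are aligned automatically. The only downside is that the dataframes must first be combined and reshaped to long format, as shown in this example, which uses the same dataframes and bins as before: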

import seaborn as sns    # v 0.11.0

# Combine dataframes and convert the combined dataframe to long format
df_concat = pd.concat([df1, df2], keys=['df1','df2']).reset_index(level=0)
df_melt = df_concat.melt(id_vars='level_0', var_name='var_id')

# Create figure using seaborn displot: note that the bins are automatically
# aligned thanks to the 'common_bins' parameter of the seaborn histplot
# function (called here with kind='hist'), which is True by default. Here, the
# bins from the previous example are used to make the figures more comparable.
# Also note that the facets share the same x and y axes by default; this can
# be changed when var1 and var2 have different ranges and different
# distribution shapes, as is the case in this example.
g = sns.displot(df_melt, kind='hist', x='value', col='var_id', hue='level_0',
                element='step', bins=bin_edges, fill=False, height=4,
                facet_kws=dict(sharex=False, sharey=False))

# For some reason setting sharex as above does not automatically adjust the
# x-axes limits (even when not setting a bins argument, maybe due to a bug
# with this package version), which is why this is done in the following loop.
# Note that you still need to set 'sharex=False' in displot, or else
# 'ax.set_xlim' will have no effect.
for ax, v_min, v_max in zip(g.axes.flatten(), vars_min, vars_max):
    ax.set_xlim(v_min-2*step, v_max+2*step)

# Additional formatting
g.legend.set_bbox_to_anchor((.9, 0.75))
g.legend.set_title('')
plt.suptitle('Seaborn', x=0.5, y=1.1, fontsize=14)

plt.show()

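You may notice that the histogram lines are cut off at the limits of the list of bin edges (not visible on the maximum side because of the scale). To get lines more similar to those in the pandas example, an empty bin can be added at each end of the list of bins, like this: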

bin_edges = np.insert(bin_edges, 0, bin_edges.min()-step)
bin_edges = np.append(bin_edges, bin_edges.max()+step)

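This example also illustrates the limits of this approach of setting common bins for both facets. Because the ranges of var1 and var2 are somewhat different and 30 bins are used to cover the combined range, the histogram for var1 contains rather few bins while the histogram for var2 has slightly more bins than necessary. To my knowledge, there is no straightforward way to assign a different list of bins to each facet when calling the plotting functions df.hist() and displot(df). So for cases where the variables cover significantly different ranges, these figures would have to be created from scratch using matplotlib or some other plotting library.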
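As a rough illustration of the "from scratch" route mentioned above, here is a minimal sketch in plain matplotlib that computes a separate list of bin edges for each variable from the pooled values of all dataframes. The helper name overlay_hists_per_variable, its parameters, and the sample data are assumptions made for illustration and are not part of the original answer. It also assumes that all dataframes share the same columns and contain no NaN values.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless;
                       # drop this line in an interactive session
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def overlay_hists_per_variable(frames, nbins=30, figsize=(10, 4)):
    """Overlay histograms of several dataframes, one subplot per column.

    `frames` maps a label to a dataframe (hypothetical helper; all
    dataframes are assumed to have the same columns). Bin edges are
    computed per column from the pooled values, so the histograms in each
    subplot are aligned while each variable keeps its own range.
    """
    columns = list(next(iter(frames.values())).columns)
    fig, axs = plt.subplots(1, len(columns), figsize=figsize, squeeze=False)
    edges_by_col = {}
    for ax, col in zip(axs.flat, columns):
        # Pool the values of this column across all dataframes
        pooled = np.concatenate([df[col].to_numpy() for df in frames.values()])
        edges = np.histogram_bin_edges(pooled, bins=nbins)
        edges_by_col[col] = edges
        for label, df in frames.items():
            ax.hist(df[col], bins=edges, histtype="step", linewidth=2,
                    alpha=0.7, label=label)
        ax.set_title(col)
    axs.flat[-1].legend(frameon=False)
    return fig, axs, edges_by_col


# Sample data mirroring the example above (illustrative only)
rng = np.random.default_rng(seed=1)
df1 = pd.DataFrame({"var1": rng.exponential(scale=1.0, size=1000),
                    "var2": rng.normal(loc=40, scale=5, size=1000)})
df2 = pd.DataFrame({"var1": rng.exponential(scale=2.0, size=2000),
                    "var2": rng.normal(loc=50, scale=10, size=2000)})

fig, axs, edges_by_col = overlay_hists_per_variable({"df1": df1, "df2": df2})
```

In an interactive session, calling plt.show() after this displays the figure. Since the edges are computed per column, var1 and var2 each get the full 30 bins over their own range rather than sharing 30 bins over the combined range.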

【Discussion】:

【Solution 2】:

In [18]: from pandas import DataFrame

In [19]: from numpy.random import randn

In [20]: df = DataFrame(randn(10, 2))

In [21]: df2 = DataFrame(randn(10, 2))

In [22]: axs = df.hist()

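In [23]: for ax, (colname, values) in zip(axs.flat, df2.items()):  # items() replaces iteritems(), removed in pandas 2.0
   ....:     values.hist(ax=ax, bins=10)
   ....:

In [24]: draw()  # available in IPython's pylab mode; otherwise use matplotlib.pyplot.draw() or plt.show()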

This draws the histograms of df2 over those of df in the same subplots.

【Discussion】:

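Cool. This looks like what I wanted to see! Is there a way to enforce the same number of bins (and range) for the two dataframes? I manually set 20 bins for my df, but df2 may have a different range, so the image looks weird.
The hist method of Series (the type of values) can be called with the bins keyword argument. I'll add that to the answer.
Nice. I didn't think of that. However, since the ranges are different, I still don't get what I wanted (i.e. link). I think this might not be so simple to do.
Oh, I see. I think you want to share the x-axis and possibly the y-axis. Check out this example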
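The example linked in the last comment is not preserved here. As a stand-in, the following is a minimal sketch of the idea, assuming the goal is subplots with shared x and y axes plus bin edges common to both dataframes. The column names, sample data, and bin count are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless;
                       # drop this line in an interactive session
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Illustrative data: df2 deliberately covers a wider range than df
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((100, 2)), columns=["a", "b"])
df2 = pd.DataFrame(rng.standard_normal((100, 2)) * 2 + 1, columns=["a", "b"])

# Common bin edges spanning all values of both dataframes
pooled = np.concatenate([df.to_numpy().ravel(), df2.to_numpy().ravel()])
bin_edges = np.histogram_bin_edges(pooled, bins=20)

# One subplot per column, with x and y axes shared across subplots
fig, axs = plt.subplots(1, df.shape[1], sharex=True, sharey=True,
                        figsize=(8, 3))
for ax, col in zip(axs, df.columns):
    df[col].hist(ax=ax, bins=bin_edges, alpha=0.5, label="df")
    df2[col].hist(ax=ax, bins=bin_edges, alpha=0.5, label="df2")
    ax.set_title(col)
axs[0].legend()
```

Passing the same bin_edges array to every Series.hist call keeps the bars aligned, and sharex/sharey keeps the axis limits identical across subplots, so the two dataframes can be compared directly.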