【Question title】: Overlaying multiple histograms using pandas
【Posted】: 2013-10-17 07:15:21
【Question】:

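I have two or three csv files with the same header, and I would like to draw the histograms of each column overlaying one another on the same plot.

The following code gives me two separate figures, each containing all the histograms of one file. Is there a compact way to plot them together in the same figure using pandas/matplotlib? I imagine something close to this, but using dataframes.

Code: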

import pandas as pd
import matplotlib.pyplot as plt

df =  pd.read_csv('input1.csv')
df2 = pd.read_csv('input2.csv')
df.hist(bins=20)
df2.hist(bins=20)

plt.show()

【Comments】:

    Tags: python matplotlib statistics pandas


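    【Answer 1】:

    Phillip Cloud's answer already solves the main problem of overlaying, in a single figure, the histograms of two (or more) dataframes that contain the same variables.

    This answer addresses the follow-up raised by the question's author (in the comments on the accepted answer) about how to enforce the same number of bins and the same range for the variables common to both dataframes. This can be done by creating a list of bin edges shared by all variables of both dataframes. In fact, this answer goes a step further by adjusting the plots for cases where the different variables in each dataframe cover slightly different ranges (but still within the same order of magnitude), as in the following example: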

    import numpy as np                   # v 1.19.2
    import pandas as pd                  # v 1.1.3
    import matplotlib.pyplot as plt      # v 3.3.2
    from matplotlib.lines import Line2D
    
    # Set seed for random data
    rng = np.random.default_rng(seed=1)
    
    # Create two similar dataframes each containing two random variables,
    # with df2 twice the size of df1
    df1_size = 1000
    df1 = pd.DataFrame(dict(var1 = rng.exponential(scale=1.0, size=df1_size),
                            var2 = rng.normal(loc=40, scale=5, size=df1_size)))
    df2_size = 2*df1_size
    df2 = pd.DataFrame(dict(var1 = rng.exponential(scale=2.0, size=df2_size),
                            var2 = rng.normal(loc=50, scale=10, size=df2_size)))
    
    # Combine the dataframes to extract the min/max values of each variable
    df_combined = pd.concat([df1, df2])
    vars_min = [df_combined[var].min() for var in df_combined]
    vars_max = [df_combined[var].max() for var in df_combined]
    
    # Create custom bins based on the min/max of all values from both
    # dataframes to ensure that in each histogram the bins are aligned
    # making them easily comparable
    nbins = 30
    bin_edges, step = np.linspace(min(vars_min), max(vars_max), nbins+1, retstep=True)
    
    # Create figure by combining the outputs of two pandas df.hist() function
    # calls using the 'step' type of histogram to improve plot readability
    htype = 'step'
    alpha = 0.7
    lw = 2
    axs = df1.hist(figsize=(10,4), bins=bin_edges, histtype=htype,
                   linewidth=lw, alpha=alpha, label='df1')
    df2.hist(ax=axs.flatten(), grid=False, bins=bin_edges, histtype=htype,
             linewidth=lw, alpha=alpha, label='df2')
    
    # Adjust x-axes limits based on min/max values and step between bins, and
    # remove top/right spines: if, contrary to this example dataset, var1 and
    # var2 cover the same range, setting the x-axes limits with this loop is
    # not necessary
    for ax, v_min, v_max in zip(axs.flatten(), vars_min, vars_max):
        ax.set_xlim(v_min-2*step, v_max+2*step)
        ax.spines['top'].set_visible(False)
        ax.spines['right'].set_visible(False)
    
    # Edit legend to get lines as legend keys instead of the default polygons:
    # use legend handles and labels from any of the axes in the axs object
    # (here taken from first one) seeing as the legend box is by default only
    # shown in the last subplot when using the plt.legend() function.
    handles, labels = axs.flatten()[0].get_legend_handles_labels()
    lines = [Line2D([0], [0], lw=lw, color=h.get_facecolor()[:-1], alpha=alpha)
             for h in handles]
    plt.legend(lines, labels, frameon=False)
    
    plt.suptitle('Pandas', x=0.5, y=1.1, fontsize=14)
    plt.show()
    

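    It is worth noting that the seaborn package provides a more convenient way to create this kind of plot, in which, contrary to pandas, the bins are aligned automatically. The only downside is that the dataframes must first be combined and reshaped to long format, as shown in this example using the same dataframes and bins as before: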

    import seaborn as sns    # v 0.11.0
    
    # Combine dataframes and convert the combined dataframe to long format
    df_concat = pd.concat([df1, df2], keys=['df1','df2']).reset_index(level=0)
    df_melt = df_concat.melt(id_vars='level_0', var_name='var_id')
    
    # Create figure using seaborn displot: note that the bins are automatically
    # aligned thanks to the 'common_bins' parameter of the seaborn histplot function
    # (called here with 'kind='hist'') that is set to True by default. Here, the
    # bins from the previous example are used to make the figures more comparable.
    # Also note that the facets share the same x and y axes by default, this can
    # be changed when var1 and var2 have different ranges and different
    # distribution shapes, as it is the case in this example.
    g = sns.displot(df_melt, kind='hist', x='value', col='var_id', hue='level_0',
                    element='step', bins=bin_edges, fill=False, height=4,
                    facet_kws=dict(sharex=False, sharey=False))
    
    # For some reason setting sharex as above does not automatically adjust the
    # x-axes limits (even when not setting a bins argument, maybe due to a bug
    # with this package version) which is why this is done in the following loop,
    # but note that you still need to set 'sharex=False' in displot, or else
    # 'ax.set_xlim' will have no effect.
    for ax, v_min, v_max in zip(g.axes.flatten(), vars_min, vars_max):
        ax.set_xlim(v_min-2*step, v_max+2*step)
    
    # Additional formatting
    g.legend.set_bbox_to_anchor((.9, 0.75))
    g.legend.set_title('')
    plt.suptitle('Seaborn', x=0.5, y=1.1, fontsize=14)
    
    plt.show()
    

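    You may notice that the histogram lines are cut off at the limits of the list of bin edges (not visible on the maximum side because of the scale). To get lines more similar to the pandas example, an empty bin can be added at each end of the bin list, like this: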

    bin_edges = np.insert(bin_edges, 0, bin_edges.min()-step)
    bin_edges = np.append(bin_edges, bin_edges.max()+step)
    

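    This example also illustrates the limits of this approach of setting common bins for both facets. Because var1 and var2 cover somewhat different ranges and 30 bins are used to span the combined range, the histogram for var1 contains rather few bins, while the histogram for var2 has slightly more bins than necessary. To my knowledge, there is no straightforward way of assigning a different list of bins to each facet when calling the plotting functions df.hist() or displot(df). So for cases where the variables cover significantly different ranges, these figures would have to be created from scratch using matplotlib or another plotting library.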
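
The "from scratch" route mentioned above could look roughly like the following. This is only a minimal sketch, not part of the original answer: the helper name `overlay_hists` and the stand-in data are hypothetical, and it assumes that every dataframe shares the same numeric columns. Bin edges are computed separately for each column from the pooled values, so each subplot gets its own 30 bins.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend for the sketch; drop for interactive use
import matplotlib.pyplot as plt

# Hypothetical stand-in data, similar in spirit to df1/df2 above
rng = np.random.default_rng(seed=1)
df1 = pd.DataFrame({"var1": rng.exponential(1.0, 1000),
                    "var2": rng.normal(40, 5, 1000)})
df2 = pd.DataFrame({"var1": rng.exponential(2.0, 2000),
                    "var2": rng.normal(50, 10, 2000)})

def overlay_hists(frames, labels, nbins=30):
    """One subplot per shared column, with bin edges computed per column."""
    columns = [c for c in frames[0].columns
               if all(c in f.columns for f in frames)]
    fig, axs = plt.subplots(1, len(columns),
                            figsize=(5 * len(columns), 4), squeeze=False)
    edges_by_col = {}
    for ax, col in zip(axs.flat, columns):
        # Pool the values of this column from all frames to get its range
        pooled = np.concatenate([f[col].dropna().to_numpy() for f in frames])
        edges = np.histogram_bin_edges(pooled, bins=nbins)
        edges_by_col[col] = edges
        for f, label in zip(frames, labels):
            ax.hist(f[col].dropna(), bins=edges, histtype="step",
                    linewidth=2, alpha=0.7, label=label)
        ax.set_title(col)
    axs.flat[0].legend(frameon=False)
    return fig, edges_by_col

fig, edges_by_col = overlay_hists([df1, df2], ["df1", "df2"])
```

Because the edges are derived per column, var1 and var2 each use the full number of bins over their own range, at the cost of the bins no longer being comparable across subplots.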

    【Comments】:

      【Answer 2】:
      In [18]: from pandas import DataFrame
      
      In [19]: from numpy.random import randn
      
      In [20]: df = DataFrame(randn(10, 2))
      
      In [21]: df2 = DataFrame(randn(10, 2))
      
      In [22]: axs = df.hist()
      
      In [23]: for ax, (colname, values) in zip(axs.flat, df2.items()):
         ....:     values.hist(ax=ax, bins=10)
         ....:
      
      In [24]: draw()
      

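      【Comments】:

      • Cool. This looks like what I wanted to see! Is there any way to force the same number of bins (and the same range) for both dataframes? I manually set 20 bins for my df, but df2 may have a different range, so the image looks weird.
      • The hist method of Series (the type of values) can be called with the bins keyword argument. I'll add that to the answer.
      • Nice. I hadn't thought of that. However, because the ranges differ, I still don't get what I want (i.e. link). I think this may not be so simple to do.
      • Oh, I see. I think you want to share the x-axis and possibly the y-axis. Check out this example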
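
The request in the comments (the same number of bins and the same range for both dataframes) can be sketched as below. This is an illustrative assumption about what was wanted, not code from the thread: the names `nbins` and `shared_edges` and the random data are made up, and the bin edges are built per column from the combined min/max of df and df2.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend for the sketch; drop for interactive use
import matplotlib.pyplot as plt

# Hypothetical data: df2 deliberately covers a wider range than df
rng = np.random.default_rng(seed=0)
df = pd.DataFrame(rng.standard_normal((10, 2)))
df2 = pd.DataFrame(rng.standard_normal((10, 2)) * 2 + 1)

nbins = 10
fig, axs = plt.subplots(1, df.shape[1], figsize=(8, 3), squeeze=False)
shared_edges = []
for ax, col in zip(axs.flat, df.columns):
    # Common range for this column across both dataframes
    lo = min(df[col].min(), df2[col].min())
    hi = max(df[col].max(), df2[col].max())
    edges = np.linspace(lo, hi, nbins + 1)
    shared_edges.append(edges)
    # Passing explicit edges makes both histograms use identical bins
    df[col].hist(ax=ax, bins=edges, alpha=0.5, label="df")
    df2[col].hist(ax=ax, bins=edges, alpha=0.5, label="df2")
    ax.set_title(str(col))
    ax.legend()
```

Passing an array of edges to `bins` fixes both the bin count and the range, so the bars of the two dataframes line up in each subplot.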