Can't get y-axis on Matplotlib histogram to display probabilities

【Question title】: Can't get y-axis on Matplotlib histogram to display probabilities
【Posted】: 2016-07-29 04:31:41
【Question】:

My data (a pd Series) looks like this (daily stock returns, n = 555):

S = perf_manual.returns
S = S[~((S-S.mean()).abs()>3*S.std())]

2014-03-31 20:00:00    0.000000
2014-04-01 20:00:00    0.000000
2014-04-03 20:00:00   -0.001950
2014-04-04 20:00:00   -0.000538
2014-04-07 20:00:00    0.000764
2014-04-08 20:00:00    0.000803
2014-04-09 20:00:00    0.001961
2014-04-10 20:00:00    0.040530
2014-04-11 20:00:00   -0.032319
2014-04-14 20:00:00   -0.008512
2014-04-15 20:00:00   -0.034109
...

I would like to generate a probability distribution plot from this. Using:

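import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

print(stats.normaltest(S))

# density=True replaces the normed=1 argument, which was removed from Matplotlib
n, bins, patches = plt.hist(S, 100, density=True, facecolor='blue', alpha=0.75)
print(np.sum(n * np.diff(bins)))

(mu, sigma) = stats.norm.fit(S)
print(mu, sigma)
# stats.norm.pdf replaces mlab.normpdf, which was removed from Matplotlib
y = stats.norm.pdf(bins, mu, sigma)
plt.grid(True)
l = plt.plot(bins, y, 'r', linewidth=2)

plt.xlim(-0.05, 0.05)
plt.show()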

I get the following:

NormaltestResult(statistic=66.587382579416982, pvalue=3.473230376732532e-15)
1.0
0.000495624926242 0.0118790391467

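I'm under the impression that the y-axis is a count, but I'd like to have probabilities instead. How do I do that? I've tried a whole bunch of StackOverflow answers and can't figure this out.

【Question comments】:

Are you sure these are counts? I would guess they are probability density values, since your plot is normalized to 1 when you integrate it. Your x-values span a very small range.
Possibly, probability densities are not my strong suit. How can I at least turn these into percentages?
Which percentage do you want? For each bin, the probability that a data point falls in that bin? A probability density basically means that integrating the density over some x-range gives you the probability of that range.
Yes, the probability that a data point is in a bin.
Have you looked at seaborn? It has several built-in composite plots that might contain what you are looking for (once you have worked out what your data means).

Tags: python matplotlib histogram probability-density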

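【Answer 1】:

There is no simple way (that I know of) to do this using plt.hist. But you can simply bin the data with np.histogram and then normalize it any way you want. If I understood you correctly, you want the plot to show the probability of finding a point in a given bin, rather than the probability density. That means you have to scale your data so that the sum over all bins is 1. This can be done simply with bin_probability = n/float(n.sum()).

You then no longer have a properly normalized probability density function (pdf), which means that the integral over an interval is no longer a probability! That is why you have to rescale the output of mlab.normpdf so that it has the same normalization as your histogram. The factor needed is just the bin width: for a properly normalized binned pdf, the sum over all bins of height times width is 1, whereas now you want the sum of the bin heights alone to be 1. So the scaling factor is the bin width.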
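
The bin-width relationship described above can be checked numerically. This is only a minimal sketch on synthetic data; the seed, the sample size, and the names counts and density are illustrative assumptions and do not come from the question.

```python
import numpy as np

# Synthetic stand-in for the daily returns (assumption: roughly normal, sigma = 0.01)
rng = np.random.default_rng(0)
S = rng.normal(0, 0.01, size=1000)

# Same binning twice: raw counts and density-normalized heights
counts, bin_edges = np.histogram(S, bins=100)
density, _ = np.histogram(S, bins=100, density=True)
bin_width = np.diff(bin_edges)  # one width per bin (all equal here)

# Per-bin probability: fraction of points falling into each bin
bin_probability = counts / counts.sum()

# The density integrates to 1 (sum of height times width)
print(np.sum(density * bin_width))
# The per-bin probabilities sum to 1 (sum of heights alone)
print(bin_probability.sum())
# Multiplying the density by the bin width gives the per-bin probability
print(np.allclose(density * bin_width, bin_probability))
# With such a narrow x-range the density heights are far above 1
print(density.max() > 1)
```

The last check is why the y-axis in the question shows values well above 1 even though the histogram is correctly normalized: those values are densities, not probabilities.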

So you end up with code similar to this:

import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

# Produce test data
S = np.random.normal(0, 0.01, size=1000)

# Histogram:
# Bin it
n, bin_edges = np.histogram(S, 100)
# Normalize it, so that every bin's value gives the probability of that bin
bin_probability = n/float(n.sum())
# Get the mid points of every bin
bin_middles = (bin_edges[1:]+bin_edges[:-1])/2.
# Compute the bin width
bin_width = bin_edges[1]-bin_edges[0]
# Plot the histogram as a bar plot
plt.bar(bin_middles, bin_probability, width=bin_width)

# Fit to normal distribution
(mu, sigma) = stats.norm.fit(S)
# The pdf should no longer be normalized but scaled the same way as the data
# (stats.norm.pdf stands in for mlab.normpdf, which was removed from Matplotlib)
y = stats.norm.pdf(bin_middles, mu, sigma)*bin_width
l = plt.plot(bin_middles, y, 'r', linewidth=2)

plt.grid(True)
plt.xlim(-0.05,0.05)
plt.show()

The resulting plot would be: [image]

【Comments】:

Thank you for the help and for clearing up my confusion :)

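【Answer 2】:

jotasi's answer certainly works, but I'd like to add a very simple trick that achieves this with a direct call to hist.

The trick is to use the weights argument. By default, every data point you pass has a weight of 1, and the height of each bin is the sum of the weights of the data points that fall into it. If instead we have n points, we can simply give each point a weight of 1 / n. Then the sum of the weights of the points that fall into a bin is also the probability that a given point is in that bin.

In your case, simply change the plot line to: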

n, bins, patches = plt.hist(S, weights=np.ones_like(S) / len(S),
                            facecolor='blue', alpha=0.75)
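
To see that the weights trick really yields per-bin probabilities, the bar heights returned by hist can be compared with the raw counts divided by the number of points. The following is a rough sketch on synthetic data; the seed, the sample, the choice of 100 bins, and the Agg backend line are assumptions made for illustration, not part of the answer.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in for the daily returns
rng = np.random.default_rng(0)
S = rng.normal(0, 0.01, size=1000)

# Each point carries weight 1/len(S), so bar heights are fractions of the sample
weights = np.ones_like(S) / len(S)
n, bins, patches = plt.hist(S, bins=100, weights=weights,
                            facecolor='blue', alpha=0.75)

# Reference: raw counts over the same bin edges, divided by the sample size
counts, _ = np.histogram(S, bins=bins)

print(n.sum())                           # heights sum to 1 up to rounding
print(np.allclose(n, counts / len(S)))   # heights match counts / n
```

If a fitted normal curve is overlaid on a histogram drawn this way, it would presumably need the same bin-width scaling as in Answer 1, since the bars are no longer densities.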

【Comments】: