scipy normal distribution with scale greater and less than 1 [duplicate]: answers

【Question title】: scipy normal distribution with scale greater and less than 1 [duplicate]
【Posted】: 2020-04-08 01:11:21
【Question】:

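I'm working with SciPy's normal distribution and am having trouble understanding its documentation. Suppose I have a normal distribution with a mean of 5 and a standard deviation of 0.25: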

import numpy as np
from matplotlib import pyplot as plt
from scipy.stats import norm

mean = 5
std = 0.25

x = np.linspace(mean - 3*std, mean + 3*std, 1000)
y = norm(loc=mean, scale=std).pdf(x)
plt.plot(x,y)

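The resulting plot is the familiar bell curve, but it peaks at around 1.6. How can the probability of any value exceed 1? If I multiply it by scale, then the probabilities are correct.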

But there is no such problem when std (and scale) is greater than 1:

mean = 5
std = 10

x = np.linspace(mean - 3*std, mean + 3*std, 1000)
y = norm(loc=mean, scale=std).pdf(x)
plt.plot(x,y)

The documentation on norm says that loc is the mean and scale is the standard deviation. Why does it behave so strangely with scale greater than and less than 1?

Python 3.8.2, SciPy 1.4.1

【Comments】:

Tags: python scipy normal-distribution

【Answer 1】:

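The "bell curve" you are plotting is a probability density function (PDF). This means that the probability that a random variable with this distribution falls in any interval [a, b] is the area under the curve between a and b. The whole area under the curve (from -infinity to +infinity) must therefore be 1. With std = 0.25, nearly all of that area sits in an interval only about 1.5 wide (mean ± 3*std), so the curve has to rise above 1 to enclose it. So when the standard deviation is small, the maximum of the PDF may well be greater than 1; there is nothing strange about that.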
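
To make the "probability is an area" point concrete, here is a minimal sketch that computes the probability of an interval in two ways: from the CDF, and by numerically integrating the PDF with scipy.integrate.quad. The interval [4.75, 5.25] and the ±10*std integration range are choices made for this illustration, not something from the question.

```python
from scipy.integrate import quad
from scipy.stats import norm

mean, std = 5, 0.25
dist = norm(loc=mean, scale=std)

# Probability of landing in [a, b] is the area under the PDF between a and b.
a, b = 4.75, 5.25  # illustrative choice: one std either side of the mean
prob_cdf = dist.cdf(b) - dist.cdf(a)   # area via the cumulative distribution
prob_quad, _ = quad(dist.pdf, a, b)    # same area via numerical integration

# Total area, integrated over mean +/- 10*std (assumption: the tails beyond
# that range contribute a negligible amount).
total, _ = quad(dist.pdf, mean - 10 * std, mean + 10 * std)

print(prob_cdf, prob_quad)  # both about 0.683
print(total)                # very close to 1
```

Both routes give the same number, and it is a proper probability (below 1) even though the density itself rises above 1 near the mean.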

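Follow-up: is the area under the curve in the first plot really 1?

Yes, it is. One way to confirm this is to approximate the area under the curve by summing the areas of a series of rectangles whose heights are defined by the curve: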

import numpy as np
from matplotlib import pyplot as plt
from scipy.stats import norm
import matplotlib.patches as patches

mean = 5
std = 0.25

x = np.linspace(4, 6, 1000)
y = norm(loc=mean, scale=std).pdf(x)

fig, ax = plt.subplots()
ax.plot(x, y)
ax.set_aspect('equal')
ax.set_xlim([4, 6])
ax.set_ylim([0, 1.7])

# Approximate area under the curve by summing over rectangles:

xlim_approx = [4, 6]  # locations of left- and rightmost rectangle
n_approx = 17  # number of rectangles

# width of one rectangle:
width_approx = (xlim_approx[1] - xlim_approx[0]) / n_approx  
# x-locations of rectangles:
x_approx = np.linspace(xlim_approx[0], xlim_approx[1], n_approx)
# heights of rectangles:
y_approx = norm(loc=mean, scale=std).pdf(x_approx)

# plot approximation rectangles:
for i, xi in enumerate(x_approx):
    ax.add_patch(patches.Rectangle((xi - width_approx/2, 0), width_approx, 
                                   y_approx[i], facecolor='gray', alpha=.3))

# areas of the rectangles:
areas = y_approx * width_approx

# total area of the rectangles:
print(sum(areas))

0.9411599204607589

OK, this is not 1, but let's get a better approximation by extending the x limits and increasing the number of rectangles:

xlim_approx = [0, 10]
n_approx = 100_000

width_approx = (xlim_approx[1] - xlim_approx[0]) / n_approx
x_approx = np.linspace(xlim_approx[0], xlim_approx[1], n_approx)
y_approx = norm(loc=mean, scale=std).pdf(x_approx)

areas = y_approx * width_approx
print(sum(areas))

0.9999899999999875
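
One detail about the approximation above: np.linspace(a, b, n) places its n points (b - a)/(n - 1) apart, while the rectangles are (b - a)/n wide, so the sum comes out scaled by roughly (n - 1)/n (16/17 ≈ 0.9412 in the first run, 0.99999 in the second). A minimal sketch of a midpoint variant, where the rectangle width matches the spacing of the sample points; the helper name midpoint_area is made up for this illustration:

```python
import numpy as np
from scipy.stats import norm


def midpoint_area(pdf, lo, hi, n):
    """Approximate the area under pdf on [lo, hi] with n equal-width rectangles."""
    width = (hi - lo) / n
    # Rectangle centres: lo + width/2, lo + 3*width/2, ..., hi - width/2,
    # so adjacent rectangles touch without gaps or overlaps.
    centres = lo + width * (np.arange(n) + 0.5)
    return float(np.sum(pdf(centres) * width))


dist = norm(loc=5, scale=0.25)

coarse = midpoint_area(dist.pdf, 4, 6, 17)        # same setup as the first run
fine = midpoint_area(dist.pdf, 0, 10, 100_000)    # same setup as the second run
print(coarse, fine)
```

With matching widths, even the 17-rectangle version lands very close to 1; what is still missing there is the small tail area outside [4, 6], which covers only mean ± 4*std.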

【Discussion】:

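In the first plot, the whole area under the curve can't be 1. A PDF expresses the probability of an event with an exact statistic occurring, and all probabilities must be in [0, 1]. It peaks at 1.6 because scipy divides it by std (0.25). The question is why scipy does this for std = 0.25 but not for std = 10.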
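"A PDF expresses the probability of an event with an exact statistic occurring..." No, that is not what a PDF represents. It is a probability density, not a probability. The probability that a sample from the normal distribution is exactly 0.3 is 0. This is true for all continuous distributions: the probability of any given value is 0. To convince yourself, see the actual formula, given as f(x) on the wikipedia page. Note that f(μ) = 1/(σ*sqrt(2π)). Whenever σ < 1/sqrt(2π), f(μ) > 1. But the probability of x=μ is 0.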
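I stand by my answer. @Warren Weckesser has given a good explanation. Alternatively, you can see by approximation that the area under the curve is 1. I have extended my answer to show this.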
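
For anyone who wants to check the formula from the comments numerically, here is a rough sketch. It compares f(μ) = 1/(σ*sqrt(2π)) with what norm.pdf returns for the two scale values from the question, and then shows the probability of an ever-narrower interval around the mean shrinking toward 0. The helper peak_height and the chosen half-widths are illustrative only.

```python
import math

from scipy.stats import norm


def peak_height(sigma):
    """Normal density at its mean: 1 / (sigma * sqrt(2*pi))."""
    return 1.0 / (sigma * math.sqrt(2.0 * math.pi))


# The peak exceeds 1 exactly when sigma is below this value (about 0.399).
threshold = 1.0 / math.sqrt(2.0 * math.pi)

peaks = {}
for sigma in (0.25, 10):
    peaks[sigma] = float(norm(loc=5, scale=sigma).pdf(5))
    print(sigma, peaks[sigma], peak_height(sigma), sigma < threshold)

# The probability of an interval around the mean shrinks with its width,
# even where the density is at its highest.
dist = norm(loc=5, scale=0.25)
interval_probs = []
for half_width in (0.1, 0.01, 0.001):
    p = float(dist.cdf(5 + half_width) - dist.cdf(5 - half_width))
    interval_probs.append(p)
    print(half_width, p)
```

So scipy treats both cases identically: the same formula gives a peak of about 1.6 for std = 0.25 and about 0.04 for std = 10. Multiplying the PDF by scale just rescales it to the standard normal density, whose peak is about 0.399; it does not turn density values into probabilities.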