How to get the cumulative distribution function with NumPy? Answers

【Question title】: How to get the cumulative distribution function with NumPy?
【Posted】: 2012-05-25 07:51:19
【Question description】:

I want to create a CDF with NumPy. My code is the following:

histo = np.zeros(4096, dtype = np.int32)
for x in range(0, width):
   for y in range(0, height):
      histo[data[x][y]] += 1
q = 0
cdf = list()
for i in histo:
   q = q + i
   cdf.append(q)

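I walk through the array element by element, but the program takes a long time to run. There is a built-in function for this, isn't there?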
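
A minimal sketch of the built-in route the question asks about, assuming `data` is a 2-D integer array with values in 0..4095. The random `data` below is a hypothetical stand-in for the asker's image; `np.bincount` builds the histogram and `np.cumsum` the running total, so no Python loop is needed.

```python
import numpy as np

# Hypothetical stand-in for the asker's image: 12-bit integer pixel values.
rng = np.random.default_rng(0)
data = rng.integers(0, 4096, size=(64, 48))

# Count occurrences of each value 0..4095 without a Python loop.
histo = np.bincount(data.ravel(), minlength=4096)

# Running total of the counts; the last entry equals the number of pixels.
cdf = np.cumsum(histo)

# Optional: divide by the total to get a CDF that ends at 1.
cdf_normalized = cdf / cdf[-1]
```

Whether this is fast enough depends on the image size, but both calls run in compiled code rather than in the interpreter.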

【Question comments】:

Tags: python numpy histogram

【Solution 1】:

To complement Dan's solution: if your sample contains several identical values, you can use numpy.unique:

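import numpy as np
import matplotlib.pyplot as plt

Z = np.array([1,1,1,2,2,4,5,6,6,6,7,8,8])
# X: sorted distinct values; F: index of the first occurrence of each value in Z
X, F = np.unique(Z, return_index=True)
F=F/X.size

plt.plot(X, F)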

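【Comments】:

This gives you F values greater than 1. Perhaps you meant to use F = F / float(F.max()) (also keep in mind that integer division will cause problems for people using Python 2.x).
This answer is old; thank you for your comments and answers. I see my basic approach from three years ago in every answer.
@Alex This is not quite right, because for entries that occur more than once the curve should rise by more than 1/N. You are right that my solution is only correct for the last such occurrence, but it will plot correctly.
In principle you are using counts, but Python uses zero-based indexing in F, so perhaps you meant (F + 1) / (F[-1] + 1)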
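
One way to sidestep the indexing issue raised in these comments is to count occurrences directly. The following is a minimal sketch, not the answer author's code: it assumes the `return_counts` option of `np.unique` and normalizes by the total sample size `Z.size` rather than by the number of distinct values.

```python
import numpy as np

Z = np.array([1, 1, 1, 2, 2, 4, 5, 6, 6, 6, 7, 8, 8])

# X holds the distinct values, counts how often each one occurs.
X, counts = np.unique(Z, return_counts=True)

# Fraction of samples less than or equal to each distinct value.
F = np.cumsum(counts) / Z.size
```

With this normalization F rises by k/N at a value that occurs k times and ends at exactly 1.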

【Solution 2】:

I'm not sure whether there is a ready-made answer, but an exact way to do it is to define a function such as:

import numpy as np

def _cdf(x, data):
    return np.sum(x > data)  # number of entries in data smaller than x

This will be quite fast.
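
If the function has to be evaluated at many points, sorting once and using `np.searchsorted` may be cheaper than comparing against the whole array each time. A minimal sketch under that assumption; `ecdf_at` is a hypothetical helper name, and it returns a fraction rather than a raw count:

```python
import numpy as np

def ecdf_at(x, data):
    """Fraction of entries in data that are strictly smaller than x."""
    sorted_data = np.sort(data)
    # side="left" returns, for each x, the number of entries < x.
    counts = np.searchsorted(sorted_data, x, side="left")
    return counts / sorted_data.size

data = np.array([1, 1, 1, 2, 2, 4, 5, 6, 6, 6, 7, 8, 8])
values = ecdf_at(np.array([0, 2, 6.5, 9]), data)
```

Here `values` holds 0, 3/13, 10/13 and 1: the fraction of the 13 samples that lie strictly below each query point.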

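【Comments】:

【Solution 3】:

Using a histogram is one solution, but it involves binning the data. That is not necessary for plotting the CDF of empirical data. Let F(x) be the number of entries smaller than x; it then goes up by one exactly where we see a measurement. So if we sort the samples, increment the count by one at each point (or the fraction by 1/N), and plot one against the other, we see the "exact" (i.e. un-binned) empirical CDF.

The following code sample demonstrates the method: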

import numpy as np
import matplotlib.pyplot as plt

N = 100
Z = np.random.normal(size = N)
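# method 1: binned histogram (density=True replaces the deprecated normed=True)
H,X1 = np.histogram( Z, bins = 10, density = True )
dx = X1[1] - X1[0]
F1 = np.cumsum(H)*dx
# method 2: sort the samples; the i-th sorted sample has i samples below it
X2 = np.sort(Z)
F2 = np.arange(N)/float(N)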

plt.plot(X1[1:], F1)
plt.plot(X2, F2)
plt.show()

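The output is a plot of the two curves (figure not reproduced here).

【Comments】:

According to the numpy.histogram documentation, normed is equivalent to the density argument but produces incorrect results for unequal bin widths. Changed in version 1.15.0: DeprecationWarnings are actually emitted.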
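
To illustrate the point about unequal bin widths, here is a minimal sketch with `density=True`. The bin edges are hypothetical and deliberately uneven, so each bin is weighted by its own width from `np.diff` instead of a single `dx`.

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=1000)

# Hypothetical, deliberately unequal bin edges.
edges = np.array([-5.0, -1.0, -0.5, 0.0, 0.25, 1.0, 5.0])
density, edges = np.histogram(Z, bins=edges, density=True)

# Each bin contributes density * width; the widths differ, so one dx will not do.
cdf = np.cumsum(density * np.diff(edges))
```

Because the density is normalized over the samples that fall inside the edges, the last entry of `cdf` is 1 up to floating-point rounding.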

【Solution 4】:

An update for numpy version 1.9.0: user545424's answer does not work in 1.9.0. This works:

>>> import numpy as np
>>> arr = np.random.randint(0,10,100)
>>> hist, bin_edges = np.histogram(arr, density=True)
>>> hist
array([ 0.1       ,  0.11111111,  0.11111111,  0.08888889,  0.08888889,
    0.15555556,  0.11111111,  0.13333333,  0.1       ,  0.11111111])
>>> bin_edges
array([ 0. ,  0.9,  1.8,  2.7,  3.6,  4.5,  5.4,  6.3,  7.2,  8.1,  9. ])
>>> np.diff(bin_edges)
array([ 0.9,  0.9,  0.9,  0.9,  0.9,  0.9,  0.9,  0.9,  0.9,  0.9])
>>> np.diff(bin_edges)*hist
array([ 0.09,  0.1 ,  0.1 ,  0.08,  0.08,  0.14,  0.1 ,  0.12,  0.09,  0.1 ])
>>> cdf = np.cumsum(hist*np.diff(bin_edges))
>>> cdf
array([ 0.09,  0.19,  0.29,  0.37,  0.45,  0.59,  0.69,  0.81,  0.9 ,  1.  ])

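【Comments】:

user12287, I felt it was odd to edit someone else's answer. Besides, the answer differs between versions.

【Solution 5】:

I'm not really sure what your code is doing, but if you have the hist and bin_edges arrays returned by numpy.histogram, you can use numpy.cumsum to generate a cumulative sum of the histogram contents.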

>>> import numpy as np
>>> hist, bin_edges = np.histogram(np.random.randint(0,10,100), normed=True)
>>> bin_edges
array([ 0. ,  0.9,  1.8,  2.7,  3.6,  4.5,  5.4,  6.3,  7.2,  8.1,  9. ])
>>> hist
array([ 0.14444444,  0.11111111,  0.11111111,  0.1       ,  0.1       ,
        0.14444444,  0.14444444,  0.08888889,  0.03333333,  0.13333333])
>>> np.cumsum(hist)
array([ 0.14444444,  0.25555556,  0.36666667,  0.46666667,  0.56666667,
        0.71111111,  0.85555556,  0.94444444,  0.97777778,  1.11111111])
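
The cumulative sum in the session above ends at about 1.11 rather than 1 because the normalized values are densities per unit width and each bin here is 0.9 wide (1 / 0.9 ≈ 1.11). One way to avoid the issue, sketched below with hypothetical variable names, is to accumulate raw counts and divide by their total:

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.integers(0, 10, size=100)

# Raw counts per bin (no density/normed argument).
counts, bin_edges = np.histogram(samples)

# Cumulative fraction of samples up to the right edge of each bin.
cdf = np.cumsum(counts) / counts.sum()
```

This does not depend on the bin widths, so it also behaves sensibly when the bins are unequal.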

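【Comments】:

However, this introduces a binning step, which is unnecessary for a cumulative distribution.
"This keyword, normed, is deprecated in Numpy 1.6 due to confusing/buggy behavior. It will be removed in Numpy 2.0." There is also a bug in the code if the bins are not in […]. Add x=np.cumsum(hist); x=(x - x.min()) / x.ptp()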