Different integration results using Monte Carlo vs scipy.integrate.nquad

【Question title】: Different integration results using Monte Carlo vs scipy.integrate.nquad
【Posted】: 2016-06-24 11:21:56
【Question】:

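The MWE below shows two ways of integrating the same 2D kernel density estimate, obtained for this data with the stats.gaussian_kde() function.

The integration is performed over all (x, y) below the threshold point (x1, y1), which defines the upper integration limits (the lower limits are -infinity; see the MWE).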

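The int1 function uses a simple Monte Carlo approach.
The int2 function uses the scipy.integrate.nquad function.

The issue is that int1 (i.e. the Monte Carlo method) systematically gives larger values for the integral than int2. I don't know why this happens.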

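Here is an example of the integral values obtained after 200 runs of int1 (blue histogram) compared with the result given by int2 (red vertical line):

[figure: histogram of int1 values with the int2 result marked]

What is the origin of this difference in the resulting integral values?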

MWE

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from scipy import integrate


def int1(kernel, x1, y1):
    # Compute the point below which to integrate
    iso = kernel((x1, y1))

    # Sample KDE distribution
    sample = kernel.resample(size=50000)

    # Filter the sample
    insample = kernel(sample) < iso

    # The integral is equivalent to the probability of drawing a
    # point that gets through the filter
    integral = insample.sum() / float(insample.shape[0])

    return integral


def int2(kernel, x1, y1):

    def f_kde(x, y):
        return kernel((x, y))

    # 2D integration in: (-inf, x1), (-inf, y1).
    integral = integrate.nquad(f_kde, [[-np.inf, x1], [-np.inf, y1]])

    return integral


# Obtain data from file.
data = np.loadtxt('data.dat', unpack=True)
# Perform a kernel density estimate (KDE) on the data
kernel = stats.gaussian_kde(data)

# Define the threshold point that determines the integration limits.
x1, y1 = 2.5, 1.5

i2 = int2(kernel, x1, y1)
print(i2)

int1_vals = []
for _ in range(200):
    i = int1(kernel, x1, y1)
    int1_vals.append(i)
    print(i)

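Addendum

Note that this question originated from this answer. At first I did not notice that the answer was mistaken in the integration limits it used, which explains why the results of int1 and int2 differ.

int1 integrates over the domain f(x,y)<f(x1,y1) (where f is the kernel density estimate), while int2 integrates over the domain (x,y)<(x1,y1).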
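
To make the difference between the two domains concrete, here is a minimal sketch on synthetic data. The standard-normal sample, the seeds, the threshold (1.0, 1.0) and the helper name domain_fractions are all assumptions for illustration; they are not the original data or code. It draws one sample from the KDE and counts the points that satisfy each criterion.

```python
import numpy as np
from scipy import stats


def domain_fractions(kernel, x1, y1, size=5000, seed=0):
    """Fraction of KDE samples in each of the two domains (illustrative helper)."""
    sample = kernel.resample(size=size, seed=seed)  # shape (2, size)

    # Domain used by int1: density lower than at the threshold point.
    iso = kernel((x1, y1))
    below_iso = kernel(sample) < iso

    # Domain used by int2: x < x1 and y < y1.
    in_box = (sample[0] < x1) & (sample[1] < y1)

    return below_iso.mean(), in_box.mean()


# Synthetic stand-in for data.dat (assumption): 500 standard-normal points.
rng = np.random.default_rng(0)
kernel = stats.gaussian_kde(rng.normal(size=(2, 500)))

frac_iso, frac_box = domain_fractions(kernel, 1.0, 1.0)
print(frac_iso, frac_box)
```

The two fractions generally differ, and which one is larger depends on where the threshold point sits relative to the bulk of the distribution.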

【Question comments】:

Tags: python scipy montecarlo integral

【Solution 1】:

You resample the distribution

sample = kernel.resample(size=50000)

and then test whether the density at each sampled point is less than the density at the boundary point

insample = kernel(sample) < iso

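This is incorrect. Consider the boundary (0,100) and assume your data has mean u=(0,0) and cov=[[100,0],[0,100]]. The points (0,50) and (50,0) have the same density under this kernel, but only one of them lies within the boundary. The density test gives the same result for both, so it cannot select the integration region, and the estimate is biased.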
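
The counterexample can be checked numerically. The sketch below assumes a single Gaussian with the stated mean and covariance in place of a full KDE (scipy.stats.multivariate_normal is used purely for illustration) and evaluates both criteria at the two points. The density comparison cannot distinguish them, while the box test can.

```python
import numpy as np
from scipy import stats

# Single Gaussian standing in for the kernel (assumption for illustration).
dist = stats.multivariate_normal(mean=[0, 0], cov=[[100, 0], [0, 100]])

x1, y1 = 0, 100                       # boundary point from the example
points = np.array([[0, 50], [50, 0]])

densities = dist.pdf(points)          # equal, by symmetry
density_test = densities < dist.pdf([x1, y1])
box_test = (points[:, 0] <= x1) & (points[:, 1] <= y1)

print(densities)
print(density_test)  # same answer for both points
print(box_test)      # different answers for the two points
```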

Instead, you should test whether each point in sample lies within the boundary, and then compute the probability. Something like

def int1(kernel, x1, y1):
    # Sample KDE distribution
    sample = kernel.resample(size=100)

    include = (sample < np.repeat([[x1],[y1]],sample.shape[1],axis=1)).all(axis=0)
    integral = include.sum() / float(sample.shape[1])
    return integral

I tested this with the following code

# Reuses np, stats and int2 from the question's MWE
def measure(n):
    m1 = np.random.normal(size=n)
    m2 = np.random.normal(size=n)
    return m1, m2

a = stats.gaussian_kde(np.vstack(measure(1000)))
print(int1(a, -10, -10))
print(int2(a, -10, -10))
print(int1(a, 0, 0))
print(int2(a, 0, 0))

which yields

0.0
(4.304674927251112e-232, 4.6980863813551415e-230)
0.26
(0.25897626178338407, 1.4536217446381293e-08)
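
With only 100 samples, the hit-fraction estimate is fairly noisy. As a rough guide (a standard binomial argument, not something stated in the original answer), the standard error of a hit fraction p estimated from N independent samples is sqrt(p(1-p)/N). The helper name hit_fraction_stderr below is hypothetical.

```python
import math


def hit_fraction_stderr(p, n):
    """Binomial standard error of a hit fraction p estimated from n samples."""
    return math.sqrt(p * (1.0 - p) / n)


# Fraction taken from the output above: p ~ 0.26 with 100 samples.
err_100 = hit_fraction_stderr(0.26, 100)
# The same fraction with 50000 samples, the size used in the question.
err_50000 = hit_fraction_stderr(0.26, 50000)
print(err_100, err_50000)
```

By this estimate, the gap between 0.26 and the nquad value of about 0.259 is well within the sampling noise for N=100.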

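Monte Carlo integration should work like this

1. Sample N random values over some subset of the possible x/y values (uniformly, not from your distribution). Below, I restrict this to within 10 SD of the mean.
2. For each random value, compute kernel(rand_x, rand_y).
3. Compute the sum and multiply by (volume)/N_samples.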
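
Written as a formula, the three steps amount to the estimate below. The symbols are introduced here for illustration: f is the kernel density estimate, N the number of uniform samples (x_i, y_i), V the area of the sampling rectangle, and x_lo, y_lo the finite lower bounds that stand in for -infinity.

```latex
\int_{x_{lo}}^{x_1} \int_{y_{lo}}^{y_1} f(x, y)\, dy\, dx
  \;\approx\; \frac{V}{N} \sum_{i=1}^{N} f(x_i, y_i),
\qquad V = (x_1 - x_{lo})\,(y_1 - y_{lo})
```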

In code:

def mc_wo_sample(kernel, x1, y1, lboundx, lboundy):
    nsamples = 50000
    # Area of the sampling rectangle
    volume = (x1 - lboundx) * (y1 - lboundy)
    # Generate uniform points in the rectangle
    xrand = np.random.rand(nsamples, 1) * (x1 - lboundx) + lboundx
    yrand = np.random.rand(nsamples, 1) * (y1 - lboundy) + lboundy
    # Stack into shape (2, nsamples), as gaussian_kde expects
    randvals = np.hstack((xrand, yrand)).transpose()
    return (volume * kernel(randvals).sum()) / nsamples

Running the following

print(int1(a, -9, -9))
print(int2(a, -9, -9))
print(mc_wo_sample(a, -9, -9, -10, -10))
print(int1(a, 0, 0))
print(int2(a, 0, 0))
print(mc_wo_sample(a, 0, 0, -10, -10))

yields

0.0
(4.012958496109042e-70, 6.7211236076277e-71)
4.08538890986e-70
0.36
(0.37101621760650216, 1.4670898180664756e-08)
0.361614657674

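【Comments】:

I think so. The code above fails with: ValueError: operands could not be broadcast together with shapes (2,50000) (2,2). Did you test it? Can you get it to run?
Change sample.shape[0] to sample.shape[1]. That value should be the number of samples. I was translating your example using my own test code.
Thanks dfb. It runs now, but the MC method using your code gives ~0.06. That is very different from the nquad result of ~0.194. The MC method in my question gives a much closer value for the integral.
See my edit: we don't want to sum the probabilities, we just test whether each sample lands inside the box.
This answer not only solves my issue, it is also many times faster. Thank you very much! You should remove the iso = kernel((x1, y1)) line from the int1 function, since it is not used. Thanks again!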