如何在 python 中使用 MLE 拟合双指数分布？答案

【问题标题】：How to fit double exponential distribution using MLE in python?如何在 python 中使用 MLE 拟合双指数分布？
【发布时间】：2020-01-03 11:04:39
【问题描述】：

我正在尝试使用 MLE 拟合双指数（即两个指数或双指数的混合）数据。虽然没有此类问题的直接示例，但我发现了一些使用 MLE 进行线性 (Maximum Likelihood Estimate pseudocode)、sigmoidal (https://stats.stackexchange.com/questions/66199/maximum-likelihood-curve-model-fitting-in-python) 和正态 (Scipy MLE fit of a normal distribution) 分布拟合的提示。使用这些示例，我测试了以下代码：

import numpy as np
import matplotlib.pyplot as plt
from scipy import optimize
import scipy.stats as stats

size = 300

def simu_dt():
    ## simulate Exp2 data
    np.random.seed(0)
    ## generate random values between 0 to 1
    x = np.random.rand(size)
    data = []
    for n in x:
        if n < 0.6:
            # generating 1st exp data
            data.append(np.random.exponential(scale=20)) # t1
        else:
            # generating 2nd exp data
            data.append(np.random.exponential(scale=500)) # t2
    return np.array(data)

ydata2 = simu_dt() # call to generate simulated data
## trimming the data at the beginning and the end a bit
ydata2 = ydata2[np.where(2 < ydata2)]
ydata2 = ydata2[np.where(ydata2 < 3000)]

## creating the normalized log binned histogram data
bins = 10 ** np.linspace(np.log10(np.min(ydata2)), np.log10(np.max(ydata2)), 10)
counts, bin_edges = np.histogram(ydata2, bins=bins)
bin_centres = (bin_edges[:-1] + bin_edges[1:]) / 2
bin_width = (bin_edges[1:] - bin_edges[:-1])
counts = counts / bin_width / np.sum(counts)

## generating arbitrary x value
x1 = np.linspace(bin_centres.min(), bin_centres.max(), len(ydata2))

def MLE(params):
    """ find the max likelihood """
    a1, k1, k2, sd = params
    yPred = (1-a1)*k1*np.exp(-k1*x1) + a1*k2*np.exp(-k2*x1)
    negLL = -np.sum(stats.norm.pdf(ydata2, loc=yPred, scale=sd))
    return negLL

guess = np.array([0.4, 1/30, 1/320, 0.2])
bnds = ((0, 0.9), (1/200, 1/2), (1/1000, 1/100), (0, 1))
## best function used for global fitting

results = optimize.minimize(MLE, guess, method='SLSQP', bounds=bnds)

print(results)
A1, K1, K2, _ = results.x
y_fitted = (1-A1)*K1*np.exp(-K1*x1) + A1*K2*np.exp(-K2*x1)

## plot actual data
plt.plot(bin_centres, counts, 'ko', label=" actual data")
plt.xlabel("Dwell Times (s)")
plt.ylabel("Probability")

## plot fitted data on original data
plt.plot(x1, y_fitted, c='r', linestyle='dashed', label="fit")
plt.legend()
plt.xscale('log')
plt.yscale('log')

plt.show()

拟合总结显示：

Out:
 fun: -1.7494005752178573e-16
     jac: array([-3.24161825e-18,  0.00000000e+00,  4.07105635e-16, -6.38053319e-14])
 message: 'Optimization terminated successfully.'
    nfev: 6
     nit: 1
    njev: 1
  status: 0
 success: True
       x: array([0.4       , 0.03333333, 0.003125  , 0.2       ])

This is the plot showing the fit。虽然拟合似乎有效，但结果返回了我提供的猜测！此外，如果我改变猜测，拟合也会发生变化，这意味着它可能根本不会收敛。我不确定我做错了什么。只是说我也不是 Python 和数学方面的专家。因此，非常感谢任何帮助。提前致谢。

【问题讨论】：

有些事情我不确定你是否如愿以偿。最重要的是，当您说“双指数”分布时，您的意思是什么？这？en.wikipedia.org/wiki/Laplace_distribution 在这里，您正在生成比率为 0.6 到 0.4 的两个分布的加权和
嗨辛迪，感谢您的评论。 “双指数”是指我的实际数据混合了双指数分布。例如，在我的代码中，我尝试用 20 和 500（单位）的值模拟两个指数，它们的贡献应该等于 1（0.4+0.6）。我回答你的问题了吗？
是的，谢谢。你为什么要计算像 negLL = -np.sum(stats.norm.pdf(ydata2, loc=yPred, scale=sd)) 这样的负对数似然？
如果我听到“双指数”，我也会考虑exp(exp(x))，所以，感谢您的澄清。像这样配合的结果通常非常困难。我建议取数据的对数，然后在 x 值的相应范围内执行两次线性拟合。
@nostradamus：我已经对其进行了相应的编辑，谢谢。实际上，人们已经用 matlab 和其他编程语言（已发表论文中的 Igor pro 等）完成了它，但不知道代码或它是如何工作的。感谢您对日志数据的线性拟合提出建议。我可以尝试，但我认为我需要预先定义 x 范围，对吗？

标签： python mle exponential-distribution

【解决方案1】：

有几个地方我会说你犯了错误。例如，您直接传递 x1（等距 x 值）而不是 ydata2。然后，您使用了不合适的negativeLL，因为您应该在某些参数的假设下对自己的概率取负对数。因此，您的第四个参数是不必要的。您的功能应该是：

def MLE(params):
    """ find the max likelihood """
    a1, k1, k2 = params
    yPred = (1-a1)*k1*np.exp(-k1*ydata2) + a1*k2*np.exp(-k2*ydata2)
    negLL = -np.sum(np.log(yPred))
    return negLL

由于数值原因（严重缩放），代码仍然无法收敛，一些线性化建议可能会有所帮助。您可以轻松做的是将优化方法更改为例如L-BFGS-B 和该方法应该正确收敛。

完整代码：

import numpy as np
import matplotlib.pyplot as plt
from scipy import optimize
import scipy.stats as stats

size = 300000
nbins = 30

def simu_dt():
    ## simulate Exp2 data
    np.random.seed(20)
    ## generate random values between 0 to 1
    x = np.random.rand(size)
    data = []
    for n in x:
        if n < 0.6:
            # generating 1st exp data
            data.append(np.random.exponential(scale=20)) # t1
        else:
            # generating 2nd exp data
            data.append(np.random.exponential(scale=500)) # t2
    return np.array(data)

ydata2 = simu_dt() # call to generate simulated data
## trimming the data at the beginning and the end a bit
ydata2 = ydata2[np.where(2 < ydata2)]
ydata2 = ydata2[np.where(ydata2 < 3000)]

## creating the normalized log binned histogram data
bins = 10 ** np.linspace(np.log10(np.min(ydata2)), np.log10(np.max(ydata2)), nbins)
counts, bin_edges = np.histogram(ydata2, bins=bins)
bin_centres = (bin_edges[:-1] + bin_edges[1:]) / 2
bin_width = (bin_edges[1:] - bin_edges[:-1])
counts = counts / bin_width / np.sum(counts)

## generating arbitrary x value
x1 = np.linspace(bin_centres.min(), bin_centres.max(), len(ydata2))

def MLE(params):
    """ find the max likelihood """
    k1, k2 = params
    yPred = 0.6*k1*np.exp(-k1*ydata2) + 0.4*k2*np.exp(-k2*ydata2)
    negLL = -np.sum(np.log(yPred))
    return negLL

guess = np.array([1/30, 1/200])
bnds = ((1/100, 1/2), (1/1000, 1/100))
## best function used for global fitting

results = optimize.minimize(MLE, guess, bounds=bnds)

print(results)
K1, K2 = results.x
y_fitted = 0.6*K1*np.exp(-K1*x1) + 0.4*K2*np.exp(-K2*x1)

## plot actual data
plt.plot(bin_centres, counts, 'ko', label=" actual data")
plt.xlabel("Dwell Times (s)")
plt.ylabel("Probability")

## plot fitted data on original data
plt.plot(x1, y_fitted, c='r', linestyle='dashed', label="fit")
plt.legend()
plt.xscale('log')
plt.yscale('log')

plt.show()

【讨论】：

你能解释一下为什么你使用“np.sum(np.log(-np.log(yPred)))”而不是“-np.sum(np.log(yPred)) “？我的意思是为什么您在完整代码中使用了两次日志？否则，使用 -np.sum(np.log(yPred))，你的建议对我来说似乎很好。非常感谢您的帮助。
抱歉，这是一个测试版本，用于处理一些缩放问题 - 处理低概率值如何支配优化成本函数。我们可以在另一个线程上与很棒的人讨论这个问题，但上面的图像是使用 -np.sum(np.log(yPred)) 创建的。我编辑了我的答案:)